The internet serves as a vast repository of information, documenting the collective knowledge and activities of the modern world. However, much of this content is ephemeral. According to a recent analysis by the Pew Research Center, 38% of web pages that existed in 2013 are no longer accessible a decade later, highlighting the issue of “digital decay.”
The analysis used data from Common Crawl, a nonprofit web archive that is separate from the Internet Archive, drawing a sample of just under one million pages captured between 2013 and 2023. Additionally, a real-time sample of tweets collected from March to April 2023 was monitored for three months to assess their availability over time.
Disappearance of Web Pages
As of October 2023, about 25% of all web pages that existed between 2013 and 2023 are no longer accessible. This includes pages that have been deleted or removed, even if the website itself is still operational. The situation is more pronounced for older content, with 38% of pages from 2013 now inaccessible compared to only 8% of those from 2023.
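To make the idea of an "inaccessible" page concrete, the check can be sketched as a rule applied to each page's HTTP response. The following is a minimal illustration, not the method used in the analysis: the set of status codes, the treatment of a missing response, and the function names are all assumptions chosen for the example.

```python
# Assumption: a page counts as inaccessible when the server does not respond
# at all, or responds with one of these codes. This rule is illustrative and
# is not taken from the Pew analysis.
INACCESSIBLE_STATUSES = {404, 410}  # Not Found, Gone


def is_inaccessible(status):
    """Return True if a page should be counted as inaccessible.

    `status` is an integer HTTP status code, or None when no response
    was received (for example, the domain no longer resolves).
    """
    if status is None:
        return True
    return status in INACCESSIBLE_STATUSES


def inaccessible_share(statuses):
    """Return the fraction of sampled pages that are inaccessible."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_inaccessible(s) for s in statuses) / len(statuses)


# Hypothetical sample of four pages: two respond normally, one returns 404,
# and one does not respond at all.
sample_statuses = [200, 404, 200, None]
share = inaccessible_share(sample_statuses)
```

A stricter or looser rule (for example, also counting server errors) would change the resulting percentages, which is why any such figure should be read together with its definition of "inaccessible."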
The analysis examined the prevalence of broken links across various online spaces, including government sites, news websites, Wikipedia, and social media:
- Government Sites: Approximately 21% of government web pages contain at least one non-functional link, and local government sites are particularly vulnerable. Of the 42 million links on the 500,000 sampled pages, 6% point to inaccessible content.
- News Websites: Similarly, 23% of news web pages have at least one broken link. This trend is consistent across both high-traffic and low-traffic news sites. The 500,000 news pages sampled contained over 14 million links, of which 5% were found to be non-functional.
- Wikipedia: More than half (54%) of English Wikipedia pages have at least one reference link pointing to a non-existent page. A sample of 50,000 English Wikipedia pages revealed that 11% of reference links are no longer accessible. Additionally, 2% of these pages had all their reference links broken.
- Social Media: On the social media platform X (formerly Twitter), a sample of nearly 5 million tweets showed that 18% were no longer publicly visible within a few months of being posted. This disappearance often results from accounts being made private, suspended, or deleted. Tweets in Turkish or Arabic were more prone to disappear. Most removals occurred within days to weeks of posting.
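The figures above mix two different measures: the share of pages with at least one broken link, and the share of all links that are broken. The first is usually much larger, because a single dead link is enough to count a whole page. The sketch below shows how both could be computed from the same data. It is a simplified illustration with hypothetical names and made-up sample values, not the procedure used in the analysis.

```python
def broken_link_rates(pages):
    """Compute link-level and page-level broken-link rates.

    `pages` is a list of pages; each page is a list of booleans,
    one per link, where True means the link is broken.
    """
    total_links = sum(len(links) for links in pages)
    broken_links = sum(sum(links) for links in pages)
    pages_with_broken = sum(1 for links in pages if any(links))
    # A page with no links at all is not counted as "all broken".
    pages_all_broken = sum(1 for links in pages if links and all(links))
    page_count = len(pages)
    return {
        "link_rate": broken_links / total_links if total_links else 0.0,
        "page_rate": pages_with_broken / page_count if page_count else 0.0,
        "all_broken_rate": pages_all_broken / page_count if page_count else 0.0,
    }


# Made-up sample: four pages with ten links each, one broken link in total.
sample_pages = [
    [False] * 10,
    [False] * 9 + [True],
    [False] * 10,
    [False] * 10,
]
rates = broken_link_rates(sample_pages)
# Here 1 of 40 links is broken (2.5%), yet 1 of 4 pages (25%)
# contains a broken link.
```

This gap between the two measures is consistent with the pattern reported for government and news sites, where roughly a fifth of pages are affected even though only about one link in twenty is broken.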
Implications of Digital Decay
The findings from the Pew Research Center highlight the transient nature of online content. Because a significant portion of web pages, links, and social media posts vanish over time, the internet cannot be assumed to be a permanent repository of information. This digital decay affects a wide range of online spaces, from government and news sites to information sources like Wikipedia and social media platforms. Consequently, it raises important questions about digital information preservation and the reliability of the internet as a long-term archive of human knowledge.
This underscores the importance of online content archiving. Initiatives like the Internet Archive provide valuable resources for researchers seeking access to deleted sites. Additionally, projects with a long-term outlook, such as the Eternal Access Project, store digital copies of threatened media to ensure their preservation.
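For readers who want to check whether a vanished page has an archived copy, the Internet Archive offers a public Wayback Machine availability endpoint. The sketch below shows one way to query it. The endpoint and response fields reflect the publicly documented behavior as understood here and should be treated as assumptions; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Wayback Machine availability endpoint accepts `url` and an
# optional `timestamp` (YYYYMMDD) query parameter and returns JSON.
WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL for a page, optionally near a given date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Extract the archived snapshot URL from a decoded JSON response.

    Returns None when the response lists no available snapshot.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return a snapshot URL or None."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

Calling lookup("example.com", "20130101") would request the snapshot closest to January 1, 2013. A result of None means only that no snapshot was reported, not that the page never existed.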