Guide to Internet Archiving

At Snopes, archiving web links is key to our fact-checking practice. And thanks to the many archival resources on the Internet, this practice has become easier than ever. Keeping records of the Internet is essential not only for understanding web history, but also for knowing whether a tweet has been deleted or whether someone has changed a statement on a web page.

But this is not unique to our roles as fact checkers. Governments also maintain archives of each administration’s websites, for the sake of transparency and public access. Former US President Donald Trump’s White House website is trumpwhitehouse.archives.gov, while Barack Obama’s White House website is at obamawhitehouse.archives.gov. And the Clinton administration created the first White House website in 1994. These sites are labeled as “historical, ‘frozen in time’ material.” Certain Federal sites are “harvested” and saved by the Federal Depository Library Program Web Archive, which aims to “provide permanent public access to Federal Agency Web content.”

Estimates of the average lifespan of a web page vary over time. In 1997 Scientific American estimated it to be 44 days, and The New Yorker in 2015 suggested it could be 100 days. But some web pages can be taken down within hours, especially if they are politically sensitive in nature.

In 2014, when Malaysia Airlines Flight 17 was shot down over Ukrainian airspace, a Ukrainian separatist leader, Igor Girkin, also known as Strelkov, reportedly wrote: “We just shot down a plane, an AN-26.” While an AN-26 is a Soviet-built military cargo plane, the photographs on the post appeared to be those of a Boeing 777. The Wayback Machine recorded the post, which was deleted from Strelkov’s page just hours later. By the time a reporter tweeted a photo of the recorded webpage, writing, “Grabbing Donetsk militant Strelkov’s claim to have shot down what appears to have been MH17,” Strelkov’s page had been edited and the assertion deleted. The only proof of this message was the screenshot saved on archive.org. Although the post may have been misleading, the incident exposed the Internet Archive’s role in collecting receipts that became useful for journalistic investigations.

The Internet Archive (archive.org) is considered one of the largest archives of its kind on the Internet, with approximately 625 billion web pages saved since its inception in 1996. Its Wayback Machine allows users to browse 25 years of history of the web, and the organization partners with the Federal Depository Library Program and other organizations through Archive-It.

The Internet Archive is not the only online archive. Others include archive.today, perma.cc, the UK Web Archive (specific to UK sites and a collaboration with UK legal deposit libraries) and Time Travel. Wikipedia also has a long list of international archiving efforts.

How to archive a webpage

The easiest site to start with is archive.org. Enter a link in the Wayback Machine and click “Browse History” to see whether the page has already been archived. Below that is another option, “Save page now,” which creates a new archived link.
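For readers who want to script these two steps, here is a minimal Python sketch. It is not part of Snopes’ workflow as described above. It assumes the Wayback Machine’s public availability endpoint (archive.org/wayback/available) and the web.archive.org/save/ address behave as we understand them; the function names are our own, and the response shape handled in closest_snapshot is an assumption to verify against the Internet Archive’s documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoints; confirm against the Internet Archive's own documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_PREFIX = "https://web.archive.org/save/"


def availability_url(page_url, timestamp=None):
    """Build the query URL that asks whether a capture of page_url exists."""
    params = {"url": page_url}
    if timestamp:
        # Optional YYYYMMDD[hhmmss] hint: the closest capture is returned.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a decoded JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def save_url(page_url):
    """Build the address that requests a fresh capture of page_url."""
    return SAVE_PREFIX + page_url


def lookup(page_url):
    """Network call: ask the archive for the closest capture of page_url."""
    with urlopen(availability_url(page_url), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Calling lookup("https://www.snopes.com/") would return the address of an existing capture if one is found, or None, in which case opening the address from save_url in a browser is the scripted counterpart of “Save page now.”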

If you want to browse the history of a webpage, you will be taken to all past instances where it has been archived, organized like a calendar, down to the month, day and time of recording. You can click on a date (indicated by a blue bubble) to go to the archived webpage. The larger the bubble, the more times the page was archived that day. Note that a green link indicates that a webpage has been redirected and may not work, so users should click on blue links.

The top of the search results page also tells users how many times a web page has been archived and the date range. The top bar shows the years the pages were saved while the calendar below allows us to click on the month, day and time.
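The date and time shown in the calendar are also embedded in the address of each archived page, as a 14-digit stamp. The sketch below reads that stamp back out of an archived link. The pattern and the function name parse_capture are our own, and treating the stamp as UTC is an assumption rather than something stated in this guide.

```python
import re
from datetime import datetime, timezone

# Matches https://web.archive.org/web/<14 digits>[optional flag]/<original URL>
WAYBACK_PATTERN = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$"
)


def parse_capture(archived_url):
    """Return (capture time, original URL) for a Wayback link, or None."""
    match = WAYBACK_PATTERN.match(archived_url)
    if match is None:
        return None
    stamp, original = match.groups()
    # The stamp is laid out as YYYYMMDDhhmmss; UTC is assumed here.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return captured, original
```

Applied to the Scientific American link in the source list, parse_capture returns a capture time of May 4, 1997, at 21:21:57, along with the original sciam.com address.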

Archive.org also has an extensive collection of books that we have often relied on in our research.

On archive.today, you can likewise check whether a link has already been archived and archive one yourself.

How do we know that archived pages are not manipulated?

While people have taken screenshots of web pages and tweets in the past, manipulating simple images is easier than editing an already archived web page. According to the Social Science Research Council (SSRC):

Also, screenshots are static. There can be no interaction with the page – no scrolling, no hovering, no clicking on links, or even revealing which web pages the page’s links refer to.

Web archives, on the other hand, save the entire content of a web page, including its HTML source code and embedded images, style sheets, or JavaScript source. While reading, the user can interact with the archived page, including clicking links to find out what the web page was connected to. Additionally, public web archives are created and stored by independent archival organizations, such as the Internet Archive. We are confident that the contents of these public web archives have not been altered or maliciously manipulated.

However, archived links aren’t perfect and come with a range of possible issues, according to the SSRC:

Although web archives provide a valuable service, they are not perfect, and archiving a web page is very different from archiving a physical object or even a static file such as a PDF. Web pages have become increasingly complex over the years, with many loading hundreds or even thousands of images, style sheets and JavaScript resources, which can include advertisements and trackers. These JavaScript resources are executed by web browsers and many of their interactions cannot be captured by all web archives. The embedded and linked nature of HTML makes it difficult to directly replay archived web pages, so web archives have to make some limited transformations to the original web page. This includes rewriting the links and locations of embedded resources so that they are loaded from the archive instead of the live web. This prevents someone from viewing a webpage captured in 2012, for example, and seeing an ad from 2018 embedded in that webpage from 2012.
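The link rewriting the SSRC describes can be illustrated with a short sketch. This is not the Internet Archive’s actual implementation; it only shows the idea, using a hypothetical helper named rewrite_resource and the web.archive.org/web/ address layout seen in archived links.

```python
from urllib.parse import urljoin

# Address layout seen in Wayback links: prefix + capture stamp + original URL.
ARCHIVE_PREFIX = "https://web.archive.org/web/"


def rewrite_resource(timestamp, page_url, resource_ref):
    """Point an embedded resource at the archive instead of the live web.

    timestamp    -- 14-digit capture stamp of the archived page
    page_url     -- original address of the page being replayed
    resource_ref -- link found in the page, relative or absolute
    """
    # Resolve relative references against the page's original address first.
    absolute = urljoin(page_url, resource_ref)
    return f"{ARCHIVE_PREFIX}{timestamp}/{absolute}"
```

With a 2012 stamp, an image referenced as ../img/ad.png on http://example.com/news/story.html would be requested from the archive’s 2012 copy of http://example.com/img/ad.png rather than from whatever the live site serves today.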

Despite the imperfections of online archive resources, here at Snopes we’ve relied on them for many fact checks, including those on the Twitter history of public figures like Raphael Warnock, old magazine quotes, and more.

Sources:

“Presidential White House Archived Websites.” National Archives, January 9, 2017, https://www.archives.gov/presidential-libraries/archived-websites. Accessed November 10, 2022.

“Archive.Ph.” https://archive.ph/. Accessed November 10, 2022.

Emery, David. “Is this military manual ‘Mayonnaise Safety’ real?” Snopes, August 8, 2022, https://www.snopes.com/fact-check/mayonnaise-safety-military-handbook/. Accessed November 10, 2022.

Evon, Dan. “Did Trump write ‘Never Admit Defeat’ in ‘Art of the Deal’?” Snopes, November 10, 2020, https://www.snopes.com/fact-check/trump-art-of-the-deal/. Accessed November 10, 2022.

“Federal Depository Library Program Web Archive.” Archive-It. https://archive-it.org/home/FDLPwebarchive?fc=meta_Creator%3AU.S.+Department+of+Health+and+Human+Services. Accessed November 10, 2022.

“How Web Archivists and Other Digital Sleuths Uncover the Mystery of MH17.” Washington Post. www.washingtonpost.com, https://www.washingtonpost.com/news/the-intersect/wp/2014/07/21/how-web-archivists-and-other-digital-sleuths-are-unraveling-the-mystery-of-mh17/. Accessed November 10, 2022.

“Internet Archive: About IA.” https://archive.org/about/. Accessed November 10, 2022.

“Internet Archive: Wayback Machine.” https://archive.org/web/. Accessed November 10, 2022.

Lepore, Jill. “What the Web Said Yesterday.” The New Yorker, January 19, 2015. www.newyorker.com, https://www.newyorker.com/magazine/2015/01/26/cobweb. Accessed November 10, 2022.

Liles, Jordan. “Did Raphael Warnock tweet about ‘the meaning of Easter’?” Snopes, April 18, 2022, https://www.snopes.com/fact-check/warnock-easter-tweet/. Accessed November 10, 2022.

Liles, Jordan. “‘Handmaid’s Tale’ tweet deleted from CNN host Brian Stelter’s Twitter account.” Snopes, September 2, 2021, https://www.snopes.com/fact-check/brian-stelter-handmaids-tale-cnn/. Accessed November 10, 2022.

“List of Web Archiving Initiatives.” Wikipedia, November 7, 2022. https://en.wikipedia.org/w/index.php?title=List_of_Web_archiving_initiatives&oldid=1120507741. Accessed November 10, 2022.

MacGuill, Dan. “Did Wired magazine publish ‘frighteningly accurate’ predictions of the 21st century in 1997?” Snopes, November 27, 2021, https://www.snopes.com/fact-check/wired-1997-predictions/. Accessed November 10, 2022.

“On the Importance of Web Archiving.” Articles, https://items.ssrc.org/parameters/on-the-importance-of-web-archiving/. Accessed November 10, 2022.

“Preserving the Internet.” Scientific American: Article—Special Report, 1997, https://web.archive.org/web/19970504212157/https://www.sciam.com/0397issue/0397kahle.html. Accessed November 10, 2022.

“The White House.” Whitehouse.Gov, March 12, 2015, https://obamawhitehouse.archives.gov/homepage. Accessed November 10, 2022.

“The White House.” Whitehouse.Gov, https://trumpwhitehouse.archives.gov/. Accessed November 10, 2022.

“Time travel.” https://timetravel.mementoweb.org/. Accessed November 10, 2022.

“UKWA Home.” https://www.webarchive.org.uk/ukwa/. Accessed November 10, 2022.

“Web evidence indicates pro-Russian rebels shot down MH17.” Christian Science Monitor, July 17, 2014. Christian Science Monitor, https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17. Accessed November 10, 2022.

“Websites change. Permalinks don’t.” Perma, https://perma.cc. Accessed November 10, 2022.
