The purpose of a web archive is to capture the content of a live website, preserve it, and make it available for later use. The Internet Archive, for example, notes that the point of a web archive is to attempt to “recreate the same experience a user would have had if they had visited the live site on the day it was archived” (LaCalle 2015). Though some attempts have been made to record a site using methods such as screenshots, screen recordings, and PDFs of web pages, these are not web archives in the strictest sense (Brügger 2018). A web archive captures the content of a site as well as the underlying code (HTML and CSS) and any embedded content, and makes it available in a WARC file. WARC is the Library of Congress’s preferred archival file format for websites (Library of Congress 2018). The use of an archival format helps ensure that sites captured in this manner will be available for later use. Programs called web crawlers (sometimes called robots or spiders) capture the content of a website and save it as a WARC file. These crawlers are predominantly open source, which means that they are free to use and extensible. It also means, however, that there is no guarantee of support services should something go wrong. They also often need additional software, such as an interface, to be usable, and operating them requires a great deal of training and knowledge. As a result, many organizations choose subscription services that build on open source software but provide the needed interfaces and support. NYARC, like many other cultural institutions, chose Archive-It, from the Internet Archive, as its web archiving management tool.
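To give a rough sense of what a WARC file holds, the sketch below wraps one raw HTTP response in a single WARC/1.0 “response” record using only the Python standard library. The function name, the example URL, and the sample response are hypothetical and chosen for illustration; real crawlers write many such records (plus request and metadata records) and usually compress them, so this is a simplified picture rather than how any particular crawler works.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes,
                               captured_at: datetime) -> bytes:
    """Wrap a raw HTTP response in one WARC/1.0 'response' record.

    Hypothetical helper for illustration only.
    """
    # WARC dates are written in UTC, e.g. 2019-05-01T12:00:00Z.
    warc_date = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header lines end with CRLF; a blank line separates them from the payload.
    header_block = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is closed by two CRLFs after the payload.
    return header_block + http_response + b"\r\n\r\n"


# Assumed sample data: a tiny HTML page served from a placeholder address.
example_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
record = build_warc_response_record(
    "http://example.org/",
    example_response,
    datetime(2019, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

Because the record keeps the full HTTP response, including its headers, alongside the capture date and original URL, replay tools have what they need to serve the page back as it was delivered.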
The Archive-It service allows institutions to build, manage, and preserve collections and provides a public-facing interface that is full-text searchable. The Archive-It web application, or administrative portal, allows institutions and their staff to schedule crawls as well as review and browse their collections. It also allows for oversight of the crawler, including stipulating the seed URLs (the starting points for a crawl) and setting scoping rules, which delimit what is captured. After a crawl, Archive-It creates digital captures of websites, replicating the content and design of the original live URL. The archived versions of websites are accessible to the public via Archive-It partner websites. Separate versions, or “historical instances,” of updated websites are also accessible. For example, if a gallery hosts monthly exhibitions, then a monthly capture of the website’s new iteration is made and added to the archive. As with the Wayback Machine, it is possible to browse a website through time and access past versions.
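The idea of a seed URL plus scoping rules can be pictured with a small sketch. The function below is a hypothetical illustration, not Archive-It’s actual scoping logic: it assumes a URL is in scope when it shares the seed’s host, sits under the seed’s path, and contains none of a list of blocked substrings. The host names and the `blocked_substrings` parameter are invented for the example.

```python
from urllib.parse import urlparse


def in_scope(url: str, seed: str, blocked_substrings=()) -> bool:
    """Return True if `url` falls inside the scope implied by `seed`.

    Hypothetical rule set for illustration; real services offer
    richer and differently defined scoping options.
    """
    seed_parts = urlparse(seed)
    url_parts = urlparse(url)
    # Only web URLs can be crawled; skip mailto:, javascript:, etc.
    if url_parts.scheme not in ("http", "https"):
        return False
    # Stay on the same host as the seed.
    if url_parts.hostname != seed_parts.hostname:
        return False
    # Stay underneath the seed's path (treat an empty path as "/").
    if not (url_parts.path or "/").startswith(seed_parts.path or "/"):
        return False
    # Exclude anything matching a blocked substring, such as a calendar trap.
    return not any(fragment in url for fragment in blocked_substrings)


seed = "https://museum.example/exhibitions/"
candidates = [
    "https://museum.example/exhibitions/2019/spring",
    "https://museum.example/shop/",
    "https://other.example/exhibitions/",
    "https://museum.example/exhibitions/calendar?month=1",
]
for candidate in candidates:
    print(candidate, in_scope(candidate, seed, blocked_substrings=("calendar",)))
```

Tightening or loosening rules like these is what changes how much of a site, and how much data, a crawl collects.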
While Archive-It is extremely helpful in managing a collection, web archiving itself is hardly a flawless process. Often, desired content is not captured by initial crawls or fully scoped during a test crawl. As a result, the seed scope may require adjustment, and images or dynamic content may need to be added. Since capturing valuable content is essential to (re)creating the experience of the original site, a quality assurance (QA) process is necessary to maintain the overall health of a web archive. QAing a web archive involves systematically clicking through each URL within a website to ensure that it is functioning properly. If part of the website is missing after the original capture, the missed URLs are “patched in,” meaning that a supplementary patch crawl is run to capture the targeted content. Content from a QA crawl will be available within 24 hours after it is completed and will be fully searchable within 48 hours.
Much of the NYARC fellows’ time is dedicated to QAing various websites from the NYARC Archive-It Collection. These include NYARC institutional partner websites such as those of the Brooklyn Museum, the Frick Collection, and MoMA. The museum websites each have different designs, ranging from traditional layouts (e.g., the Frick) to more dynamic ones (e.g., MoMA). QAing museum websites, in general, requires a great deal of oversight owing to the size of the websites and the numerous images in their digital collections. As a result, it is a delicate (and sometimes frustrating) process. A compounding factor in doing this work is the reliance of most museum websites on images. When images are missing after a crawl, they are often missing in large quantities. The difficulty of capturing images stems, in part, from how they are stored and accessed. Digital collection images are not simply embedded in a site (i.e., linked through an href or src attribute in the HTML). Rather, they are stored in a Content Management System (CMS) or database and pulled into the website with a query. This, in turn, requires careful QAing because crawling a database (1) may not be desirable and (2) could put a strain on a web archive’s data limit. Because of these limitations, QAing requires deciding which pictures are absolutely necessary to the integrity of the website (e.g., are they on the homepage or the first page of a collection?).
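For the simpler case, where images are linked directly through a src attribute, part of this checking can be pictured in code. The sketch below is a hypothetical illustration and not a feature of Archive-It: it assumes the page’s HTML and a set of already captured URLs are at hand, and it lists the image URLs that still need to be patched in. The page markup, the host name, and the function names are invented for the example. Images that a page pulls from a CMS or database through a script would not show up in this kind of scan, which is part of why they are harder to capture.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the absolute URL of every <img src="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own URL.
                self.image_urls.append(urljoin(self.base_url, src))


def missing_images(page_html: str, base_url: str, captured_urls: set) -> list:
    """Return image URLs referenced by the page but absent from the capture."""
    collector = ImageSrcCollector(base_url)
    collector.feed(page_html)
    return [url for url in collector.image_urls if url not in captured_urls]


# Assumed sample data: a collection page with two images, one already captured.
page = (
    '<h1>Paintings</h1>'
    '<img src="/images/vermeer.jpg">'
    '<img src="/images/rembrandt.jpg">'
)
captured = {"https://museum.example/images/vermeer.jpg"}
to_patch = missing_images(page, "https://museum.example/collections/paintings", captured)
print(to_patch)
```

A list like this only identifies candidates; deciding which of them are worth the data budget remains a judgment call for the person doing QA.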
For the Frick Collection site, which was QAed during the Spring semester, these questions were certainly at play. For instance, many images were missing from the “Collections” pages, including the “Painting,” “Sculpture,” “Paper,” “Furniture,” and “Coins” pages. Before patching, broken image links and overlapping labels were visible on all the archived site pages (pictured). Patching in these images took several weeks, partly because Archive-It underwent a server update early in the semester (which caused downtime) and partly because there were several hundred images to repair. With these missing images included, the archived site looks much closer to the live site (pictured).
Another tool that can be used both to create web archives and to QA them is Webrecorder, a cloud-based web archiving service developed by Rhizome, a non-profit operated out of the New Museum and dedicated to the preservation of born-digital art. Like NYARC, Rhizome received support from the Mellon Foundation to develop its web archiving program. Owing to Rhizome’s interest in Net Art, Webrecorder is designed especially to capture dynamic content, including interactive websites that change as the user accesses and interacts with them. Unlike Archive-It, which sets crawlers to obtain large swaths of content from a website, Webrecorder has the web archiver view the website in question through a proxy and “records” the interactions between the user and the website, capturing content as the site responds to requests and creating a WARC file. While Webrecorder is excellent at capturing dynamic content, it does require a time-consuming process in which the user must interact with every function of the website and click on each individual URL in order to archive the website thoroughly. This hands-on approach, however, also makes Webrecorder extremely user friendly; not only is this open source service free, it is intuitive and accessible simply via a web browser.
A valuable feature of Webrecorder is its autoscroll function, which captures content that loads only as a user scrolls through a page. Social media pages are set up to require the user to interact with content on timelines by scrolling; a traditional crawler will not capture such content. Webrecorder’s autoscroll function sets the browser to scroll automatically to the bottom of a page, generating and capturing content as it goes. This tool was helpful when the Brooklyn Museum contacted NYARC to request that we archive five social media accounts on both Twitter and Facebook that they were preparing to retire and replace (pictured).
In general, web archiving is an iterative process in which regularly scheduled crawls are followed by QA. QAing is time-sensitive work and, ideally, should take place soon after a scheduled crawl. This is because patch crawls capture missing content from the live site. If the site changes in the meantime, that content may be lost forever. While QAing and patch crawling can be time-consuming and tedious work, they are an essential part of web archiving. As with content appraisal and preservation, QAing contributes to the sustainability of web archives and ensures that content will be available to the public, and to researchers, now and in the future.
Brügger, Niels. (2018). The Archived Web: Doing History in the Digital Age. Cambridge, MA: MIT Press.
LaCalle, Maria. (2015). Archive-It: Archiving and Preserving Web Content [Presentation Slides]. Preserving and Archiving Special Interest Group (PASIG), San Diego, CA. Retrieved from http://web.stanford.edu/group/dlss/pasig/PASIG_March2015/20150313_Presentations/LaCalle_Archive-It.pdf
Library of Congress. (2018). WARC, Web ARChive file format. Sustainability of Digital Formats: Planning for Library of Congress Collections. Retrieved from https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.