During our time at NYARC, we have encountered new technologies specific to capturing the web for archival purposes. Since the first efforts to index and capture the web in the 1990s, technologies have greatly improved. This has occurred, in large part, as a result of extending open source software and of the recognition that capturing the ephemeral web is a necessity for preserving web-based cultural heritage material. Even with technological advances, however, capturing the web with the tools currently available is an imperfect process that requires a great deal of curation, assessment, and quality assurance (QA). The glossary below describes some of the tools that we at NYARC have directly encountered or have learned a great deal about.
Archive-It: a subscription-based web archiving service created by the Internet Archive. This service allows institutions to build, manage, and preserve collections and provides a public-facing interface for browsing. The Archive-It web application, or administrative portal, allows institutions to schedule crawls and to review and browse their collections.
Archives Unleashed Project: a digital humanities initiative aimed at facilitating scholarly use of web archives. Through the development of an open source data analytics toolkit specifically for web archives, it offers the ability to work with web archival data on a large scale. The project provides scholars with useful derivatives, including Gephi network visualizations and raw text data pulled from the collection.
Brozzler: an open-source web crawler that uses a browser (Chrome or Chromium) to capture web pages and embedded URLs. Current dependencies include Python 3.4 or later, RethinkDB, and a Chrome browser. Note: there is no graphical user interface (GUI) for this software and, depending on institutional needs, it may require some custom-built software.
Crawler: a program (also called a robot or spider) that indexes or ‘captures’ material on the live web. Parameters, such as seed URLs and scoping rules, can be given to a crawler. Web archiving crawlers make copies and store information as they move from one site to the next. Most crawlers store archived material in the WARC file format.
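As a minimal sketch of how seeds and scoping rules steer a crawler, the Python example below walks a small in-memory link graph instead of the live web. The names `TOY_WEB`, `in_scope`, and `crawl` are hypothetical, the same-host rule is only one possible scoping rule, and none of this reflects how Heritrix or Brozzler is actually implemented.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the live web: each URL maps to the links on that page.
TOY_WEB = {
    "https://example.org/": ["https://example.org/about", "https://other.example.com/"],
    "https://example.org/about": ["https://example.org/", "https://example.org/staff"],
    "https://example.org/staff": [],
    "https://other.example.com/": ["https://other.example.com/page"],
    "https://other.example.com/page": [],
}


def in_scope(url, seed_hosts):
    """Assumed scoping rule: only follow URLs on the same host as a seed."""
    return urlparse(url).hostname in seed_hosts


def crawl(seeds, web, max_documents=100):
    """Breadth-first crawl from the seed URLs; returns URLs in capture order."""
    seed_hosts = {urlparse(seed).hostname for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue and len(captured) < max_documents:
        url = queue.popleft()
        if url not in web:
            continue  # nothing to capture, e.g. a dead link
        captured.append(url)  # a real crawler would store the response here
        for link in web[url]:
            if link not in seen and in_scope(link, seed_hosts):
                seen.add(link)
                queue.append(link)
    return captured


captured = crawl(["https://example.org/"], TOY_WEB)
```

With the single seed `https://example.org/`, the crawl captures the three pages on that host and never follows the link to `other.example.com`, because that host is out of scope.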
Crawler Trap: a set of web pages that, intentionally or unintentionally, cause a crawler to make an infinite number of requests for “new documents”. This infinite looping produces an abundance of unneeded data, which may exceed data limits if using a subscription service such as Archive-It. Scoping rules can be added to crawls to avoid crawler traps, including rules that exclude calendars and certain types of dynamic content.
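The sketch below illustrates, in Python, the kind of check a scoping rule might perform to keep a crawler out of a trap. The patterns and thresholds (`CALENDAR_PATTERN`, `MAX_REPEATED_SEGMENTS`, `MAX_PATH_DEPTH`) are illustrative assumptions, not Archive-It defaults; real rules depend on the site being crawled.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns and limits, chosen only for illustration.
CALENDAR_PATTERN = re.compile(r"/calendar/|[?&](month|year|date)=", re.IGNORECASE)
MAX_REPEATED_SEGMENTS = 2
MAX_PATH_DEPTH = 10


def looks_like_trap(url):
    """Return True if the URL matches one of the assumed trap heuristics."""
    parsed = urlparse(url)
    target = parsed.path + ("?" + parsed.query if parsed.query else "")
    # Calendars can generate an endless series of "next month" pages.
    if CALENDAR_PATTERN.search(target):
        return True
    segments = [segment for segment in parsed.path.split("/") if segment]
    # Very deep paths often come from links that keep appending to themselves.
    if len(segments) > MAX_PATH_DEPTH:
        return True
    # The same path segment repeating, e.g. /a/b/a/b/a/b, suggests a loop.
    return any(segments.count(s) > MAX_REPEATED_SEGMENTS for s in set(segments))
```

A crawler could call `looks_like_trap(url)` before queuing each discovered link and skip any URL for which it returns `True`.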
Crawl Report: a report that is generated after each crawl and can be accessed through the Archive-It Partner portal. The “Hosts” tab in the report contains a graphic indicating how much data and how many documents were captured for a given crawl. The “File Type” tab provides a count of the files captured for each file type (e.g., HTML, images).
Heritrix: the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. It is released under an Apache 2.0 license, is written in Java, and requires a Linux operating system.
Internet Archive: A San Francisco-based, 501(c)(3) nonprofit digital library that contains the world’s largest web archive and maintains the Wayback Machine. Founded in 1996 by Brewster Kahle, the Internet Archive provides public access not only to archived websites, but also to software, games, audio, films, and a large collection of digitized public-domain books.
LGA (Longitudinal Graph Analysis): An archival web graph file that includes a complete list of which URIs link to which URIs, along with a timestamp, from a collection’s origin through its current capture. LGA files are ~1% the size of a collection’s aggregate WARC files. LGA files can be requested through Archive-It Research Services (ARS).
Open Source: A term that refers to the ability to use, build upon, and share source code. Open-source code is available to the general public for any use, including commercial use, and can be modified from the original. Open source content can be released under an open source license (e.g., Apache License 2.0, MIT, GNU GPL) that aligns with those values. Heritrix and Brozzler are open source software released under an Apache 2.0 license.
Patch Crawl: (See QA)
Proxy Mode: an “offline browsing” mode that allows for evaluation of archived sites and circumvents leaks from the live web. Proxy mode can be used by QA analysts to check the quality and completeness of archived content in a collection. The Archive-It Wayback Proxy Mode toggle for Firefox is available for Archive-It subscription partners.
robots.txt File: a plain-text file placed at the root of a website that acts as an exclusionary tool, requesting that specified URLs not be captured by web crawlers. By default, Archive-It respects robots.txt requests. These requests can be ignored, however, at the discretion of the web archivist and/or with the permission of the website owner.
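To illustrate how a crawler might honor or override these requests, the sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the user-agent string `example-archive-bot`, and the `ignore_robots` flag are hypothetical; this is not Archive-It's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_capture(url, user_agent="example-archive-bot", ignore_robots=False):
    """Return True if the URL may be captured under the parsed robots.txt rules.

    ignore_robots models an archivist choosing to override the exclusion.
    """
    if ignore_robots:
        return True
    return parser.can_fetch(user_agent, url)
```

Here `may_capture("https://example.org/private/report.html")` is refused by default, while passing `ignore_robots=True` models the archivist's decision to capture the page anyway.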
Scope: limits an administrator assigns to a crawler when performing crawls. Web archivists using Archive-It are able to adjust the range of content that they would like the crawler to archive.
Scheduled Crawl: a regular protocol for the frequency with which a website is crawled. Websites may be crawled frequently, and crawls can be configured to capture sites as they are updated. This crawl option is essential for sites that experience regular updates.
Seeds: user-defined URLs that establish the starting points for a web crawler. Seeds also become the main access points (i.e., landing pages) for the archived sites.
Tableau Public: a free service allowing users to create web-based, interactive data visualizations that can be both downloaded and embedded into websites.
Test Crawl: A crawl that is evaluative and meant to be used before adding permanent content to an Archive-It collection. Test crawls allow for an accurate picture of how seeds will capture and replay. After a test crawl, adjustments can be made and scoping rules added to optimize the success of future production crawls.
Webrecorder: a free, open-source web archiving service, developed by the digital preservation program of the non-profit Rhizome, that focuses on accurately capturing and playing back dynamic content. Rather than establishing a seed scope and allowing crawlers to make large swath captures of websites, Webrecorder users click manually through individual web pages and their content in order to perform captures.
WANE (Web Archive Named Entities): A file generated by using named-entity recognition (NER) tools to create a list of all the people, places, and organizations mentioned in an archived site. WANE datasets are generated using the Stanford Named Entity Recognizer software and, like LGA files, are ~1% the size of the corresponding WARC file. WANE files are structured as JSON files and can be requested through Archive-It Research Services (ARS).
WARC (Web ARChive): An archival file format used by web crawlers to store aggregated web content as a sequence of content blocks. Content blocks in a WARC file may contain resources in any format (e.g., images, audiovisual files, and PDFs), including content embedded in or linked from HTML pages. The WARC file format is the Library of Congress (LOC) preferred format for websites. Most web archiving tools, such as Heritrix and Webrecorder, save web content in the WARC file format.
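As a simplified illustration of the "sequence of content blocks" idea, the Python sketch below builds and then parses a single WARC/1.0 `resource` record using only the standard library. The helper names `build_warc_record` and `parse_warc_record` are hypothetical, and the sketch skips much of what production tools handle (per-record compression, request and response records, digests), so it should be read as an outline of the record layout rather than a usable WARC writer.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type="text/html"):
    """Build one simplified WARC 'resource' record as bytes (payload must be bytes)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from the block.
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return header_block + payload + b"\r\n\r\n"


def parse_warc_record(record):
    """Split a single record into (version line, header fields, content block)."""
    header_block, _, rest = record.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return lines[0], fields, rest[:length]


record = build_warc_record("https://example.org/", b"<html>hello</html>")
version, fields, content = parse_warc_record(record)
```

A WARC file is then a concatenation of such records, which is why a single file can hold HTML pages alongside the images and other resources they reference.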
WAT (Web Archive Transformation): a specification for structuring metadata generated by web crawls. This specification is meant to simplify the analysis of large datasets produced by web crawling and provides an optimal file format (JSON) that can be analyzed in a distributed processing environment such as Hadoop. WAT files can be requested through Archive-It Research Services (ARS).
Wayback Machine: A digital archive of the World Wide Web launched by the Internet Archive in 2001. The Wayback Machine allows users to search versions of a web page across time and, as of April 2019, contains more than 357 billion web pages. Data for the Wayback Machine is accumulated using a web crawler, and the data is stored on a large cluster of Linux nodes.
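Versions of a page are addressed by combining a capture timestamp with the original URL. The Python sketch below assumes the public replay URL pattern `https://web.archive.org/web/<14-digit timestamp>/<original URL>`; the helper names are hypothetical, and Archive-It collections use their own Wayback hostnames, which are not modeled here.

```python
from datetime import datetime

# Assumed prefix of the public Wayback Machine replay URLs.
WAYBACK_PREFIX = "https://web.archive.org/web"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14 digits: YYYYMMDDhhmmss


def wayback_url(original_url, captured_at):
    """Build a replay URL for the capture nearest to the given datetime."""
    timestamp = captured_at.strftime(TIMESTAMP_FORMAT)
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def parse_wayback_url(replay_url):
    """Split a replay URL back into (capture datetime, original URL)."""
    remainder = replay_url[len(WAYBACK_PREFIX) + 1:]
    timestamp, _, original_url = remainder.partition("/")
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT), original_url


example = wayback_url("http://example.org/", datetime(2019, 4, 1, 12, 0, 0))
```

Requesting a timestamp for which no capture exists typically redirects to the closest available capture, which is how the same address pattern supports browsing a page across time.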
QA: Wayback QA is an automated quality assurance tool that scans a Wayback page and identifies documents that were not captured initially by the crawler. QA analysts working in the Archive-It web application have the option to patch missing content back into the Wayback page via a Patch Crawl. Patch crawls typically become viewable in the Wayback Machine within 24 hours.