LITERATURE REVIEW

Belovari, S. (2017). Historians and web archives. Archivaria, 83, 59-79. Retrieved from https://muse.jhu.edu/

This article explores how historians can leverage web archives for historical research. The author explains, however, that historians may not readily take advantage of web archives because they (1) may be concerned with replicating their methodologies and processes and (2) may not be able to find essential and authoritative records. After a thought experiment that illustrates the shortcomings of web archives for future use, Belovari identifies current issues found in 21 International Internet Preservation Consortium (IIPC) partner web archives in 2015. The issues that she identifies include: discovery (no consortial portal, broken links); content, scope, and appraisal issues (limited scope, little appraisal information); searching, browsing, and contextualizing (incomplete records, additional dead links); assistance; provenance and original order (unclear redirects); and web-native features. Although these issues are difficult to remedy due to constraints on resources, labor, and technology, Belovari suggests that a more complete historical record can be created by approaching web archives through the lens of archival practice and principles and by keeping the needs of historians in mind.

Brügger, N. & Finnemann, N. (2013). The web and digital humanities: Theoretical and methodological concerns. Journal of Broadcasting & Electronic Media, 57(1), 66-80. doi: https://doi.org/10.1080/08838151.2012.761699

In this article, the authors approach web archives as both a complex and an incomplete historical record. The article presents three perspectives on digital materials within the humanities, two relating to digital humanities and one relating to Internet Studies. Major consideration is given to the Internet and the nature of web archives as a particular kind of born-digital material, specifically one that “is ‘reborn’, unique, and deficient.” This formulation indicates that the archive is a re-presentation of materials already in existence, that it is an artifact unto itself, and that, because it can never capture the content in its totality, it is never entirely sufficient. A particular set of attributes is given to web archives, including their hypertextuality (site organization; linked content), interactivity (web annotations, comments, tweets), and multimodality (images, videos, sound). As with traditional archival appraisal, attention is given to web archiving strategies as part of the conversation surrounding what to keep, how to keep it, and what should be left behind.

Costea, M. (2018). Report on the scholarly use of web archives. Aarhus, Denmark: Netlab. Retrieved from http://netlab.dk/wp-content/uploads/2018/02/Costea_Report_on_the_Scholarly_Use_of_Web_Archives.pdf

This study examines the ways in which users interact with and conceptualize web archives. Costea collected data through interviews and an online survey, revealing a general lack of knowledge about web archives within the social sciences and the humanities. From interviews with researchers who were familiar with web archives, Costea gleaned details of their interactions, as well as their hopes for future developments that would improve their user experience. Among the reasons given for lack of use were that potential users were not familiar with web archives and, viewing them as a novelty, were unsure of their legitimacy as a resource for their research. They also had concerns about how to cite such resources. Respondents already experienced in using web archives for their research emphasized a desire for improved discoverability options, such as federated searching, rather than searching limited to one specific website. They also wanted search functionality that would let them filter results and isolate different file types, such as video and images, within a website, as well as a keyword search summary with extensive metadata describing the contents of websites and webpages.

Milligan, I. (2016). Lost in the infinite archive: The promise and pitfalls of web archives. International Journal of Humanities and Arts Computing, 10(1), 78-94. doi: 10.3366/ijhac.2016.0161

This article highlights the challenges, as well as the opportunities, posed by web archives. As the author explains, there is much potential in web archive collections that express the voices of “everyday” people. Blogs, journals, and other non-academic texts, for example, allow for a more nuanced understanding of the world, and documenting them via web archiving is an attempt to capture the different modes of expression written by “everyday people of various classes, genders, ethnicities, and ages.” The sheer volume of content on the web, however, creates a tension between the amount of material and the realities of capturing and preserving it. Some of the drawbacks the author notes include the lack of discovery tools for web archives, incomplete captures, and isolated collections that may or may not be representative of a given topic or society. In identifying (and anticipating) the needs of future historians, the author draws on first-hand research with the Internet Archive and Archive-It web archiving teams to identify datasets that may be of use to researchers. These data include WARC files that comprise the Wide Web Scraping of the Web, WAT files that provide metadata, and other various web archives crawled by the Archive-It team. Through exploring these three web archiving case studies, the author is able to explain three ways to create data from web archives as well as the challenges that come with data creation. These challenges include technical considerations, such as computing power and processing, as well as ethical considerations, such as copyright and fair use.

Maemura, E. (2018). What’s cached is prologue: Reviewing recent web archives research towards supporting scholarly use. Proceedings of the Association for Information Science and Technology, 55(1), 327-336. https://doi.org/10.1002/pra2.2018.14505501036

This insightful paper surveys the landscape of web archive research and identifies three common challenges faced in supporting scholarly use of web archives: (1) exploring and organizing a collection, (2) critically examining collection materials, and (3) approaches to ethics and consent. The first challenge highlights several instances where web archives can be made into usable data. These include derivative datasets drawn from WARC files (i.e., WAT, WANE, and LGA data) as well as inquiry into the underlying structure of a website (internal and external linking). While the author highlights many use cases, she also notes that creating useful data from web archives will be an ongoing challenge. The second challenge, critically examining collection materials, has many implications for researchers and information professionals alike. As the author notes, web archives should not be approached in the same manner as the live web. Instead, web archives need to be examined for absences or inconsistencies, triangulated with other resources, and read alongside their documentation, including the metadata. The third challenge, considerations of ethics and consent, is particularly important for web archives. An ethical frame that considers legal implications, accurate representation of people and facts, privacy, and permissions needs to be part of one’s methodology when using web archives as a resource. In addition to these three challenges, the author provides very helpful recommendations, including considering the human and non-human labor that goes into creating and sustaining a web archive.

Niu, J. (2012). Functionalities of web archives. D-Lib Magazine, 18(3/4). doi:10.1045/march2012-niu2

This 2012 report addresses the different functionalities that web archives provide to users. Researcher Jinfang Niu examined ten of the largest web archives and their respective users’ expectations of and experience with the collections, comparing the existing state of web archives with the functionality checklist proposed by the International Internet Preservation Consortium (IIPC) in order to begin establishing an industry standard. Recognizing the limited number of studies related to the potential use and functionalities of web archives, the IIPC created a list of baseline and advanced features to ideally be offered in conjunction with content. These include more basic elements such as the ability to search using a variety of parameters, for example Boolean search, keyword search, federated searching, and searching by content type (e.g., image, video). One of the major elements absent is data mining, a functionality on the checklist not offered by any of the web archives evaluated. Since Niu’s report, the Archives Unleashed Project has developed software tools that have expanded the ability of web archive users to conduct data analysis for their scholarly purposes, perhaps pointing toward third-party institutions developing tools for data analytics rather than each individual web archive, which may not have the resources necessary to develop advanced functionalities.

Stirling, P., Chevallier, P., & Illien, G. (2012). Web archives for researchers: Representations, expectations, and potential uses. D-Lib Magazine, 18(3/4). doi:10.1045/march2012-stirling

This report details the results of a qualitative study by three scholars (Peter Stirling, Philippe Chevallier, and Gildas Illien) at the Bibliothèque nationale de France (BnF), undertaken in response to the new responsibility given to the BnF after passage in 2007 of legislation in France requiring that web archives be collected, preserved, and made accessible by the national library. The study consists of interviews with different library users conducting research at the BnF and was conducted with the goal of increasing global interest in web archives and learning how their use might be increased. The researchers interviewed reported concern about citing materials on the web, as there is no established methodology for using websites within scientific research as evidentiary support for or illustration of a sociological or historical fact. There is also concern with link rot, as there is no way to ensure that the resources linked to will remain online or at the URL cited. Notably, some of the researchers interviewed already collect their own personal web archives by printing out websites or making PDFs or screenshots. Others relied on the Internet Archive but were frustrated by the lack of dynamic content captured and by broken external links. Some were aware of the BnF’s efforts but had not used its resources, anticipating that they would only be of use to future scholars or to scholars of web history.

The report concludes that collection policies for web archives must not be afraid to be selective in limiting the scope of their collections, as this makes the collection comprehensible to potential users and more akin to print publications. Researchers are willing to accept not having infinite materials if they understand the reasoning behind the limited scope and the collection is meaningfully defined. The researchers suggest following a mixed model like that of the BnF, which combines large-scale crawls with focused crawls created through manual selection. The association of web archives with prestigious institutions raises the legitimacy of web archives and helps researchers cite such resources more comfortably. As organizations like BnF and NYARC continue to grow their collections, it seems likely that smaller organizations will begin to follow suit as understanding of and demand for web archives increase.

Szydlowski, N. (2010). Archiving the web: It’s going to have to be a group effort. The Serials Librarian, 59(1), 35-39. doi:10.1080/03615260903534908

This article emphasizes the need for smaller institutions to begin web archiving projects of their own and not assume that the online content relevant to their collections is being captured by larger institutions like the Internet Archive and the Library of Congress. Szydlowski points out that the Internet Archive captures websites only periodically and not necessarily at the same rate as the website’s actual updates. Additionally, valuable local government documents published online with no print equivalent, as well as locally produced institutional content, may not be captured sufficiently to satisfy future research needs. One of the major challenges posed to all web archivers, but especially to smaller institutions with limited resources, is database-backed content management systems that store content outside of the URL’s HTML. Instead, content is stored in an online database and retrieved only when a user request triggers it.