How are web archives created? Technical aspects of web content capture

Web archives are collections created by libraries and other heritage institutions to permanently preserve online heritage. They often contain large amounts of material captured from the web by web crawlers. From a usage perspective, they are often unpredictable, non-transparent and inconsistent data sources with numerous content gaps. In addition to the various social, legislative and institutional circumstances under which they are created, their specific characteristics are largely defined by the heterogeneous, ephemeral and fluid nature of the World Wide Web. Because web archives present numerous challenges, users need to be aware of the circumstances that shape them and, consequently, of the opportunities and pitfalls of using archived data. To shed light on the background of these relatively poorly understood data sources, this paper reviews foundational and other relevant literature and describes primarily the technical aspects of web archive creation. It focuses on the fundamental characteristics of the World Wide Web in the context of preservation, the different approaches to capturing web content and their limitations, and the impact of these circumstances on the nature of web archives, which differ in many ways from more traditional and established data sources.

DOI: https://doi.org/10.3359/oz2530002
