To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages

Authors:

John Berlin,

Mat Kelly,

Michael L. Nelson,

Michele C. Weigle

ACM Transactions on the Web, Volume 17, Issue 4

Article No.: 28, Pages 1 - 49

https://doi.org/10.1145/3589206

Published: 11 July 2023

Abstract

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at the archival time. However, meeting this expectation requires web archives to modify the page and its embedded resources upon replay so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. Both the process of replaying mementos and the modifications that web archives make to the representations vary between archives. As a result, there is no standard terminology for describing replay and the modifications it requires. In this article, we propose terminology for describing the existing styles of replay and the modifications that web archives make to mementos to facilitate replay. Because of issues discovered with server-side only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we decreased the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increased the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
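To make the idea of rewriting links so that they "reference the archive rather than the original server" concrete, the following is a minimal sketch of URL rewriting in Python. It is not the framework proposed in the article and it is not Wombat's code. The function name `rewrite_url`, its parameters, and the list of skipped schemes are assumptions made for illustration; the Wayback-style URL form (`https://web.archive.org/web/<timestamp>/<original URL>`) is used only as a familiar example of an archive prefix.

```python
from urllib.parse import urljoin

# Wayback-style replay prefix, used here purely as an illustrative target.
ARCHIVE_PREFIX = "https://web.archive.org/web"

# Hypothetical list of references that never reach the original server
# and therefore need no rewriting in this sketch.
SKIPPED_PREFIXES = ("data:", "javascript:", "mailto:", "#")


def rewrite_url(url: str, page_url: str, timestamp: str) -> str:
    """Return a URL that points at the archive instead of the original server.

    url       -- the link or resource reference found in the page
    page_url  -- the original (live web) URL of the page containing it
    timestamp -- the 14-digit archival datetime, e.g. "20230711000000"
    """
    # Leave in-page anchors and non-network schemes untouched.
    if url.startswith(SKIPPED_PREFIXES):
        return url
    # Do not rewrite a reference that already points into the archive.
    if url.startswith(ARCHIVE_PREFIX + "/"):
        return url
    # Resolve relative and scheme-relative references against the original
    # page URL, then prepend the archive prefix and the timestamp.
    absolute = urljoin(page_url, url)
    return f"{ARCHIVE_PREFIX}/{timestamp}/{absolute}"
```

A server-side rewriter can apply a function like this to URLs that appear literally in the markup. URLs that are constructed by JavaScript at run time are not visible to such a pass, which is the kind of case a client-side rewriting library is meant to handle.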
