
To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages

Published: 11 July 2023

Abstract

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at the archival time. However, meeting this expectation requires web archives to modify the page and its embedded resources upon replay so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The replay process and the modifications that web archives make to representations vary between archives. Because of this, there is no standard terminology for describing replay and the modifications it requires. In this article, we propose terminology for describing the existing styles of replay and the modifications that web archives make to mementos to facilitate replay. Because of issues discovered with server-side only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we reduced the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increased the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
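To make the rewriting step described in the abstract concrete, the following is a minimal sketch of how a single link on a memento might be pointed at the archive instead of the original server. It assumes a Wayback-style URL layout (archive prefix, then a 14-digit timestamp, then the original URL). The function name `rewrite_url` and its parameters are hypothetical and chosen for illustration; this is not the article's framework or Wombat's implementation.

```python
from urllib.parse import urljoin

# Assumed Wayback-style replay prefix; other archives use different layouts.
ARCHIVE_PREFIX = "https://web.archive.org/web"


def rewrite_url(url: str, page_url: str, timestamp: str,
                prefix: str = ARCHIVE_PREFIX) -> str:
    """Return a URL that references the archive rather than the live web.

    url       -- link as it appears in the archived page (may be relative)
    page_url  -- original URL of the archived page, used to resolve `url`
    timestamp -- capture datetime in YYYYMMDDhhmmss form
    """
    # Fragments and non-network schemes do not hit a server; leave them alone.
    if url.startswith(("#", "data:", "javascript:", "mailto:")):
        return url
    # Do not rewrite a link that already points into the archive.
    if url.startswith(prefix + "/"):
        return url
    # Resolve relative and scheme-relative links against the original page.
    absolute = urljoin(page_url, url)
    return f"{prefix}/{timestamp}/{absolute}"


if __name__ == "__main__":
    page = "http://example.com/news/index.html"
    for link in ("/img/logo.png", "style.css", "//cdn.example.org/a.js"):
        print(rewrite_url(link, page, "20230711000000"))
```

A server-side rewriter can apply a function like this only to URLs it can see in the markup. URLs that JavaScript builds at run time never pass through it, which is the gap that client-side rewriting is meant to close.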


Cited By

  • (2024) Local Government Cybersecurity Landscape: A Systematic Review and Conceptual Framework. Applied Sciences 14:13 (5501). DOI: 10.3390/app14135501. Online publication date: 25-Jun-2024
  • (2023) Hashes are not suitable to verify fixity of the public archived web. PLOS ONE 18:6 (e0286879). DOI: 10.1371/journal.pone.0286879. Online publication date: 9-Jun-2023

      Published In

      ACM Transactions on the Web  Volume 17, Issue 4
      November 2023
      331 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/3608910

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 July 2023
      Online AM: 27 March 2023
      Accepted: 12 February 2023
      Revised: 20 December 2022
      Received: 11 November 2020
      Published in TWEB Volume 17, Issue 4

      Author Tags

      1. Web archiving
      2. JavaScript
      3. replay
      4. client-side
      5. Internet Archive

      Qualifiers

      • Research-article
