Abstract
Web archives do not always capture every resource on every page that they attempt to archive, so archived pages are often missing a portion of their embedded resources. These embedded resources vary in historical value, utility, and importance. The proportion of missing embedded resources therefore does not accurately measure their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users’ perceptions of damage are not accurately estimated by the proportion of missing embedded resources. In fact, the proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that aligns more closely with Web user perception, improving overall agreement with users on memento damage by 17 %, and by 51 % when the mementos have a damage rating delta \(> 0.30\). We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from a damage rating of 0.16 in 1998 to 0.13 in 2013). However, we also show that a greater number of important embedded resources (2.05 per memento on average) are missing over time. In contrast, the damage in WebCite is increasing over time (going from 0.375 in 2007 to 0.475 in 2014), while the proportion of missing embedded resources remains constant (13 % of the resources are missing on average). Finally, we investigate the impact of JavaScript on the damage of the archives, showing that a crawler that can archive JavaScript-dependent representations will reduce memento damage by 13.5 %.
Notes
According to the text at https://archive.org/web/ at the time of authoring.
We executed the wget command with parameters as follows: wget -E -H -k -K -p http://www.xkcd.com/.
Live Web resources may have missing embedded resources, and this results in a calculated \(D_{m_0} > 0\).
The Internet Archive performs URI canonicalization very well, so canonicalization is assumed not to be a source of missing resources.
The Internet Archive has recently added an on-demand archiving utility at http://archive.org/web/ under the heading “Save Page Now” [33].
Archive.today lists the resources it saves and does not save in its FAQ page at http://archive.today/faq.html.
“Undamaged” mementos are mementos without purposefully removed embedded resources. Note that some live Web resources may have damage because they are missing embedded resources, and this damage is reflected in our undamaged and subsequently intentionally damaged mementos.
References
Ainsworth, S.G., Nelson, M.L.: Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive. Int. J. Digit. Libr. 1–16 (2014). doi:10.1007/s00799-014-0120-4
Alnoamany, Y., Alsum, A., Weigle, M., Nelson, M.: Who and what links to the internet archive. In: Proceedings of the Third International Conference on Theory and Practice of Digital Libraries, pp. 346–357. ACM (2013). doi:10.1007/978-3-642-40501-3_35
Archive.today: Archive.today (2013). http://archive.today/
Ayala, B.R., Phillips, M.E., Ko, L.: Current quality assurance practices in web archiving. Technical report (2014)
Banos, V., Manolopoulos, Y.: A quantitative approach to evaluate website archivability using the CLEAR+ method. Int. J. Digit. Libr. 1–24 (2015). doi:10.1007/s00799-015-0144-4
Banos, V., Yunhyong, K., Ross, S., Manolopoulos, Y.: CLEAR: A credible method to evaluate website archivability. In: Proceedings of the 9th International Conference on Preservation of Digital Objects (2013)
Ben Saad, M., Gançarski, S.: Archiving the web using page changes patterns: a case study. In: Proceedings of the 11th Annual International Joint Conference on Digital Libraries, pp. 113–122 (2011). doi:10.1145/1998076.1998098
Ben Saad, M., Gançarski, S.: Archiving the web using page changes patterns: a case study. Int. J. Digit. Libr. 13(1), 33–49 (2012). doi:10.1007/s00799-012-0094-z
Ben Saad, M., Pehlivan, Z., Gançarski, S.: Coherence-oriented crawling and navigation using patterns for web archives. In: Proceedings of the First International Conference on Theory and Practice of Digital Libraries, pp. 421–433 (2011)
Brunelle, J.F.: Google and JavaScript. http://ws-dl.blogspot.com/2014/06/2014-06-18-google-and-javascript.html (2014)
Brunelle, J.F.: Fixing links on the live web, breaking them in the archive. http://ws-dl.blogspot.com/2015/02/2015-02-17-fixing-links-on-live-web.html (2015)
Brunelle, J.F., Kelly, M., Weigle, M.C., Nelson, M.L.: The Impact of JavaScript on archivability. Int. J. Digit. Libr. 1–23 (2015). doi:10.1007/s00799-015-0140-8
Brunelle, J.F., Nelson, M.L.: Zombies in the archives. http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html (2012)
Denev, D., Mazeika, A., Spaniol, M., Weikum, G.: SHARC: framework for quality-conscious web archiving. In: Proceedings of the 35th International Conference on Very Large Data Bases 2, pp. 586–597 (2009). doi:10.1007/s00778-011-0219-9
Eysenbach, G., Trudel, M.: Going, going, still there: using the WebCite service to permanently archive cited web pages. J. Med. Internet Res. 7(5) (2005). doi:10.2196/jmir.7.5.e60
Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006). doi:10.1016/j.patrec.2005.10.010
Fersini, E., Messina, E., Archetti, F.: Enhancing web page classification through image-block importance analysis. Inf. Process. Manag. 44(4), 1431–1447 (2008). doi:10.1016/j.ipm.2007.11.003
GNU: Introduction to GNU Wget. http://www.gnu.org/software/wget/ (2013)
Gray, G., Martin, S.: Choosing a sustainable web archiving method: A comparison of capture quality. D-Lib Mag. 19(5) (2013). doi:10.1045/may2013-gray
Howell, B.A.: Proving web history: how to use the internet archive. J. Internet Law 9(8), 3–9 (2006)
Jack, P.: ExtractorHTML Extract-JavaScript. https://webarchive.jira.com/wiki/display/Heritrix/ExtractorHTML+extract-javascript
Kelly, M., Brunelle, J.F., Weigle, M.C., Nelson, M.L.: On the change in archivability of websites over time. In: Proceedings of the Third International Conference on Theory and Practice of Digital Libraries, pp. 35–47 (2013). doi:10.1007/978-3-642-40501-3_5
Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K., Tobin, R.: Scholarly context not found: one in five articles suffers from reference rot. PLoS One 9(12), e115253 (2014). doi:10.1371/journal.pone.0115253
Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 441–450 (2010). doi:10.1145/1718487.1718542
Marshall, C.C., Shipman, F.M.: On the institutional archiving of social media. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 1–10 (2012). doi:10.1145/2232817.2232819
Mohr, G., Kimpton, M., Stack, M., Ranitovic, I.: Introduction to Heritrix, an archival quality web crawler. In: Proceedings of the 4th International Web Archiving Workshop (2004)
Negulescu, K.C.: Web archiving @ the internet archive. Presentation at the 2010 Digital Preservation Partners Meeting. http://www.digitalpreservation.gov/meetings/documents/ndiipp10/NDIIPP072110FinalIA.ppt (2010)
Nelson, M.L.: Archive.is supports memento. http://ws-dl.blogspot.com/2013/07/2013-07-09-archiveis-supports-memento.html (2013)
Nelson, M.L.: 2014-07-14: “Refresh” for zombies, time jumps. http://ws-dl.blogspot.com/2014/07/2014-07-14-refresh-for-zombies-time.html (2014)
PhantomJS: PhantomJS. http://phantomjs.org/ (2013)
Rademacher, P., Lengyel, J., Cutrell, E., Whitted, T.: Measuring the perception of visual realism in images. In: Rendering Techniques 2001, Eurographics, pp. 235–247. Springer (2001). doi:10.1007/978-3-7091-6242-2_22
Reed, S.: Introduction to umbra. https://webarchive.jira.com/wiki/display/ARIH/Introduction+to+Umbra (2014)
Rossi, A.: Fixing broken links on the internet. https://blog.archive.org/2013/10/25/fixing-broken-links/ (2013)
SalahEldeen, H.M., Nelson, M.L.: Losing my revolution: how many resources shared on social media have been lost? In: Proceedings of the Second International Conference on Theory and Practice of Digital Libraries, pp. 125–137 (2012). doi:10.1007/978-3-642-33290-6_14
SalahEldeen, H.M., Nelson, M.L.: Reading the correct history?: Modeling temporal intention in resource sharing. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’13, pp. 257–266 (2013)
SalahEldeen, H.M., Nelson, M.L.: Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In: Proceedings of the Third International Conference on Theory and Practice of Digital Libraries, pp. 333–345 (2013). doi:10.1007/978-3-642-40501-3_34
Sigurðsson, K.: Incremental crawling with Heritrix. In: Proceedings of the 5th International Web Archiving Workshop (2005)
Singh, R., Bhhatarai, B.D.: Information-theoretic identification of content pages for analyzing user information needs and actions on the multimedia web. In: Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 1806–1810 (2009). doi:10.1145/1529282.1529686
Song, R., Liu, H., Wen, J.R., Ma, W.Y.: Learning block importance models for web pages. In: Proceedings of the 13th International Conference on World Wide Web, pp. 203–211 (2004). doi:10.1145/988672.988700
Spaniol, M., Denev, D., Mazeika, A., Weikum, G., Senellart, P.: Data quality in web archiving. In: Proceedings of the 3rd Workshop on Information Credibility on the Web, pp. 19–26. ACM (2009)
Spaniol, M., Mazeika, A., Denev, D., Weikum, G.: Catch me if you can: Visual analysis of coherence defects in web archiving. In: Proceedings of The 9th International Web Archiving Workshop, pp. 27–37 (2009)
Sun, Y., Zhuang, Z., Giles, C.L.: A large-scale study of robots.txt. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 1123–1124 (2007)
Tofel, B.: ‘Wayback’ for accessing web archives. In: Proceedings of the 7th International Web Archiving Workshop (2007)
Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., Ainsworth, S., Shankar, H.: Memento: Time travel for the web. Technical report arXiv:0911.1112, Los Alamos National Laboratory (2009)
Yi, L., Liu, B., Li, X.: Eliminating noisy information in web pages for data mining. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 296–305 (2003). doi:10.1145/956750.956785
Zhang, X., Lin, W., Xue, P.: Just-noticeable difference estimation with pixels in images. J. Vis. Commun. Image Represent. 19(1), 30–41 (2008). doi:10.1109/TMM.2013.2268053
Acknowledgments
This work was supported in part by the National Science Foundation (NSF) (IIS 1009392), the Library of Congress, and the National Endowment for the Humanities (NEH) Digital Humanities Implementation Grant (DHIG) (HK-50181-14).
Cite this article
Brunelle, J.F., Kelly, M., SalahEldeen, H. et al. Not all mementos are created equal: measuring the impact of missing resources. Int J Digit Libr 16, 283–301 (2015). https://doi.org/10.1007/s00799-015-0150-6
DOI: https://doi.org/10.1007/s00799-015-0150-6