DOI: 10.1145/3383583.3398513
research-article

The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

Published: 01 August 2020

Abstract

The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building, all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.
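
The four activities named in the abstract (filter, extract, aggregate, visualize) can be illustrated with a minimal Python sketch. This is not the Archives Unleashed Toolkit API. The record layout, the `RECORDS` sample, and every function name below are assumptions made purely for illustration. The counts returned by `aggregate` play the role of a small "derivative product" that could be saved and analyzed later with other tools.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical records standing in for pages read from a web archive.
# The field names (url, mime, links) are assumptions, not a real schema.
RECORDS = [
    {"url": "http://example.org/a", "mime": "text/html",
     "links": ["http://example.com/x", "http://example.net/y"]},
    {"url": "http://example.org/b", "mime": "text/html",
     "links": ["http://example.com/z"]},
    {"url": "http://example.org/logo.png", "mime": "image/png", "links": []},
    {"url": "http://example.com/x", "mime": "text/html",
     "links": ["http://example.org/a"]},
]


def filter_records(records, mime="text/html"):
    """Filter: keep only records with the requested MIME type."""
    return [r for r in records if r["mime"] == mime]


def extract_domain_links(records):
    """Extract: yield (source domain, target domain) for every hyperlink."""
    for record in records:
        source = urlparse(record["url"]).netloc
        for link in record["links"]:
            yield source, urlparse(link).netloc


def aggregate(pairs):
    """Aggregate: count how often each domain-to-domain link occurs."""
    return Counter(pairs)


def visualize(counts):
    """Visualize: render the counts as a plain-text bar chart."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{src} -> {dst} {'#' * n}" for (src, dst), n in ordered]


if __name__ == "__main__":
    counts = aggregate(extract_domain_links(filter_records(RECORDS)))
    for line in visualize(counts):
        print(line)
```

Because each step only consumes the output of the previous one, the steps can run at different times and in different tools, which is the disaggregation the abstract describes.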

Cited By

  • (2024) Conclusion: A Highly transformative age for web archives. Exploring the Archived Web during a Highly Transformative Age. DOI: 10.36253/979-12-215-0413-2.29. Online publication date: 2024
  • (2024) Digital curation practices on web and social media archiving in libraries and archives. Journal of Librarianship and Information Science. DOI: 10.1177/09610006241252661. Online publication date: 26-Jul-2024
  • (2024) Are Users of Digital Archives Ready for the AI Era? Obstacles to the Application of Computational Research Methods and New Opportunities. Journal on Computing and Cultural Heritage 16:4 (1-16). DOI: 10.1145/3631125. Online publication date: 24-Jan-2024

    Published In

    JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020
    August 2020
    611 pages
    ISBN: 9781450375856
    DOI: 10.1145/3383583
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. apache spark
    2. cloud platform
    3. technology adoption

    Qualifiers

    • Research-article

    Conference

    JCDL '20
    Acceptance Rates

    Overall Acceptance Rate 415 of 1,482 submissions, 28%
