A survey of web archive search architectures

Published: 13 May 2013

Abstract

Web archives already hold more than 282 billion documents, and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e., full-text search on the web within a user-specified time interval. Performance, scalability, and ease of management are important aspects to take into consideration when choosing a system architecture. We compare these aspects and open the discussion of which search architecture is most suitable for a large-scale web archive.
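To make the notion of time-travel search concrete, the following is a minimal sketch in Python, not the architecture of any system covered by the survey. It assumes each archived version of a page carries a half-open validity interval `[begin, end)` and that a query returns the versions containing all query terms whose interval overlaps the requested one. The names `Version`, `TimeTravelIndex`, and the example URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One archived version of a document (hypothetical record layout)."""
    url: str
    begin: int  # first timestamp at which this version was valid (inclusive)
    end: int    # timestamp at which this version was superseded (exclusive)
    text: str


class TimeTravelIndex:
    """Toy in-memory inverted index whose postings are document versions."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> list of Version

    def add(self, version):
        # Index each distinct term of the version once.
        for term in set(version.text.lower().split()):
            self._postings[term].append(version)

    def search(self, query, t_begin, t_end):
        """Return versions containing every query term and valid at some
        point inside the half-open query interval [t_begin, t_end)."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = []
        # Scan the postings of the first term, then filter by time and terms.
        for v in self._postings.get(terms[0], []):
            overlaps = v.begin < t_end and t_begin < v.end
            words = set(v.text.lower().split())
            if overlaps and all(t in words for t in terms):
                hits.append(v)
        return sorted(hits, key=lambda v: (v.begin, v.url))


index = TimeTravelIndex()
index.add(Version("a.example", 2001, 2005, "web archive search"))
index.add(Version("a.example", 2005, 2010, "web archive portal"))
index.add(Version("b.example", 2008, 2012, "archive search engine"))

for hit in index.search("archive search", 2000, 2004):
    print(hit.url, hit.begin, hit.end)
```

A real web archive would partition such an index across machines, by time, by document, or by term, which is where the performance, scalability, and management trade-offs mentioned in the abstract arise; this sketch deliberately ignores partitioning, ranking, and compression.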

Published In

WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web
May 2013
1636 pages
ISBN:9781450320382
DOI:10.1145/2487788

Sponsors

  • NICBR: Núcleo de Informação e Coordenação do Ponto BR
  • CGIBR: Comitê Gestor da Internet no Brasil

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. portuguese web archive
  2. temporal search

Qualifiers

  • Research-article

Conference

WWW '13
WWW '13: 22nd International World Wide Web Conference
May 13 - 17, 2013
Rio de Janeiro, Brazil

Acceptance Rates

WWW '13 Companion Paper Acceptance Rate: 831 of 1,250 submissions (66%)
Overall Acceptance Rate: 1,899 of 8,196 submissions (23%)
