DOI: 10.1145/3477495.3536321
short-paper
Public Access

ClueWeb22: 10 Billion Web Documents with Rich Information

Published: 07 July 2022

Abstract

ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search.
This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.


      Published In

      SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2022
      3569 pages
      ISBN:9781450387323
      DOI:10.1145/3477495
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. clueweb
      2. dataset
      3. web corpus

      Qualifiers

      • Short-paper

      Conference

      SIGIR '22
      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%
