short-paper

ClueWeb22: 10 Billion Web Documents with Rich Information

Authors: Arnold Overwijk, Chenyan Xiong, Jamie Callan

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 3360 - 3362

https://doi.org/10.1145/3477495.3536321

Published: 07 July 2022

Abstract

ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and has higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, for example, visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search.

This talk shares the design and construction of ClueWeb22, and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning.

Index Terms

ClueWeb22: 10 Billion Web Documents with Rich Information
1. Information systems
  1. Information retrieval
  2. World Wide Web

Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2022

3569 pages

ISBN:9781450387323

DOI:10.1145/3477495

General Chairs: Enrique Amigo (UNED), Pablo Castells (UAM and Amazon), Julio Gonzalo (UNED)

Program Chairs: Ben Carterette (Spotify), J. Shane Culpepper (RMIT University), Gabriella Kazai (Waseda University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

NSF (National Science Foundation)

Conference

SIGIR '22

Sponsor:

SIGIR

SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2022

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
