RedStone: Curating General, Code, Math, and QA Data for Large Language Models

Chang, Yaoyao; Cui, Lei; Dong, Li; Huang, Shaohan; Huang, Yangyu; Huang, Yupan; Li, Scarlett; Lv, Tengchao; Ma, Shuming; Sun, Qinzheng; Wang, Wenhui; Wei, Furu; Xin, Ying; Yang, Mao; Yin, Qiufeng; Zhang, Xingxing

Abstract: Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at \url{this https URL}.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.03398 [cs.CL]
	(or arXiv:2412.03398v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.03398
