YuLan: An Open-source Large Language Model

Zhu, Yutao; Zhou, Kun; Mao, Kelong; Chen, Wentong; Sun, Yiding; Chen, Zhipeng; Cao, Qian; Wu, Yihan; Chen, Yushuo; Wang, Feng; Zhang, Lei; Li, Junyi; Wang, Xiaolei; Wang, Lei; Zhang, Beichen; Dong, Zican; Cheng, Xiaoxue; Chen, Yuhan; Tang, Xinyu; Hou, Yupeng; Ren, Qiangqiang; Pang, Xincheng; Xie, Shufang; Zhao, Wayne Xin; Dou, Zhicheng; Mao, Jiaxin; Lin, Yankai; Song, Ruihua; Xu, Jun; Chen, Xu; Yan, Rui; Wei, Zhewei; Hu, Di; Huang, Wenbing; Gao, Ze-Feng; Chen, Yueguo; Lu, Weizheng; Wen, Ji-Rong

Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was completed in January 2024, and the model achieves performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and code are available at this https URL.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.19853 [cs.CL]
	(or arXiv:2406.19853v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.19853
