Dynamic data sampler for cross-language transfer learning in large language models

Li, Yudong; Feng, Yuhao; Zhou, Wen; Zhao, Zhe; Shen, Linlin; Hou, Cheng; Hou, Xianxu

arXiv:2405.10626 (cs)

[Submitted on 17 May 2024]

Abstract: Large Language Models (LLMs) have gained significant attention in natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English is challenging because large-scale corpora and the requisite computing resources are difficult to acquire. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models cost-effectively. We continue training the LLaMA2 model on a mix of Chinese, English, and parallel corpora, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
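
The abstract does not spell out how the dynamic data sampler schedules its sources, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a linear schedule that shifts sampling probability from unsupervised sources toward instruction data as training progresses. The source names, the mixing ratios in `START_MIX` and `END_MIX`, and the functions `mix_at` and `sample_source` are all hypothetical and chosen purely for illustration.

```python
import random
from typing import Dict

# Hypothetical data sources and mixing ratios (illustrative assumptions,
# not values reported in the paper).
START_MIX = {"chinese": 0.4, "english": 0.3, "parallel": 0.3, "instruction": 0.0}
END_MIX = {"chinese": 0.1, "english": 0.1, "parallel": 0.1, "instruction": 0.7}


def mix_at(step: int, total_steps: int) -> Dict[str, float]:
    """Linearly interpolate sampling probabilities between START_MIX and END_MIX.

    The linear schedule is an assumption; any monotone schedule would fit
    the same pattern.
    """
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress clamped to [0, 1]
    return {name: (1 - t) * START_MIX[name] + t * END_MIX[name] for name in START_MIX}


def sample_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from."""
    mix = mix_at(step, total_steps)
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    total = 10_000
    for step in (0, total // 2, total):
        print(step, mix_at(step, total), sample_source(step, total, rng))
```

Under these assumptions, early batches are drawn almost entirely from the unsupervised Chinese, English, and parallel sources, and the share of instruction data grows smoothly, so there is no hard boundary between the pre-training and fine-tuning stages.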

Comments:	Accepted by ICASSP 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.10626 [cs.CL]
	(or arXiv:2405.10626v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.10626

Submission history

From: Yudong Li
[v1] Fri, 17 May 2024 08:40:51 UTC (3,065 KB)
