CINO: A Chinese Minority Pre-trained Language Model

Yang, Ziqing; Xu, Zihang; Cui, Yiming; Wang, Baoxin; Lin, Min; Wu, Dayong; Chen, Zhigang

arXiv:2202.13558 (cs)

[Submitted on 28 Feb 2022 (v1), last revised 21 Sep 2022 (this version, v2)]

Abstract: Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates the application of natural language processing to low-resource languages. However, current multilingual models still perform poorly on some languages. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of multilingual models on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at this http URL.

Comments:	Accepted to COLING 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2202.13558 [cs.CL]
	(or arXiv:2202.13558v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2202.13558

Submission history

From: Ziqing Yang
[v1] Mon, 28 Feb 2022 06:02:06 UTC (198 KB)
[v2] Wed, 21 Sep 2022 01:43:35 UTC (119 KB)
