Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

Xu, Shusheng; Liu, Yichen; Yi, Xiaoyu; Zhou, Siyuan; Li, Huizi; Wu, Yi

arXiv:2112.06494 (cs)

[Submitted on 13 Dec 2021 (v1), last revised 14 Dec 2021 (this version, v2)]

Abstract: We present Native Chinese Reader (NCR), a new machine reading comprehension (MRC) dataset with particularly long articles in both modern and classical Chinese. NCR is collected from exam questions for the Chinese course in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth. Existing Chinese MRC datasets are either domain-specific or focus on short contexts of a few hundred characters in modern Chinese only. By contrast, NCR contains 8390 documents with an average length of 1024 characters, covering a wide range of Chinese writing styles, including modern articles, classical literature and classical poetry. The 20477 questions on these documents also require strong reasoning ability and common sense to answer correctly. We implemented multiple baseline models using popular Chinese pre-trained models and additionally launched an online competition using our dataset to examine the limits of current methods. The best model achieves 59% test accuracy, while human evaluation shows an average accuracy of 79%, which indicates a significant performance gap between current MRC models and native Chinese speakers. We release the dataset at this https URL.

Comments:	17 pages, 1 figure, accepted by NeurIPS 2021 Track on Datasets and Benchmarks
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2112.06494 [cs.CL]
	(or arXiv:2112.06494v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2112.06494

Submission history

From: Shusheng Xu
[v1] Mon, 13 Dec 2021 09:11:38 UTC (661 KB)
[v2] Tue, 14 Dec 2021 04:25:40 UTC (667 KB)
