Cross-corpus Native Language Identification via Statistical Embedding

Francisco Rangel; Paolo Rosso; Julian Brooke; Alexandra L. Uitdenbogerd

doi:10.18653/v1/W18-1605

Abstract

In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on the available data and must predict the native language of texts from a different corpus. The motivation behind this study is to investigate native language identification in the Australian academic scenario, where a majority of students come from China, Indonesia, and Arabic-speaking nations. We propose a statistical embedding representation that yields a significant improvement over common single-layer approaches of the state of the art when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach is shown to be competitive even when data is scarce and imbalanced.

Anthology ID: W18-1605
Volume: Proceedings of the Second Workshop on Stylistic Variation
Month: June
Year: 2018
Address: New Orleans
Editors: Julian Brooke, Lucie Flekova, Moshe Koppel, Thamar Solorio
Venue: Style-Var
Publisher: Association for Computational Linguistics
Pages: 39–43
URL: https://aclanthology.org/W18-1605
DOI: 10.18653/v1/W18-1605
Cite (ACL): Francisco Rangel, Paolo Rosso, Julian Brooke, and Alexandra Uitdenbogerd. 2018. Cross-corpus Native Language Identification via Statistical Embedding. In Proceedings of the Second Workshop on Stylistic Variation, pages 39–43, New Orleans. Association for Computational Linguistics.
Cite (Informal): Cross-corpus Native Language Identification via Statistical Embedding (Rangel et al., Style-Var 2018)
PDF: https://aclanthology.org/W18-1605.pdf
