Give your Text Representation Models some Love: the Case for Basque

Agerri, Rodrigo; San Vicente, Iñaki; Campos, Jon Ander; Barrena, Ander; Saralegi, Xabier; Soroa, Aitor; Agirre, Eneko

arXiv:2004.00033 (cs)

[Submitted on 31 Mar 2020 (v1), last revised 2 Apr 2020 (this version, v2)]

Abstract: Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

Comments:	Accepted at LREC 2020; 8 pages, 7 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2004.00033 [cs.CL]
	(or arXiv:2004.00033v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2004.00033

Submission history

From: Rodrigo Agerri
[v1] Tue, 31 Mar 2020 18:01:56 UTC (22 KB)
[v2] Thu, 2 Apr 2020 11:46:52 UTC (22 KB)
