Investigating Multi-source Active Learning for Natural Language Inference

Snijders, Ard; Kiela, Douwe; Margatina, Katerina

arXiv:2302.06976 (cs)

[Submitted on 14 Feb 2023]

Abstract: In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools composed of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.

Comments:	23 pages. Accepted for publication at the European Chapter of the Association for Computational Linguistics (EACL) 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2302.06976 [cs.CL]
	(or arXiv:2302.06976v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2302.06976

Submission history

From: Ard Snijders
[v1] Tue, 14 Feb 2023 11:10:18 UTC (25,123 KB)
