Search | arXiv e-print repository

Self-supervised Speech Representations Still Struggle with African American Vernacular English

Authors: Kalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen

Abstract: Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American Eng… ▽ More Underperformance of ASR systems for speakers of African American Vernacular English (AAVE) and other marginalized language varieties is a well-documented phenomenon, and one that reinforces the stigmatization of these varieties. We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. Additionally, the models have higher word error rates on utterances with more phonological and morphosyntactic features of AAVE. Despite the success of SSL speech models in improving ASR for low resource varieties, SSL pre-training alone may not bridge the gap between AAVE and MAE. Our code is publicly available at https://github.com/cmu-llab/s3m-aave. △ Less

Submitted 26 August, 2024; originally announced August 2024.

Comments: INTERSPEECH 2024

arXiv:2407.04158 [pdf, ps, other]

ELCC: the Emergent Language Corpus Collection

Authors: Brendon Boldt, David Mortensen

Abstract: We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora collected from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex tasks like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristi… ▽ More We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora collected from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex tasks like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length). Currently, research studying emergent languages requires directly running different systems which takes time away from actual analyses of such languages, limits the variety of languages that are studied, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora, then, will enable new directions of research which focus their purview on the properties of emergent languages themselves rather than on experimental apparatus. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: 18 pages, 3 figures

ACM Class: I.2.7; I.6.m

arXiv:2407.03456 [pdf, other]

XferBench: a Data-Driven Benchmark for Emergent Language

Authors: Brendon Boldt, David Mortensen

Abstract: In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the "quality" of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for a downstream NLP tasks in human language -- the better t… ▽ More In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the "quality" of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for a downstream NLP tasks in human language -- the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark's validity using human, synthetic, and emergent language baselines. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: 15 pages, 5 figures

ACM Class: I.2.7; I.6.m

Journal ref: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 1475-1489

arXiv:2407.03302 [pdf, other]

A Review of the Applications of Deep Learning-Based Emergent Communication

Authors: Brendon Boldt, David Mortensen

Abstract: Emergent communication, or emergent language, is the field of research which studies how human language-like communication systems emerge de novo in deep multi-agent reinforcement learning environments. The possibilities of replicating the emergence of a complex behavior like language have strong intuitive appeal, yet it is necessary to complement this with clear notions of how such research can b… ▽ More Emergent communication, or emergent language, is the field of research which studies how human language-like communication systems emerge de novo in deep multi-agent reinforcement learning environments. The possibilities of replicating the emergence of a complex behavior like language have strong intuitive appeal, yet it is necessary to complement this with clear notions of how such research can be applicable to other fields of science, technology, and engineering. This paper comprehensively reviews the applications of emergent communication research across machine learning, natural language processing, linguistics, and cognitive science. Each application is illustrated with a description of its scope, an explication of emergent communication's unique role in addressing it, a summary of the extant literature working towards the application, and brief recommendations for near-term research directions. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: 49 pages, 15 figures

ACM Class: I.2.7; I.6.m

Journal ref: Transactions on Machine Learning Research, February 2024

arXiv:2406.16521 [pdf, other]

Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback

Authors: Jimin Sohn, Jeihee Cho, Junyong Lee, Songmu Heo, Ji-Eun Han, David R. Mortensen

Abstract: Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative f… ▽ More Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative feedback has not yet been explored, even though positive and negative feedback are both necessary to grow self-motivation. To facilitate self-motivation, we propose CArrot and STICk (CASTIC) dataset, consisting of 12,590 sentences with 5 different strategies for enhancing self-motivation. Our data and code are publicly available at here. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: 10 pages, 8 figures

arXiv:2406.16030 [pdf, other]

Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages

Authors: Jimin Sohn, Haeji Jung, Alex Cheng, Jooeon Kang, Yilin Du, David R. Mortensen

Abstract: Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significa… ▽ More Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significantly outperforms baseline models in extremely low-resource languages, with the highest average F-1 score (46.38%) and lowest standard deviation (12.67), particularly demonstrating its robustness with non-Latin scripts. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures, 5 tables

arXiv:2406.12725 [pdf]

Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Authors: Atharva Naik, Kexun Zhang, Nathaniel Robinson, Aravind Mysore, Clayton Marr, Hong Sng Rebecca Byrnes, Anna Cai, Kalvin Chang, David Mortensen

Abstract: Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and co… ▽ More Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.05930 [pdf]

Semisupervised Neural Proto-Language Reconstruction

Authors: Liang Lu, Peirong Xie, David R. Mortensen

Abstract: Existing work implementing comparative reconstruction of ancestral languages (proto-languages) has usually required full supervision. However, historical reconstruction models are only of practical value if they can be trained with a limited amount of labeled data. We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cogna… ▽ More Existing work implementing comparative reconstruction of ancestral languages (proto-languages) has usually required full supervision. However, historical reconstruction models are only of practical value if they can be trained with a limited amount of labeled data. We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cognate sets with proto-forms) and a large amount of unlabeled data (cognate sets without proto-forms). We propose a neural architecture for comparative reconstruction (DPD-BiReconstructor) incorporating an essential insight from linguists' comparative method: that reconstructed words should not only be reconstructable from their daughter words, but also deterministically transformable back into their daughter words. We show that this architecture is able to leverage unlabeled cognate sets to outperform strong semisupervised baselines on this novel task. △ Less

Submitted 12 August, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024; v2: correct typo

arXiv:2404.15690 [pdf, other]

Neural Proto-Language Reconstruction

Authors: Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen

Abstract: Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNN and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods, including data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neu… ▽ More Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNN and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods, including data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neural machine translation model for the reconstruction task. We find that with the additional VAE structure, the Transformer model has a better performance on the WikiHan dataset, and the data augmentation step stabilizes the training. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2403.18769 [pdf]

Improved Neural Protoform Reconstruction via Reflex Prediction

Authors: Liang Lu, Jingzhi Wang, David R. Mortensen

Abstract: Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computatio… ▽ More Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computational linguists have attempted to operationalize comparative reconstruction through various computational models, the most successful of which have been supervised encoder-decoder models, which treat the problem of predicting protoforms given sets of reflexes as a sequence-to-sequence problem. We argue that this framework ignores one of the most important aspects of the comparative method: not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms. Leveraging another line of research -- reflex prediction -- we propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model. We show that this more complete implementation of the comparative method allows us to surpass state-of-the-art protoform reconstruction methods on three of four Chinese and Romance datasets. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted to LREC-COLING 2024

arXiv:2403.17856 [pdf, other]

Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs

Authors: David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler

Abstract: Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to w… ▽ More Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility -- the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models -- two proprietary models (GPT-3.5 and GPT-4), three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it and that the 7B parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task, as the massive GPT-4. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: LREC-COLING 2024

arXiv:2403.17760 [pdf, other]

Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons

Authors: Shijia Zhou, Leonie Weissweiler, Taiqi He, Hinrich Schütze, David R. Mortensen, Lori Levin

Abstract: In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort… ▽ More In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLM's understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don't adequately represent their meaning or capture the lexical properties of phrasal heads. △ Less

Submitted 29 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: LREC-COLING 2024

arXiv:2403.13169 [pdf, other]

Wav2Gloss: Generating Interlinear Glossed Text from Speech

Authors: Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

Abstract: Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free transl… ▽ More Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches. △ Less

Submitted 5 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: ACL 2024 camera ready version

arXiv:2402.14279 [pdf, other]

Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen

Abstract: Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provi… ▽ More Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provide a novel solution for robust multilingual language modeling by employing phonemic representations (specifically, using phonemes as input tokens to LMs rather than subwords). We present quantitative evidence from three cross-lingual tasks that demonstrate the effectiveness of phonemic representation, which is further justified by a theoretical analysis of the cross-lingual performance gap. △ Less

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.12998 [pdf]

Phonotactic Complexity across Dialects

Authors: Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen

Abstract: Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly-controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites),… ▽ More Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly-controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites), which enables a more fair comparison across varieties. Even at the dialect level, we find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from a LSTM-based phone-level language model-a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize to correspond to prior hypotheses that language varieties of greater or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but do not find an increase in the negative correlation observed. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted to COLING-LREC 2024

arXiv:2402.01582 [pdf]

Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

Authors: Kalvin Chang, Nathaniel R. Robinson, Anna Cai, Ting Chen, Annie Zhang, David R. Mortensen

Abstract: We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound c… ▽ More We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound change steps between historical protoforms and their modern descendants, replacing a linguistic expert in part of a parsimony-based phylogenetic inference algorithm. In our best experiments on Tukanoan languages, this method produces trees with a Generalized Quartet Distance of 0.12 from a tree that used expert annotations, a significant improvement over other semi-automated baselines. We discuss potential benefits and drawbacks to our neural approach and parsimony-based tree prediction. We also experiment with a minimal generalization learner for automatic sound law induction, finding it comparably effective to sound laws from expert annotation. Our code is publicly available at https://github.com/cmu-llab/aiscp. △ Less

Submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted to LChange 2023

arXiv:2311.00835 [pdf, other]

Calibrated Seq2seq Models for Efficient and Generalizable Ultra-fine Entity Typing

Authors: Yanlin Feng, Adithya Pratapa, David R Mortensen

Abstract: Ultra-fine entity typing plays a crucial role in information extraction by predicting fine-grained semantic types for entity mentions in text. However, this task poses significant challenges due to the massive number of entity types in the output space. The current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization perfor… ▽ More Ultra-fine entity typing plays a crucial role in information extraction by predicting fine-grained semantic types for entity mentions in text. However, this task poses significant challenges due to the massive number of entity types in the output space. The current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization performance or inefficient inference. In this paper, we present CASENT, a seq2seq model designed for ultra-fine entity typing that predicts ultra-fine types with calibrated confidence scores. Our model takes an entity mention as input and employs constrained beam search to generate multiple types autoregressively. The raw sequence probabilities associated with the predicted types are then transformed into confidence scores using a novel calibration method. We conduct extensive experiments on the UFET dataset which contains over 10k types. Our method outperforms the previous state-of-the-art in terms of F1 score and calibration error, while achieving an inference speedup of over 50 times. Additionally, we demonstrate the generalization capabilities of our model by evaluating it in zero-shot and few-shot settings on five specialized domain entity typing datasets that are unseen during training. Remarkably, our model outperforms large language models with 10 times more parameters in the zero-shot setting, and when fine-tuned on 50 examples, it significantly outperforms ChatGPT on all datasets. Our code, models and demo are available at https://github.com/yanlinf/CASENT. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.15113 [pdf]

Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

Authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen

Abstract: Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (i… ▽ More Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results -- through the lens of morphology -- cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading. △ Less

Submitted 26 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: EMNLP 2023

arXiv:2309.07423 [pdf, other]

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

Authors: Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig

Abstract: Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's dive… ▽ More Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: 27 pages, 9 figures, 14 tables

arXiv:2307.01896 [pdf, other]

Transformed Protoform Reconstruction

Authors: Young Min Kim, Kalvin Chang, Chenxuan Cui, David Mortensen

Abstract: Protoform reconstruction is the task of inferring what morphemes or words appeared like in the ancestral languages of a set of daughter languages. Meloni et al. (2021) achieved the state-of-the-art on Latin protoform reconstruction with an RNN-based encoder-decoder with attention model. We update their model with the state-of-the-art seq2seq model: the Transformer. Our model outperforms their mode… ▽ More Protoform reconstruction is the task of inferring what morphemes or words appeared like in the ancestral languages of a set of daughter languages. Meloni et al. (2021) achieved the state-of-the-art on Latin protoform reconstruction with an RNN-based encoder-decoder with attention model. We update their model with the state-of-the-art seq2seq model: the Transformer. Our model outperforms their model on a suite of different metrics on two different datasets: their Romance data of 8,000 cognates spanning 5 languages and a Chinese dataset (Hou 2004) of 800+ cognates spanning 39 varieties. We also probe our model for potential phylogenetic signal contained in the model. Our code is publicly available at https://github.com/cmu-llab/acl-2023. △ Less

Submitted 5 July, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: Accepted at ACL 2023

arXiv:2305.13707 [pdf, other]

Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov

Abstract: Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of ``tokens'' processed or generated by the underlying language models. What constitutes a token, however, is training data… ▽ More Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of ``tokens'' processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API's pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers tend to also come from regions where the APIs are less affordable to begin with. Through these analyses, we aim to increase transparency around language model APIs' pricing policies and encourage the vendors to make them more equitable. △ Less

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2304.02541 [pdf, other]

PWESuite: Phonetic Word Embeddings and Tasks They Facilitate

Authors: Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nathaniel Carlson, Nathaniel Robinson, Mrinmaya Sachan, David Mortensen

Abstract: Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embed… ▽ More Mapping words into a fixed-dimensional vector space is the backbone of modern NLP. While most word embedding methods successfully encode semantic information, they overlook phonetic information that is crucial for many tasks. We develop three methods that use articulatory features to build phonetically informed word embeddings. To address the inconsistent evaluation of existing phonetic word embedding methods, we also contribute a task suite to fairly evaluate past, current, and future methods. We evaluate both (1) intrinsic aspects of phonetic word embeddings, such as word retrieval and correlation with sound similarity, and (2) extrinsic performance on tasks such as rhyme and cognate detection and sound analogies. We hope our task suite will promote reproducibility and inspire future phonetic embedding research. △ Less

Submitted 26 March, 2024; v1 submitted 5 April, 2023; originally announced April 2023.

Comments: LREC-COLING 2024

arXiv:2302.02178 [pdf, other]

Construction Grammar Provides Unique Insight into Neural Language Models

Authors: Leonie Weissweiler, Taiqi He, Naoki Otani, David R. Mortensen, Lori Levin, Hinrich Schütze

Abstract: Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mi… ▽ More Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces. △ Less

Submitted 4 February, 2023; originally announced February 2023.

Comments: GURT 2023

arXiv:2211.15783 [pdf, other]

Mathematically Modeling the Lexicon Entropy of Emergent Language

Authors: Brendon Boldt, David Mortensen

Abstract: We formulate a stochastic process, FiLex, as a mathematical model of lexicon entropy in deep learning-based emergent language systems. Defining a model mathematically allows it to generate clear predictions which can be directly and decisively tested. We empirically verify across four different environments that FiLex predicts the correct correlation between hyperparameters (training steps, lexico… ▽ More We formulate a stochastic process, FiLex, as a mathematical model of lexicon entropy in deep learning-based emergent language systems. Defining a model mathematically allows it to generate clear predictions which can be directly and decisively tested. We empirically verify across four different environments that FiLex predicts the correct correlation between hyperparameters (training steps, lexicon size, learning rate, rollout buffer size, and Gumbel-Softmax temperature) and the emergent language's entropy in 20 out of 20 environment-hyperparameter combinations. Furthermore, our experiments reveal that different environments show diverse relationships between their hyperparameters and entropy which demonstrates the need for a model which can make well-defined predictions at a precise level of granularity. △ Less

Submitted 24 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: 12 pages, 3 figures; added link to GitHub repo

ACM Class: I.2.11; I.2.7; I.6.m

arXiv:2209.06295 [pdf, other]

Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican

Authors: Nathaniel R. Robinson, Cameron J. Hogan, Nancy Fulda, David R. Mortensen

Abstract: Multilingual transfer techniques often improve low-resource machine translation (MT). Many of these techniques are applied without considering data characteristics. We show in the context of Haitian-to-English translation that transfer effectiveness is correlated with amount of training data and relationships between knowledge-sharing languages. Our experiments suggest that for some languages beyo… ▽ More Multilingual transfer techniques often improve low-resource machine translation (MT). Many of these techniques are applied without considering data characteristics. We show in the context of Haitian-to-English translation that transfer effectiveness is correlated with amount of training data and relationships between knowledge-sharing languages. Our experiments suggest that for some languages beyond a threshold of authentic data, back-translation augmentation methods are counterproductive, while cross-lingual transfer from a sufficiently related language is preferred. We complement this finding by contributing a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding. When used with multilingual techniques, orthographic transformation makes statistically significant improvements over conventional methods. And in very low-resource Jamaican MT, code-switching with a transfer language for orthographic resemblance yields a 6.63 BLEU point advantage. △ Less

Submitted 13 September, 2022; originally announced September 2022.

arXiv:2209.02842 [pdf, other]

ASR2K: Speech Recognition for Around 2000 Languages without Audio

Authors: Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

Abstract: Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronu… ▽ More Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances. △ Less

Submitted 6 September, 2022; originally announced September 2022.

Comments: INTERSPEECH 2022

arXiv:2207.09889 [pdf, other]

When Is TTS Augmentation Through a Pivot Language Useful?

Authors: Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

Abstract: Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either.… ▽ More Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either. We propose an alternative: produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language. We investigate when and how this technique is most effective in low-resource settings. In our experiments, using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results. Our findings suggest that searching over a set of candidate pivot languages can lead to marginal improvements and that, surprisingly, ASR performance can by harmed by increases in measured TTS quality. Application of these findings improves ASR by 64.5\% and 45.0\% character error reduction rate (CERR) respectively for two low-resource languages: Guaraní and Suba. △ Less

Submitted 20 July, 2022; originally announced July 2022.

arXiv:2206.11302 [pdf, ps, other]

Recommendations for Systematic Research on Emergent Language

Authors: Brendon Boldt, David Mortensen

Abstract: Emergent language is unique among fields within the discipline of machine learning for its open-endedness, not obviously presenting well-defined problems to be solved. As a result, the current research in the field has largely been exploratory: focusing on establishing new problems, techniques, and phenomena. Yet after these problems have been established, subsequent progress requires research whi… ▽ More Emergent language is unique among fields within the discipline of machine learning for its open-endedness, not obviously presenting well-defined problems to be solved. As a result, the current research in the field has largely been exploratory: focusing on establishing new problems, techniques, and phenomena. Yet after these problems have been established, subsequent progress requires research which can measurably demonstrate how it improves on prior approaches. This type of research is what we call systematic research; in this paper, we illustrate this mode of research specifically for emergent language. We first identify the overarching goals of emergent language research, categorizing them as either science or engineering. Using this distinction, we present core methodological elements of science and engineering, analyze their role in current emergent language research, and recommend how to apply these elements. △ Less

Submitted 22 June, 2022; originally announced June 2022.

Comments: 10 pages

ACM Class: I.2.11; I.6.m; K.4.m

arXiv:2206.11146 [pdf, other]

Modeling Emergent Lexicon Formation with a Self-Reinforcing Stochastic Process

Authors: Brendon Boldt, David Mortensen

Abstract: We introduce FiLex, a self-reinforcing stochastic process which models finite lexicons in emergent language experiments. The central property of FiLex is that it is a self-reinforcing process, parallel to the intuition that the more a word is used in a language, the more its use will continue. As a theoretical model, FiLex serves as a way to both explain and predict the behavior of the emergent la… ▽ More We introduce FiLex, a self-reinforcing stochastic process which models finite lexicons in emergent language experiments. The central property of FiLex is that it is a self-reinforcing process, parallel to the intuition that the more a word is used in a language, the more its use will continue. As a theoretical model, FiLex serves as a way to both explain and predict the behavior of the emergent language system. We empirically test FiLex's ability to capture the relationship between the emergent language's hyperparameters and the lexicon's Shannon entropy. △ Less

Submitted 22 June, 2022; originally announced June 2022.

Comments: Published as workshop paper at EmeCom at ICLR 2022, 8 pages, 14 figures

ACM Class: I.2.11; I.2.7; I.6.m

arXiv:2204.04080 [pdf]

Learning the Ordering of Coordinate Compounds and Elaborate Expressions in Hmong, Lahu, and Chinese

Authors: Chenxuan Cui, Katherine J. Zhang, David R. Mortensen

Abstract: Coordinate compounds (CCs) and elaborate expressions (EEs) are coordinate constructions common in languages of East and Southeast Asia. Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) these phonological hierarchies lack a clear phonetic rationale. These claims are significant because morphosyntax… ▽ More Coordinate compounds (CCs) and elaborate expressions (EEs) are coordinate constructions common in languages of East and Southeast Asia. Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) these phonological hierarchies lack a clear phonetic rationale. These claims are significant because morphosyntax has often been seen as in a feed-forward relationship with phonology, and phonological generalizations have often been assumed to be phonetically "natural". We investigate whether the ordering of CCs and EEs can be learned empirically and whether computational models (classifiers and sequence labeling models) learn unnatural hierarchies similar to those posited by Mortensen (2006). We find that decision trees and SVMs learn to predict the order of CCs/EEs on the basis of phonology, with DTs learning hierarchies strikingly similar to those proposed by Mortensen. However, we also find that a neural sequence labeling model is able to learn the ordering of elaborate expressions in Hmong very effectively without using any phonological information. We argue that EE ordering can be learned through two independent routes: phonology and lexical distribution, presenting a more nuanced picture than previous work. [ISO 639-3:hmn, lhu, cmn] △ Less

Submitted 3 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: NAACL2022 Oral

arXiv:2203.13901 [pdf, other]

AUTOLEX: An Automatic Framework for Linguistic Exploration

Authors: Aditi Chaudhary, Zaid Sheikh, David R Mortensen, Antonios Anastasopoulos, Graham Neubig

Abstract: Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires both a deep… ▽ More Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires both a deep understanding of the language at hand and linguistics as a whole. We propose an automatic framework AutoLEX that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena. Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order, across several languages. We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible. △ Less

Submitted 25 March, 2022; originally announced March 2022.

Comments: 9 pages

arXiv:2110.13382 [pdf, other]

doi 10.3847/1538-4365/ac324a

TESS Eclipsing Binary Stars. I. Short cadence observations of 4584 eclipsing binaries in Sectors 1-26

Authors: Andrej Prsa, Angela Kochoska, Kyle E. Conroy, Nora Eisner, Daniel R. Hey, Luc IJspeert, Ethan Kruse, Scott W. Fleming, Cole Johnston, Martti H. Kristiansen, Daryll LaCourse, Danielle Mortensen, Joshua Pepper, Keivan G. Stassun, Guillermo Torres, Michael Abdul-Masih, Joheen Chakraborty, Robert Gagliano, Zhao Guo, Kelly Hambleton, Kyeongsoo Hong, Thomas Jacobs, David Jones, Veselin Kostov, Jae Woo Lee , et al. (22 additional authors not shown)

Abstract: In this paper we present a catalog of 4584 eclipsing binaries observed during the first two years (26 sectors) of the TESS survey. We discuss selection criteria for eclipsing binary candidates, detection of hither-to unknown eclipsing systems, determination of the ephemerides, the validation and triage process, and the derivation of heuristic estimates for the ephemerides. Instead of keeping to th… ▽ More In this paper we present a catalog of 4584 eclipsing binaries observed during the first two years (26 sectors) of the TESS survey. We discuss selection criteria for eclipsing binary candidates, detection of hither-to unknown eclipsing systems, determination of the ephemerides, the validation and triage process, and the derivation of heuristic estimates for the ephemerides. Instead of keeping to the widely used discrete classes, we propose a binary star morphology classification based on a dimensionality reduction algorithm. Finally, we present statistical properties of the sample, we qualitatively estimate completeness, and discuss the results. The work presented here is organized and performed within the TESS Eclipsing Binary Working Group, an open group of professional and citizen scientists; we conclude by describing ongoing work and future goals for the group. The catalog is available from http://tessEBs.villanova.edu and from MAST. △ Less

Submitted 25 October, 2021; originally announced October 2021.

Comments: 32 pages, 21 figures, accepted to ApJ Supplement Series; comments welcome

arXiv:2110.05775 [pdf, other]

Quantifying Cognitive Factors in Lexical Decline

Authors: David Francis, Ella Rabinovich, Farhan Samir, David Mortensen, Suzanne Stevenson

Abstract: We adopt an evolutionary view on language change in which cognitive factors (in addition to social ones) affect the fitness of words and their success in the linguistic ecosystem. Specifically, we propose a variety of psycholinguistic factors -- semantic, distributional, and phonological -- that we hypothesize are predictive of lexical decline, in which words greatly decrease in frequency over tim… ▽ More We adopt an evolutionary view on language change in which cognitive factors (in addition to social ones) affect the fitness of words and their success in the linguistic ecosystem. Specifically, we propose a variety of psycholinguistic factors -- semantic, distributional, and phonological -- that we hypothesize are predictive of lexical decline, in which words greatly decrease in frequency over time. Using historical data across three languages (English, French, and German), we find that most of our proposed factors show a significant difference in the expected direction between each curated set of declining words and their matched stable words. Moreover, logistic regression analyses show that semantic and distributional factors are significant in predicting declining words. Further diachronic analysis reveals that declining words tend to decrease in the diversity of their lexical contexts over time, gradually narrowing their 'ecological niches'. △ Less

Submitted 12 October, 2021; originally announced October 2021.

Comments: Transactions of the Association for Computational Linguistics (TACL) 2021, 16 pages

arXiv:2107.11628 [pdf, other]

Differentiable Allophone Graphs for Language-Universal Speech Recognition

Authors: Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

Abstract: Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supe… ▽ More Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen language. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages. △ Less

Submitted 24 July, 2021; originally announced July 2021.

Comments: INTERSPEECH 2021. Contains additional studies on phone recognition for unseen languages

arXiv:2104.01624 [pdf, other]

Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties

Authors: Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David Mortensen, Michael R. Marlo, Graham Neubig

Abstract: Models pre-trained on multiple languages have shown significant promise for improving speech recognition, particularly for low-resource languages. In this work, we focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation, which incorporates phonological knowledge through a language-dependent allophone layer that associates a universal narrow… ▽ More Models pre-trained on multiple languages have shown significant promise for improving speech recognition, particularly for low-resource languages. In this work, we focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation, which incorporates phonological knowledge through a language-dependent allophone layer that associates a universal narrow phone-set with the phonemes that appear in each language. To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda. To our knowledge, these datasets are the first of their kind. We carry out similar experiments on the dataset of an endangered Tangkhulic language, East Tusom, a Tibeto-Burman language variety spoken mostly in India. We explore both zero-shot and few-shot recognition by fine-tuning using datasets of varying sizes (10 to 1000 utterances). We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates. △ Less

Submitted 4 April, 2021; originally announced April 2021.

arXiv:2104.00824 [pdf]

Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments

Authors: David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu

Abstract: There is growing interest in ASR systems that can recognize phones in a language-independent fashion. There is additionally interest in building language technologies for low-resource and endangered languages. However, there is a paucity of realistic data that can be used to test such systems and technologies. This paper presents a publicly available, phonetically transcribed corpus of 2255 uttera… ▽ More There is growing interest in ASR systems that can recognize phones in a language-independent fashion. There is additionally interest in building language technologies for low-resource and endangered languages. However, there is a paucity of realistic data that can be used to test such systems and technologies. This paper presents a publicly available, phonetically transcribed corpus of 2255 utterances (words and short phrases) in the endangered Tangkhulic language East Tusom (no ISO 639-3 code), a Tibeto-Burman language variety spoken mostly in India. Because the dataset is transcribed in terms of phones, rather than phonemes, it is a better match for universal phone recognition systems than many larger (phonemically transcribed) datasets. This paper describes the dataset and the methodology used to produce it. It further presents basic benchmarks of state-of-the-art universal phone recognition systems on the dataset as baselines for future experiments. △ Less

Submitted 1 April, 2021; originally announced April 2021.

Comments: 4 pages, 3 figures

arXiv:2103.16590 [pdf, other]

Evaluating the Morphosyntactic Well-formedness of Generated Texts

Authors: Adithya Pratapa, Antonios Anastasopoulos, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig, Yulia Tsvetkov

Abstract: Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various… ▽ More Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages. △ Less

Submitted 9 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: EMNLP 2021 camera-ready

arXiv:2010.01160 [pdf, other]

Automatic Extraction of Rules Governing Morphological Agreement

Authors: Aditi Chaudhary, Antonios Anastasopoulos, Adithya Pratapa, David R. Mortensen, Zaid Sheikh, Yulia Tsvetkov, Graham Neubig

Abstract: Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focu… ▽ More Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results. Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to those created with large amounts of gold-standard annotated data. We confirm this finding with human expert evaluations of the rules that our framework produces, which have an average accuracy of 78%. We release an interface demonstrating the extracted rules at https://neulab.github.io/lase/. △ Less

Submitted 5 October, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

Comments: Accepted at EMNLP 2020

arXiv:2006.09336 [pdf, other]

Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks

Authors: Jimin Sun, Hwijeen Ahn, Chan Young Park, Yulia Tsvetkov, David R. Mortensen

Abstract: Much work in cross-lingual transfer learning explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically-motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic… ▽ More Much work in cross-lingual transfer learning explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically-motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic features that capture cross-cultural similarities that manifest in linguistic patterns and quantify distinct aspects of language pragmatics: language context-level, figurative language, and the lexification of emotion concepts. Our analyses show that the proposed pragmatic features do capture cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology. We further corroborate the effectiveness of pragmatically-driven transfer in the downstream task of choosing transfer languages for cross-lingual sentiment analysis. △ Less

Submitted 8 April, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: EACL 2021

arXiv:2006.04334 [pdf, other]

Characterizing Sociolinguistic Variation in the Competing Vaccination Communities

Authors: Shahan Ali Memon, Aman Tyagi, David R. Mortensen, Kathleen M. Carley

Abstract: Public health practitioners and policy makers grapple with the challenge of devising effective message-based interventions for debunking public health misinformation in cyber communities. "Framing" and "personalization" of the message is one of the key features for devising a persuasive messaging strategy. For an effective health communication, it is imperative to focus on "preference-based framin… ▽ More Public health practitioners and policy makers grapple with the challenge of devising effective message-based interventions for debunking public health misinformation in cyber communities. "Framing" and "personalization" of the message is one of the key features for devising a persuasive messaging strategy. For an effective health communication, it is imperative to focus on "preference-based framing" where the preferences of the target sub-community are taken into consideration. To achieve that, it is important to understand and hence characterize the target sub-communities in terms of their social interactions. In the context of health-related misinformation, vaccination remains to be the most prevalent topic of discord. Hence, in this paper, we conduct a sociolinguistic analysis of the two competing vaccination communities on Twitter: "pro-vaxxers" or individuals who believe in the effectiveness of vaccinations, and "anti-vaxxers" or individuals who are opposed to vaccinations. Our data analysis show significant linguistic variation between the two communities in terms of their usage of linguistic intensifiers, pronouns, and uncertainty words. Our network-level analysis show significant differences between the two communities in terms of their network density, echo-chamberness, and the EI index. We hypothesize that these sociolinguistic differences can be used as proxies to characterize and understand these communities to devise better message interventions. △ Less

Submitted 4 October, 2020; v1 submitted 7 June, 2020; originally announced June 2020.

Comments: 11 pages, 4 tables, 1 figure, 1 algorithm, accepted to SBP-BRiMS 2020 -- International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation

arXiv:2005.07203 [pdf, other]

doi 10.3847/1538-3881/ab93dd

TESS Data for Asteroseismology: Timing verification

Authors: Carolina von Essen, Mikkel N. Lund, Rasmus Handberg, Marina S. Sosa, Julie Thiim Gadeberg, Hans Kjeldsen, Roland K. Vanderspek, Dina S. Mortensen, M. Mallonn, L. Mammana, Edward H. Morgan, Jesus Noel S. Villasenor, Michael M. Fausnaugh, George R. Ricker

Abstract: The Transiting Exoplanet Survey Satellite (TESS) is NASA's latest space telescope dedicated to the discovery of transiting exoplanets around nearby stars. Besides the main goal of the mission, asteroseismology is an important secondary goal and very relevant for the high-quality time series that TESS will make during its two year all-sky survey. Using TESS for asteroseismology introduces strong ti… ▽ More The Transiting Exoplanet Survey Satellite (TESS) is NASA's latest space telescope dedicated to the discovery of transiting exoplanets around nearby stars. Besides the main goal of the mission, asteroseismology is an important secondary goal and very relevant for the high-quality time series that TESS will make during its two year all-sky survey. Using TESS for asteroseismology introduces strong timing requirements, especially for coherent oscillators. Although the internal clock on board TESS is precise in its own time, it might have a constant drift and will thus need calibration, or offsets might inadvertently be introduced. Here we present simultaneously ground- and space-based observations of primary eclipses of several binary systems in the Southern ecliptic hemisphere, used to verify the reliability of the TESS timestamps. From twelve contemporaneous TESS/ground observations we determined a time offset equal to 5.8 +/- 2.5 sec, in the sense that the Barycentric time measured by TESS is ahead of real time. The offset is consistent with zero at 2.3-sigma level. In addition, we used 405 individually measured mid-eclipse times of 26 eclipsing binary stars observed solely by TESS to test the existence of a potential drift with a monotonic growth (or decay) affecting the observations of all stars. We find a drift corresponding to sigma_drift = 0.009 +/- 0.015 sec/day. We find that the measured offset is of a size that will not become an issue for comparing ground-based and space data for coherent oscillations for most of the targets observed with TESS. △ Less

Submitted 14 May, 2020; originally announced May 2020.

Comments: 15 pages, 9 figures

arXiv:2004.08031 [pdf, other]

AlloVera: A Multilingual Allophone Database

Authors: David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

Abstract: We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a uni… ▽ More We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology. △ Less

Submitted 16 April, 2020; originally announced April 2020.

Comments: 8 pages, LREC 2020

arXiv:2002.11800 [pdf, other]

Universal Phone Recognition with a Multilingual Allophone System

Authors: Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W Black, Florian Metze

Abstract: Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This c… ▽ More Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This can lead to performance degradation when combining a variety of training languages, as identically annotated phonemes can actually correspond to several different underlying phonetic realizations. In this work, we propose a joint model of both language-independent phone and language-dependent phoneme distributions. In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute in low-resource conditions. Additionally, because we are explicitly modeling language-independent phones, we can build a (nearly-)universal phone recognizer that, when combined with the PHOIBLE large, manually curated database of phone inventories, can be customized into 2,000 language dependent recognizers. Experiments on two low-resourced indigenous languages, Inuktitut and Tusom, show that our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world. △ Less

Submitted 26 February, 2020; originally announced February 2020.

Comments: ICASSP 2020

arXiv:2002.11781 [pdf, other]

Towards Zero-shot Learning for Automatic Phonemic Transcription

Authors: Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W Black, Florian Metze

Abstract: Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data… ▽ More Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data. The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes. In this work, we address this problem by adopting the idea of zero-shot learning. Our model is able to recognize unseen phonemes in the target language without any training data. In our model, we decompose phonemes into corresponding articulatory attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over articulatory attributes, and then compute phoneme distributions with a customized acoustic model. We evaluate our model by training it using 13 languages and testing it using 7 unseen languages. We find that it achieves 7.7% better phoneme error rate on average over a standard multilingual model. △ Less

Submitted 26 February, 2020; originally announced February 2020.

Comments: AAAI 2020

arXiv:2001.07740 [pdf, other]

doi 10.7275/1jra-8m83

Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods

Authors: Maria Ryskina, Ella Rabinovich, Taylor Berg-Kirkpatrick, David R. Mortensen, Yulia Tsvetkov

Abstract: We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find… ▽ More We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find more support for the latter hypothesis. Besides presenting a new linguistic application of distributional semantics, this study tackles the linguistic question of the role of language-internal factors (in our case, sparsity) in language change motivated by language-external factors (reflected in frequency growth). △ Less

Submitted 21 January, 2020; originally announced January 2020.

Comments: SCiL 2020

Journal ref: Proceedings of the Society for Computation in Linguistics 3.1 (2020): 43-52

arXiv:1911.02709 [pdf, other]

Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation

Authors: Zhong Zhou, Lori Levin, David R. Mortensen, Alex Waibel

Abstract: We demonstrate a new approach to Neural Machine Translation (NMT) for low-resource languages using a ubiquitous linguistic resource, Interlinear Glossed Text (IGT). IGT represents a non-English sentence as a sequence of English lemmas and morpheme labels. As such, it can serve as a pivot or interlingua for NMT. Our contribution is four-fold. Firstly, we pool IGT for 1,497 languages in ODIN (54,545… ▽ More We demonstrate a new approach to Neural Machine Translation (NMT) for low-resource languages using a ubiquitous linguistic resource, Interlinear Glossed Text (IGT). IGT represents a non-English sentence as a sequence of English lemmas and morpheme labels. As such, it can serve as a pivot or interlingua for NMT. Our contribution is four-fold. Firstly, we pool IGT for 1,497 languages in ODIN (54,545 glosses) and 70,918 glosses in Arapaho and train a gloss-to-target NMT system from IGT to English, with a BLEU score of 25.94. We introduce a multilingual NMT model that tags all glossed text with gloss-source language tags and train a universal system with shared attention across 1,497 languages. Secondly, we use the IGT gloss-to-target translation as a key step in an English-Turkish MT system trained on only 865 lines from ODIN. Thirdly, we we present five metrics for evaluating extremely low-resource translation when BLEU is no longer sufficient and evaluate the Turkish low-resource system using BLEU and also using accuracy of matching nouns, verbs, agreement, tense, and spurious repetition, showing large improvements. △ Less

Submitted 3 March, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

arXiv:1907.10129 [pdf, other]

CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology

Authors: Aditi Chaudhary, Elizabeth Salesky, Gayatri Bhat, David R. Mortensen, Jaime G. Carbonell, Yulia Tsvetkov

Abstract: This paper presents the submission by the CMU-01 team to the SIGMORPHON 2019 task 2 of Morphological Analysis and Lemmatization in Context. This task requires us to produce the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach this task with a hierarchical neural conditional random field (CRF) model which predicts each coarse-grained feature (eg. PO… ▽ More This paper presents the submission by the CMU-01 team to the SIGMORPHON 2019 task 2 of Morphological Analysis and Lemmatization in Context. This task requires us to produce the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach this task with a hierarchical neural conditional random field (CRF) model which predicts each coarse-grained feature (eg. POS, Case, etc.) independently. However, most treebanks are under-resourced, thus making it challenging to train deep neural models for them. Hence, we propose a multi-lingual transfer training regime where we transfer from multiple related languages that share similar typology. △ Less

Submitted 23 July, 2019; originally announced July 2019.

Comments: In Proceedings of the ACL-SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology

arXiv:1902.08899 [pdf, other]

The ARIEL-CMU Systems for LoReHLT18

Authors: Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W Black, Jaime Carbonell, Graham V. Horwood , et al. (5 additional authors not shown)

Abstract: This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech). This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech). △ Less

Submitted 24 February, 2019; originally announced February 2019.

arXiv:1808.09500 [pdf]

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Authors: Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, Jaime G. Carbonell

Abstract: Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel cor… ▽ More Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel corpora nor bilingual dictionaries and provides a significant gain in performance over previous methods relying on these resources. We demonstrate the effectiveness of our approaches on Named Entity Recognition for four languages, namely Uyghur, Turkish, Bengali and Hindi, of which Uyghur and Bengali are low resource languages, and also perform experiments on Machine Translation. Exploiting subwords with transfer learning gives us a boost of +15.2 NER F1 for Uyghur and +9.7 F1 for Bengali. We also show improvements in the monolingual setting where we achieve (avg.) +3 F1 and (avg.) +1.35 BLEU. △ Less

Submitted 28 August, 2018; originally announced August 2018.

Comments: Accepted at EMNLP 2018

arXiv:1807.08059 [pdf]

doi 10.1063/1.5049383

An Improved Laboratory-Based XAFS and XES Spectrometer for Analytical Applications in Materials Chemistry Research

Authors: Evan P. Jahrman, William M. Holden, Alex S. Ditter, Devon R. Mortensen, Gerald T. Seidler, Timothy T. Fister, Stosh A. Kozimor, Louis F. J. Piper, Jatinkumar Rana, Neil C. Hyatt, Martin C. Stennett

Abstract: X-ray absorption fine structure (XAFS) and x-ray emission spectroscopy (XES) are advanced x-ray spectroscopies that impact a wide range of disciplines. However, unlike the majority of other spectroscopic methods, XAFS and XES are accompanied by an unusual access model, wherein; the dominant use of the technique is for premier research studies at world-class facilities, i.e., synchrotron x-ray ligh… ▽ More X-ray absorption fine structure (XAFS) and x-ray emission spectroscopy (XES) are advanced x-ray spectroscopies that impact a wide range of disciplines. However, unlike the majority of other spectroscopic methods, XAFS and XES are accompanied by an unusual access model, wherein; the dominant use of the technique is for premier research studies at world-class facilities, i.e., synchrotron x-ray light sources. In this paper we report the design and performance of an improved spectrometer XAFS and XES based on the general conceptual design of Seidler, et al., Rev. Sci. Instrum. 2014. New developments include reduced mechanical degrees of freedom, much-increased flux, and a wider Bragg angle range to enable extended x-ray absorption fine structure (EXAFS) for the first time with this type of modern laboratory XAFS configuration. This instrument enables a new class of routine applications that are incompatible with the mission and access model of the synchrotron light sources. To illustrate this, we provide numerous examples of x-ray absorption near edge structure (XANES), EXAFS, and XES results for a variety of problems and energy ranges. Highlights include XAFS and XES measurements of battery electrode materials, EXAFS of Ni and V with full modeling of results to validate monochromator performance, valence-to-core XES for 3d transition metal compounds, and uranium XANES and XES for different oxidation states. Taken en masse, these results further support the growing perspective that modern laboratory-based XAFS and XES have the potential to develop a new branch of analytical chemistry. △ Less

Submitted 20 July, 2018; originally announced July 2018.

Showing 1–50 of 56 results for author: Mortensen, D