
Showing 1–13 of 13 results for author: Chang, T A

  1. arXiv:2403.13754  [pdf, other]

    cs.CL

    Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

    Authors: Catherine Arnett, Pamela D. Rivière, Tyler A. Chang, Sean Trott

    Abstract: The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then…

    Submitted 20 March, 2024; originally announced March 2024.

  2. arXiv:2403.08904  [pdf, other]

    cs.CL

    Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics

    Authors: Tyler A. Chang, Katrin Tomanek, Jessica Hoffmann, Nithum Thain, Erin van Liemt, Kathleen Meier-Hellstern, Lucas Dixon

    Abstract: We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  3. arXiv:2403.00686  [pdf, other]

    cs.CL

    A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

    Authors: Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

    Abstract: How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use li…

    Submitted 1 March, 2024; originally announced March 2024.

  4. arXiv:2311.09205  [pdf, other]

    cs.CL

    When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

    Authors: Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

    Abstract: Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We…

    Submitted 15 November, 2023; originally announced November 2023.

  5. arXiv:2311.09194  [pdf, other]

    cs.CL

    Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

    Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

    Abstract: Abstract grammatical knowledge - of parts of speech and grammatical patterns - is key to the capacity for linguistic generalization in humans. But how abstract is grammatical knowledge in large language models? In the human literature, compelling evidence for grammatical abstraction comes from structural priming. A sentence that shares the same grammatical structure as a preceding sentence is proc…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023

  6. arXiv:2310.07929  [pdf, other]

    cs.CL

    Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

    Authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen

    Abstract: Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-trainin…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: Extended abstract accepted to the 3rd Multilingual Representation Learning workshop at EMNLP 2023

  7. arXiv:2308.15419  [pdf, other]

    cs.CL

    Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

    Authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen

    Abstract: How do language models learn to make predictions during pre-training? To study this question, we extract learning curves from five autoregressive English language model pre-training runs, for 1M tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We quantify the final surprisal, within-run variability,…

    Submitted 29 August, 2023; originally announced August 2023.

  8. arXiv:2305.17127  [pdf, other]

    cs.CL

    Characterizing and Measuring Linguistic Dataset Drift

    Authors: Tyler A. Chang, Kishaloy Halder, Neha Anna John, Yogarshi Vyas, Yassine Benajiba, Miguel Ballesteros, Dan Roth

    Abstract: NLP models often degrade in performance when real world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  9. arXiv:2303.11504  [pdf, ps, other]

    cs.CL

    Language Model Behavior: A Comprehensive Survey

    Authors: Tyler A. Chang, Benjamin K. Bergen

    Abstract: Transformer language models have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English language model behavior before task-specific fine-tuning. Language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sen…

    Submitted 25 August, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: 32 pages, accepted to Computational Linguistics

  10. arXiv:2205.10964  [pdf, other]

    cs.CL

    The Geometry of Multilingual Language Model Representations

    Authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen

    Abstract: We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language. Using XLM-R as a case study, we show that languages occupy similar linear subspaces after mean-centering, evaluated based on causal effects on language modeling performance and direct comparisons between subspaces for 88 languages. The…

    Submitted 21 October, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

    Comments: Accepted to EMNLP 2022

  11. arXiv:2110.02406  [pdf, other]

    cs.CL

    Word Acquisition in Neural Language Models

    Authors: Tyler A. Chang, Benjamin K. Bergen

    Abstract: We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words on the MacArthur-Bates Communicative Development Inventory (Fenson et al., 2007). Drawing on studies of word acquisition in children, we evaluate multiple predictors for words' ages of acquisition in LSTMs, BERT, and GPT-2. We find that the effec…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted to TACL (pre-MIT Press version)

  12. arXiv:2106.05505  [pdf, other]

    cs.CL

    Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

    Authors: Tyler A. Chang, Yifan Xu, Weijian Xu, Zhuowen Tu

    Abstract: In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-IJCNLP 2021

  13. arXiv:2005.08177  [pdf, other]

    cs.CL cs.LG

    Encodings of Source Syntax: Similarities in NMT Representations Across Target Languages

    Authors: Tyler A. Chang, Anna N. Rafferty

    Abstract: We train neural machine translation (NMT) models from English to six target languages, using NMT encoder representations to predict ancestor constituent labels of source language words. We find that NMT encoders learn similar source syntax regardless of NMT target language, relying on explicit morphosyntactic cues to extract syntactic features from source sentences. Furthermore, the NMT encoders o…

    Submitted 17 May, 2020; originally announced May 2020.

    Comments: To appear at the 5th Workshop on Representation Learning for NLP