
Showing 1–20 of 20 results for author: Bamman, D

Searching in archive cs.
  1. arXiv:2401.06408

    cs.CL

    AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

    Authors: Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren F. Klein, Jesse Dodge

    Abstract: Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-des…

    Submitted 20 June, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: 28 pages, 13 figures. Association for Computational Linguistics (ACL) 2024

  2. arXiv:2311.09130

    cs.CL

    Social Meme-ing: Measuring Linguistic Variation in Memes

    Authors: Naitian Zhou, David Jurgens, David Bamman

    Abstract: Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their…

    Submitted 15 November, 2023; originally announced November 2023.

  3. arXiv:2305.17561

    cs.CL

    Grounding Characters and Places in Narrative Texts

    Authors: Sandeep Soni, Amanpreet Sihra, Elizabeth F. Evans, Matthew Wilkens, David Bamman

    Abstract: Tracking characters and locations throughout a story can help improve the understanding of its plot structure. Prior research has analyzed characters and locations from text independently without grounding characters to their locations in narrative time. Here, we address this gap by proposing a new spatial relationship categorization task. The objective of the task is to assign a spatial relations…

    Submitted 27 May, 2023; originally announced May 2023.

    Comments: 12 pages, 4 figures, 5 tables; to appear in the proceedings of ACL 2023

  4. arXiv:2305.16648

    cs.CL

    Dramatic Conversation Disentanglement

    Authors: Kent K. Chang, Danica Chen, David Bamman

    Abstract: We present a new dataset for studying conversation disentanglement in movies and TV series. While previous work has focused on conversation disentanglement in IRC chatroom dialogues, movies and TV shows provide a space for studying complex pragmatic patterns of floor and topic change in face-to-face multi-party interactions. In this work, we draw on theoretical research in sociolinguistics, sociol…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: 25 pages, 5 figures, accepted to ACL 2023 Findings

  5. arXiv:2305.00118

    cs.CL

    Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

    Authors: Kent K. Chang, Mackenzie Cramer, Sandeep Soni, David Bamman

    Abstract: In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set…

    Submitted 20 October, 2023; v1 submitted 28 April, 2023; originally announced May 2023.

    Comments: EMNLP 2023 camera-ready (16 pages, 4 figures)

  6. arXiv:2212.09676

    cs.CL cs.DL

    Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications

    Authors: Li Lucy, Jesse Dodge, David Bamman, Katherine A. Keith

    Abstract: Scholarly text is often laden with jargon, or specialized language that can facilitate efficient in-group communication within fields but hinder understanding for out-groups. In this work, we develop and validate an interpretable approach for measuring scholarly jargon from text. Expanding the scope of prior work which focuses on word types, we use word sense induction to also identify words that…

    Submitted 22 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: 17 pages, 11 figures, to appear in Findings of the Association for Computational Linguistics 2023

  7. arXiv:2210.13628

    cs.CL cs.CY cs.SI

    Predicting Long-Term Citations from Short-Term Linguistic Influence

    Authors: Sandeep Soni, David Bamman, Jacob Eisenstein

    Abstract: A standard measure of the influence of a research paper is the number of times it is cited. However, papers may be cited for many reasons, and citation count offers limited information about the extent to which a paper affected the content of subsequent publications. We therefore propose a novel method to quantify linguistic influence in timestamped document collections. There are two main steps:…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 17 pages, 3 figures, to appear in the Findings of EMNLP 2022

  8. arXiv:2210.12170

    cs.CL cs.SI

    Discovering Differences in the Representation of People using Contextualized Semantic Axes

    Authors: Li Lucy, Divya Tadimeti, David Bamman

    Abstract: A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against "semantic axes" that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring repr…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: 10 pages, 6 figures, EMNLP 2022

  9. arXiv:2102.06820

    cs.CL cs.SI

    Characterizing English Variation across Social Media Communities with BERT

    Authors: Li Lucy, David Bamman

    Abstract: Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificit…

    Submitted 12 February, 2021; originally announced February 2021.

    Comments: 18 pages, 5 figures, accepted to TACL 2021, please cite that version

  10. arXiv:2009.10053

    cs.CL

    Latin BERT: A Contextual Language Model for Classical Philology

    Authors: David Bamman, Patrick J. Burns

    Abstract: We present Latin BERT, a contextual language model for the Latin language, trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century. In a series of case studies, we illustrate the affordances of this language-specific model both for work in natural language processing for Latin and in using computational methods for traditional scholarship: we show th…

    Submitted 21 September, 2020; originally announced September 2020.

  11. arXiv:2004.13980

    cs.CL cs.SI

    Measuring Information Propagation in Literary Social Networks

    Authors: Matthew Sims, David Bamman

    Abstract: We present the task of modeling information propagation in literature, in which we seek to identify pieces of information passing from character A to character B to character C, only given a description of their activity in text. We describe a new pipeline for measuring information propagation in this domain and publish a new dataset for speaker attribution, enabling the evaluation of an important…

    Submitted 6 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020 long paper

  12. arXiv:1912.06979

    cs.HC cs.CL cs.LG

    Breaking Speech Recognizers to Imagine Lyrics

    Authors: Jon Gillick, David Bamman

    Abstract: We introduce a new method for generating text, and in particular song lyrics, based on the speech-like acoustic qualities of a given audio file. We repurpose a vocal source separation algorithm and an acoustic model trained to recognize isolated speech, instead inputting instrumental music or environmental sounds. Feeding the "mistakes" of the vocal separator into the recognizer, we obtain a trans…

    Submitted 15 December, 2019; originally announced December 2019.

    Comments: 3 pages

    Journal ref: NeurIPS 2019 Workshop on Machine Learning for Creativity and Design

  13. arXiv:1912.01140

    cs.CL

    An Annotated Dataset of Coreference in English Literature

    Authors: David Bamman, Olivia Lewke, Anya Mansoor

    Abstract: We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction. This dataset differs from previous coreference datasets in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult cor…

    Submitted 15 May, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)

  14. arXiv:1905.06118

    cs.SD cs.LG cs.MM eess.AS stat.ML

    Learning to Groove with Inverse Sequence Transformations

    Authors: Jon Gillick, Adam Roberts, Jesse Engel, Douglas Eck, David Bamman

    Abstract: We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using Seq2Seq and recurrent Variational Information Bottleneck (VIB) models. Though Seq2Seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix (Isola et al., 2017) and Vid2V…

    Submitted 26 July, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: Blog post and links: https://g.co/magenta/groovae

    ACM Class: J.5; I.2

    Journal ref: Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2269-2279, 2019

  15. arXiv:1801.03406

    cs.IR

    DeepSeek: Content Based Image Search & Retrieval

    Authors: Tanya Piplani, David Bamman

    Abstract: Most of the internet today is composed of digital media that includes videos and images. With pixels becoming the currency in which most transactions happen on the internet, it is becoming increasingly important to have a way of browsing through this ocean of information with relative ease. YouTube has 400 hours of video uploaded every minute and many million images are browsed on Instagram, Faceb…

    Submitted 11 January, 2018; v1 submitted 9 January, 2018; originally announced January 2018.

    Comments: arXiv admin note: text overlap with arXiv:1706.06064 by other authors

  16. arXiv:1512.00728

    cs.CL

    Annotating Character Relationships in Literary Texts

    Authors: Philip Massey, Patrick Xia, David Bamman, Noah A. Smith

    Abstract: We present a dataset of manually annotated relationships between characters in literary texts, in order to support the training and evaluation of automatic methods for relation type prediction in this domain (Makazhanov et al., 2014; Kokkinakis, 2013) and the broader computational analysis of literary character (Elson et al., 2010; Bamman et al., 2014; Vala et al., 2015; Flekova and Gurevych, 2015…

    Submitted 2 December, 2015; originally announced December 2015.

  17. arXiv:1306.2091

    cs.CL

    A framework for (under)specifying dependency syntax without overloading annotators

    Authors: Nathan Schneider, Brendan O'Connor, Naomi Saphra, David Bamman, Manaal Faruqui, Noah A. Smith, Chris Dyer, Jason Baldridge

    Abstract: We introduce a framework for lightweight dependency syntax annotation. Our formalism builds upon the typical representation for unlabeled dependencies, permitting a simple notation and annotation workflow. Moreover, the formalism encourages annotators to underspecify parts of the syntax if doing so would streamline the annotation process. We demonstrate the efficacy of this annotation on three lan…

    Submitted 14 June, 2013; v1 submitted 9 June, 2013; originally announced June 2013.

    Comments: This is an expanded version of a paper appearing in Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria, August 8-9, 2013

  18. arXiv:1305.1319

    cs.CL

    New Alignment Methods for Discriminative Book Summarization

    Authors: David Bamman, Noah A. Smith

    Abstract: We consider the unsupervised alignment of the full text of a book with a human-written summary. This presents challenges not seen in other text alignment problems, including a disparity in length and, consequent to this, a violation of the expectation that individual words and phrases should align, since large passages and chapters can be distilled into a single summary phrase. We present two new…

    Submitted 6 May, 2013; originally announced May 2013.

    Comments: This paper reflects work in progress

  19. arXiv:1303.2873

    cs.CY cs.SI

    Inferring Social Rank in an Old Assyrian Trade Network

    Authors: David Bamman, Adam Anderson, Noah A. Smith

    Abstract: We present work in jointly inferring the unique individuals as well as their social rank within a collection of letters from an Old Assyrian trade colony in Kültepe, Turkey, settled by merchants from the ancient city of Assur for approximately 200 years between 1950-1750 BCE, the height of the Middle Bronze Age. Using a probabilistic latent-variable model, we leverage pairwise social differences b…

    Submitted 12 March, 2013; originally announced March 2013.

    Comments: Digital Humanities 2013 (Lincoln, Nebraska)

  20. Gender identity and lexical variation in social media

    Authors: David Bamman, Jacob Eisenstein, Tyler Schnoebelen

    Abstract: We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many clust…

    Submitted 12 May, 2014; v1 submitted 16 October, 2012; originally announced October 2012.

    Comments: submission version

    Journal ref: Journal of Sociolinguistics 18 (2014) 135-160