Is Child-Directed Speech Effective Training Data for Language Models?

Authors: Steven Y. Feng, Noah D. Goodman, Michael C. Frank

Abstract: While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children's training data support high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children's learning is instead substantially more efficient than current language modeling techniques.

Submitted 7 August, 2024; originally announced August 2024.

Comments: Preprint. Code and data will be released soon

arXiv:2406.10447 [pdf, other]

The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

Authors: Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman, Daniel L. K. Yamins, Michael C. Frank

Abstract: Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience -- their "training data" -- is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date -- the BabyView dataset -- recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493 hour dataset includes egocentric videos from children spanning 6 months - 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human-levels of success on the same scale and distribution of training data as humans?

Submitted 14 June, 2024; originally announced June 2024.

Comments: 9 pages, 2 figures, 4 tables and SI. Submitted to NeurIPS Datasets and Benchmarks

arXiv:2210.04191 [pdf, other]

CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models

Authors: Steven Y. Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, Eduard Hovy

Abstract: We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.

Submitted 12 February, 2023; v1 submitted 9 October, 2022; originally announced October 2022.

Comments: EACL 2023. Code available at https://github.com/styfeng/CHARD

arXiv:2209.08950 [pdf, other]

doi 10.1364/JOSAA.474837

Using fluorescent beads to emulate single fluorophores

Authors: Luis A. Aleman-Castaneda, Sherry Yi-Ting Feng, Rodrigo Gutierrez-Cuevas, Isael Herrera, Thomas G. Brown, Sophie Brasselet, Miguel A. Alonso

Abstract: In this work, we study the conditions under which fluorescent beads can be used to emulate single fluorescent molecules in the calibration of optical microscopes. Although beads are widely used due to their brightness and easy manipulation, there can be notable differences between the point spread functions (PSFs) they produce and those for single-molecule fluorophores, caused by their different emission pattern and their size. We study theoretically these differences for various scenarios, e.g. with or without polarization channel splitting, to determine the conditions under which the use of beads as a model for single molecules is valid. We also propose methods to model the blurring due to the size difference and compensate for it to produce PSFs that are more similar to those for single molecules.

Submitted 6 December, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Journal ref: J. Opt. Soc. Am. A 39, C167-C178 (2022)

arXiv:2209.07752 [pdf, other]

PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation

Authors: Sedrick Scott Keh, Kevin Lu, Varun Gangal, Steven Y. Feng, Harsh Jhamtani, Malihe Alikhani, Eduard Hovy

Abstract: A personification is a figure of speech that endows inanimate entities with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation. To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative personifications that enhance the overall appeal of a sentence.

Submitted 16 September, 2022; originally announced September 2022.

Comments: Accepted to COLING 2022; official Github repo at https://github.com/sedrickkeh/PINEAPPLE

arXiv:2209.06275 [pdf, other]

PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically

Authors: Sedrick Scott Keh, Steven Y. Feng, Varun Gangal, Malihe Alikhani, Eduard Hovy

Abstract: Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.

Submitted 14 February, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

Comments: EACL 2023. Code at https://github.com/sedrickkeh/PANCETTA

arXiv:2109.03892 [pdf, other]

Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models

Authors: Steven Y. Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, Varun Gangal

Abstract: We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.

Submitted 25 March, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: Accepted to AAAI 2022. Code at https://github.com/styfeng/VisCTG

arXiv:2108.06643 [pdf, other]

SAPPHIRE: Approaches for Enhanced Concept-to-Text Generation

Authors: Steven Y. Feng, Jessica Huynh, Chaitanya Narisetty, Eduard Hovy, Varun Gangal

Abstract: We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.

Submitted 1 December, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

Comments: INLG 2021 [Best Long Paper]. Code available at https://github.com/styfeng/SAPPHIRE

arXiv:2105.03075 [pdf, other]

A Survey of Data Augmentation Approaches for NLP

Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Abstract: Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP

Submitted 1 December, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

Comments: Accepted to ACL 2021 Findings. GitHub repo with paper list at https://github.com/styfeng/DataAug4NLP ; Talk at https://www.youtube.com/watch?v=kNBVesKUZCk&ab_channel=StevenFeng ; Podcast at https://www.youtube.com/watch?v=qmqyT_97Poc&ab_channel=GradientFlow and https://thedataexchange.media/data-augmentation-in-natural-language-processing

arXiv:2104.06669 [pdf, other]

NAREOR: The Narrative Reordering Problem

Authors: Varun Gangal, Steven Y. Feng, Malihe Alikhani, Teruko Mitamura, Eduard Hovy

Abstract: Many implicit inferences exist in text depending on how it is structured that can critically impact the text's interpretation and meaning. One such structural aspect present in text with chronology is the order of its presentation. For narratives or stories, this is known as the narrative order. Reordering a narrative can impact the temporal, causal, event-based, and other inferences readers draw from it, which in turn can have strong effects both on its interpretation and interestingness. In this paper, we propose and investigate the task of Narrative Reordering (NAREOR) which involves rewriting a given story in a different narrative order while preserving its plot. We present a dataset, NAREORC, with human rewritings of stories within ROCStories in non-linear orders, and conduct a detailed analysis of it. Further, we propose novel task-specific training methods with suitable evaluation metrics. We perform experiments on NAREORC using state-of-the-art models such as BART and T5 and conduct extensive automatic and human evaluations. We demonstrate that although our models can perform decently, NAREOR is a challenging task with potential for further exploration. We also investigate two applications of NAREOR: generation of more interesting variations of stories and serving as adversarial sets for temporal/event-related tasks, besides discussing other prospective ones, such as for pedagogical setups related to language skills like essay writing and applications to medicine involving clinical narratives.

Submitted 27 March, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: Accepted to AAAI 2022; Code at https://github.com/vgtomahawk/NAREORCamReady

arXiv:2010.01794 [pdf, other]

GenAug: Data Augmentation for Finetuning Text Generators

Authors: Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, Eduard Hovy

Abstract: In this paper, we investigate data augmentation for text generation, which we call GenAug. Text generation and language modeling are important tasks within natural language processing, and are especially challenging for low-data regimes. We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews. We also examine the relationship between the amount of augmentation and the quality of the generated text. We utilize several metrics that evaluate important aspects of the generated text including its diversity and fluency. Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods, and that the quality of generations improves to a peak at approximately three times the amount of original data.

Submitted 10 October, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: EMNLP 2020 Deep Learning Inside Out (DeeLIO) Workshop; Code available at https://github.com/styfeng/GenAug

arXiv:1910.08293 [pdf, other]

doi 10.1609/aaai.v34i05.6328

ALOHA: Artificial Learning of Human Attributes for Dialogue Agents

Authors: Aaron W. Li, Veronica Jiang, Steven Y. Feng, Julia Sprague, Wei Zhou, Jesse Hoey

Abstract: For conversational AI and virtual assistants to communicate with humans in a realistic way, they must exhibit human characteristics such as expression of emotion and personality. Current attempts toward constructing human-like dialogue agents have presented significant difficulties. We propose Human Level Attributes (HLAs) based on tropes as the basis of a method for learning dialogue agents that can imitate the personalities of fictional characters. Tropes are characteristics of fictional personalities that are observed recurrently and determined by viewers' impressions. By combining detailed HLA data with dialogue data for specific characters, we present a dataset, HLA-Chat, that models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs. We then introduce a three-component system, ALOHA (which stands for Artificial Learning of Human Attributes), that combines character space mapping, character community detection, and language style retrieval to build a character (or personality) specific language model. Our preliminary experiments demonstrate that two variations of ALOHA, combined with our proposed dataset, can outperform baseline models at identifying the correct dialogue responses of chosen target characters, and are stable regardless of the character's identity, the genre of the show, and the context of the dialogue.

Submitted 1 December, 2021; v1 submitted 18 October, 2019; originally announced October 2019.

Comments: AAAI 2020. Code available at https://github.com/newpro/aloha-chatbot ; Talk at https://www.youtube.com/watch?v=TtomrolC4Dc&ab_channel=StevenFeng

arXiv:1909.00088 [pdf, other]

doi 10.18653/v1/D19-1272

Keep Calm and Switch On! Preserving Sentiment and Fluency in Semantic Text Exchange

Authors: Steven Y. Feng, Aaron W. Li, Jesse Hoey

Abstract: In this paper, we present a novel method for measurably adjusting the semantics of text while preserving its sentiment and fluency, a task we call semantic text exchange. This is useful for text data augmentation and the semantic correction of text generated by chatbots and virtual assistants. We introduce a pipeline called SMERTI that combines entity replacement, similarity masking, and text infilling. We measure our pipeline's success by its Semantic Text Exchange Score (STES): the ability to preserve the original text's sentiment and fluency while adjusting semantic content. We propose to use masking (replacement) rate threshold as an adjustable parameter to control the amount of semantic change in the text. Our experiments demonstrate that SMERTI can outperform baseline models on Yelp reviews, Amazon reviews, and news headlines.

Submitted 21 September, 2020; v1 submitted 30 August, 2019; originally announced September 2019.

Comments: EMNLP-IJCNLP 2019; Code available at https://github.com/styfeng/SMERTI

arXiv:1011.3103 [pdf, ps, other]

doi 10.1088/1674-4527/11/9/004

Multiband Fitting to Three Long GRBs with Fermi/LAT Data: Structured Ejecta Sweeping up a Density-Jump Medium

Authors: S. Y. Feng, Z. G. Dai

Abstract: We present broadband (radio, optical, X-ray and GeV) fits to the afterglow light curves and spectra of three long-duration gamma-ray bursts (GRBs 080916C, 090902B, and 090926A) detected by the Gamma-Ray Burst Monitor (GBM) and Large Area Telescope (LAT) instruments on the Fermi satellite. Using the observed broadband data, we study the origin of the high energy emission, and suggest that the early-time GeV emission and the late-time radio, optical, and X-ray afterglows can be understood as being due to synchrotron emission from an external forward shock caused by structured ejecta propagating in a wind bubble jumping to a homogeneous density medium. If the ceasing time for the majority of the energy injection is assumed to be close to the deceleration time of the forward shock, the structured ejecta with continuous energy injection to the forward shock can well explain the early rising feature of the GeV emission from these bursts, and the density-jump medium can account for certain plateaus or flares in the late afterglows. From our fits, we find that, on one hand, the external shock origin of the GeV photons means that the optical depth does not contribute significantly to the early LAT rising part, which loosens the strong constraint on the lower limits of the Lorentz factor. On the other hand, these Fermi-LAT events preferentially occur in a low-density circumburst environment, in which case the Klein-Nishina cutoff will significantly suppress the Self-Synchrotron Compton (SSC) radiation. Such an environment might result from superbubbles or low-metallicity progenitor stars (which have a low mass-loss rate at late times of stellar evolution) of type Ib/c supernovae.

Submitted 4 June, 2011; v1 submitted 13 November, 2010; originally announced November 2010.

Comments: 32 pages, 4 figures, 2 tables; some minor typos corrected, optical depth does not have significant contribution to the result, major conclusions unchanged

Journal ref: Research in Astron. Astrophys. 2011, Vol. 11 No. 9

Showing 1–14 of 14 results for author: Feng, S Y