
Showing 1–19 of 19 results for author: Albalak, A

Searching in archive cs.
  1. arXiv:2407.14985  [pdf, other]

    cs.CL cs.AI cs.LG

    Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

    Authors: Antonis Antoniades, Xinyi Wang, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

    Abstract: Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training da…

    Submitted 20 July, 2024; originally announced July 2024.

    Comments: ICML FM-Wild workshop version

  2. arXiv:2407.09693  [pdf, other]

    cs.LG cs.AI

    A Mathematical Framework, a Taxonomy of Modeling Paradigms, and a Suite of Learning Techniques for Neural-Symbolic Systems

    Authors: Charles Dickens, Connor Pryor, Changyu Gao, Alon Albalak, Eriq Augustine, William Wang, Stephen Wright, Lise Getoor

    Abstract: The field of Neural-Symbolic (NeSy) systems is growing rapidly. Proposed approaches show great promise in achieving symbiotic unions of neural and symbolic methods. However, each NeSy system differs in fundamental ways. There is a pressing need for a unifying theory to illuminate the commonalities and differences in approaches and enable further progress. In this paper, we introduce Neural-Symboli…

    Submitted 12 July, 2024; originally announced July 2024.

  3. arXiv:2406.16746  [pdf, other]

    cs.LG cs.AI cs.CL

    The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

    Authors: Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini

    Abstract: Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation,…

    Submitted 3 September, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  4. arXiv:2406.11794  [pdf, other]

    cs.LG cs.CL

    DataComp-LM: In search of the next generation of training sets for language models

    Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, et al. (34 additional authors not shown)

    Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat…

    Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.datacomp.ai/dclm/

  5. arXiv:2404.05892  [pdf, other]

    cs.CL cs.AI

    Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

    Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV, Jan Kocoń, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanisław Woźniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, et al. (3 additional authors not shown)

    Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokeni…

    Submitted 24 September, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  6. arXiv:2402.16827  [pdf, other]

    cs.CL cs.LG

    A Survey on Data Selection for Language Models

    Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

    Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the am…

    Submitted 2 August, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Paper list available at https://github.com/alon-albalak/data-selection-survey

  7. arXiv:2312.02406  [pdf, other]

    cs.CL cs.LG

    Efficient Online Data Mixing For Language Model Pre-Training

    Authors: Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang

    Abstract: The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and…

    Submitted 8 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

  8. arXiv:2305.13048  [pdf, other]

    cs.CL cs.AI

    RWKV: Reinventing RNNs for the Transformer Era

    Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, et al. (9 additional authors not shown)

    Abstract: Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scala…

    Submitted 10 December, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  9. arXiv:2305.12295  [pdf, other]

    cs.CL cs.AI

    Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning

    Authors: Liangming Pan, Alon Albalak, Xinyi Wang, William Yang Wang

    Abstract: Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems. This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving. Our method first utilizes LLMs to translate a natural language problem into a symbolic formulation. Afterward, a deterministic symbolic solver perfo…

    Submitted 18 October, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 (Findings, long paper)

  10. arXiv:2302.00674  [pdf, other]

    cs.LG cs.CL

    Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data

    Authors: Alon Albalak, Colin Raffel, William Yang Wang

    Abstract: Few-shot learning is valuable in many real-world applications, but learning a generalizable model without overfitting to the few labeled datapoints is challenging. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Previous works have proposed automated meth…

    Submitted 3 October, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: NeurIPS 2023, 25 pages, 8 figures, code available at https://github.com/alon-albalak/FLAD

  11. arXiv:2212.10515  [pdf, other]

    cs.CL

    CausalDialogue: Modeling Utterance-level Causality in Conversations

    Authors: Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang

    Abstract: Despite their widespread adoption, neural conversation models have yet to exhibit natural chat capabilities with humans. In this research, we examine user utterances as causes and generated responses as effects, recognizing that changes in a cause should produce a different effect. To further explore this concept, we have compiled and expanded upon a new dataset called CausalDialogue through crowd…

    Submitted 8 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL-Findings 2023

  12. arXiv:2210.11729  [pdf, other]

    cs.CL

    An Exploration of Data Efficiency in Intra-Dataset Task Transfer for Dialog Understanding

    Authors: Josiah Ross, Luke Yoffe, Alon Albalak, William Yang Wang

    Abstract: Transfer learning is an exciting area of Natural Language Processing that has the potential to both improve model performance and increase data efficiency. This study explores the effects of varying quantities of target task training data on sequential transfer learning in the dialog domain. We hypothesize that a model can utilize the information learned from a source task to better learn a target…

    Submitted 21 October, 2022; originally announced October 2022.

  13. arXiv:2210.03871  [pdf, other]

    cs.CL

    Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models

    Authors: Alon Albalak, Akshat Shrivastava, Chinnadhurai Sankar, Adithya Sagar, Mike Ross

    Abstract: Multi-task learning (MTL), instruction tuning, and prompting have recently been shown to improve the generalizability of large language models to new tasks. However, the benefits of such methods are less well-documented in smaller language models, with some studies finding contradictory results. In this work, we explore and isolate the effects of (i) model size, (ii) general purpose MTL, (iii) in-…

    Submitted 7 October, 2022; originally announced October 2022.

  14. arXiv:2207.07238  [pdf, other]

    cs.LG cs.CL

    Emotion Recognition in Conversation using Probabilistic Soft Logic

    Authors: Eriq Augustine, Pegah Jandaghi, Alon Albalak, Connor Pryor, Charles Dickens, William Wang, Lise Getoor

    Abstract: Creating agents that can both appropriately respond to conversations and understand complex human linguistic tendencies and social cues has been a long standing challenge in the NLP community. A recent pillar of research revolves around emotion recognition in conversation (ERC); a sub-field of emotion recognition that focuses on conversations or dialogues that contain two or more utterances. In th…

    Submitted 14 July, 2022; originally announced July 2022.

  15. arXiv:2205.14268  [pdf, other]

    cs.LG

    NeuPSL: Neural Probabilistic Soft Logic

    Authors: Connor Pryor, Charles Dickens, Eriq Augustine, Alon Albalak, William Wang, Lise Getoor

    Abstract: In this paper, we introduce Neural Probabilistic Soft Logic (NeuPSL), a novel neuro-symbolic (NeSy) framework that unites state-of-the-art symbolic reasoning with the low-level perception of deep neural networks. To model the boundary between neural and symbolic representations, we propose a family of energy-based models, NeSy Energy-Based Models, and show that they are general enough to include N…

    Submitted 23 May, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

  16. arXiv:2205.06262  [pdf, other]

    cs.CL

    FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue

    Authors: Alon Albalak, Yi-Lin Tuan, Pegah Jandaghi, Connor Pryor, Luke Yoffe, Deepak Ramachandran, Lise Getoor, Jay Pujara, William Yang Wang

    Abstract: Task transfer, transferring knowledge contained in related tasks, holds the promise of reducing the quantity of labeled data required to fine-tune language models. Dialogue understanding encompasses many diverse tasks, yet task transfer has not been thoroughly studied in conversational AI. This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer…

    Submitted 13 October, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022. benchmark available at https://alon-albalak.github.io/feta-website

  17. arXiv:2201.11153  [pdf, other]

    cs.CL cs.IR

    Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains

    Authors: Alon Albalak, Sharon Levy, William Yang Wang

    Abstract: Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstr…

    Submitted 26 January, 2022; originally announced January 2022.

    Comments: 6 pages, 8 figures

  18. arXiv:2109.05126  [pdf, other]

    cs.CL

    D-REX: Dialogue Relation Extraction with Explanations

    Authors: Alon Albalak, Varun Embar, Yi-Lin Tuan, Lise Getoor, William Yang Wang

    Abstract: Existing research studies on cross-sentence relation extraction in long-form multi-party conversations aim to improve relation extraction without considering the explainability of such methods. This work addresses that gap by focusing on extracting explanations that indicate that a relation exists while using only partially labeled data. We propose our model-agnostic framework, D-REX, a policy-gui…

    Submitted 18 October, 2022; v1 submitted 10 September, 2021; originally announced September 2021.

    Comments: NLP4CONVAI, code at https://github.com/alon-albalak/D-REX

  19. Modeling Disclosive Transparency in NLP Application Descriptions

    Authors: Michael Saxon, Sharon Levy, Xinyi Wang, Alon Albalak, William Yang Wang

    Abstract: Broader disclosive transparency$-$truth and clarity in communication regarding the function of AI systems$-$is widely considered desirable. Unfortunately, it is a nebulous concept, difficult to both define and quantify. This is problematic, as previous work has demonstrated possible trade-offs and negative consequences to disclosive transparency, such as a confusion effect, where "too much informa…

    Submitted 10 September, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: To appear at EMNLP 2021. 15 pages, 10 figures, 7 tables

    Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp 2023-2037