
Showing 1–50 of 76 results for author: Firat, O

Searching in archive cs.
  1. arXiv:2406.00179  [pdf, other]

    cs.CL cs.AI

    Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

    Authors: Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

    Abstract: We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, unde…

    Submitted 31 May, 2024; originally announced June 2024.

  2. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2402.17193  [pdf, other]

    cs.CL cs.LG

    When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

    Authors: Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat

    Abstract: While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: ICLR24

  4. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  5. arXiv:2312.06134  [pdf, other]

    cs.CL cs.LG

    Order Matters in the Presence of Dataset Imbalance for Multilingual Learning

    Authors: Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani

    Abstract: In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's be…

    Submitted 11 December, 2023; originally announced December 2023.

  6. arXiv:2309.04662  [pdf, other]

    cs.CL cs.LG

    MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

    Authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat

    Abstract: We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages usi…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: Preprint

  7. arXiv:2308.08998  [pdf, other]

    cs.CL cs.LG

    Reinforced Self-Training (ReST) for Language Modeling

    Authors: Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas

    Abstract: Reinforcement learning from human feedback (RLHF) can improve the quality of large language model's (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating sampl…

    Submitted 21 August, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: 23 pages, 16 figures

  8. arXiv:2308.07286  [pdf, other]

    cs.CL cs.LG

    The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat

    Abstract: Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by pro…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: 19 pages

  9. arXiv:2306.09539  [pdf, other]

    cs.CL cs.LG

    Block-State Transformers

    Authors: Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin

    Abstract: State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks, in vision and audio; however, SSMs still lag Transformer performance in Language Modeling tasks.…

    Submitted 30 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: NeurIPS'23 - Thirty-seventh Conference on Neural Information Processing Systems

  10. arXiv:2305.11778  [pdf, other]

    cs.CL cs.LG

    Cross-Lingual Supervision improves Large Language Models Pre-training

    Authors: Andrea Schioppa, Xavier Garcia, Orhan Firat

    Abstract: The recent rapid progress in pre-training Large Language Models has relied on using self-supervised language modeling objectives like next token prediction or span corruption. On the other hand, Machine Translation Systems are mostly trained using cross-lingual supervision that requires aligned data between source and target languages. We demonstrate that pre-training Large Language Models on a mi…

    Submitted 19 May, 2023; originally announced May 2023.

  11. arXiv:2305.10403  [pdf, other]

    cs.CL cs.AI

    PaLM 2 Technical Report

    Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego , et al. (103 additional authors not shown)

    Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on…

    Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  12. arXiv:2304.09151  [pdf, other]

    cs.CL

    UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

    Authors: Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, Orhan Firat

    Abstract: Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mit…

    Submitted 18 April, 2023; originally announced April 2023.

  13. arXiv:2303.15265  [pdf, other]

    cs.CL cs.AI cs.LG

    Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

    Authors: Alex Jones, Isaac Caswell, Ishank Saxena, Orhan Firat

    Abstract: Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including aspects of translation that for a human are the easiest - for instance, correctly translat…

    Submitted 27 March, 2023; originally announced March 2023.

    ACM Class: I.2.7

  14. arXiv:2302.09650  [pdf, other]

    cs.CL cs.LG

    Scaling Laws for Multilingual Neural Machine Translation

    Authors: Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

    Abstract: In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture only affect the…

    Submitted 19 February, 2023; originally announced February 2023.

    Comments: 19 pages, 20 figures

  15. arXiv:2302.04907  [pdf, other]

    cs.CL cs.LG

    Binarized Neural Machine Translation

    Authors: Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat

    Abstract: The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residu…

    Submitted 9 February, 2023; originally announced February 2023.

    Journal ref: Published at NeurIPS 2023

  16. arXiv:2302.01398  [pdf, other]

    cs.CL

    The unreasonable effectiveness of few-shot learning for machine translation

    Authors: Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, Orhan Firat

    Abstract: We demonstrate the potential of few-shot translation systems, trained with unpaired language data, for both high and low-resource language pairs. We show that with only 5 examples of high-quality translation data shown at inference, a transformer decoder-only model trained solely with self-supervised learning, is able to match specialized supervised state-of-the-art models as well as more general…

    Submitted 2 February, 2023; originally announced February 2023.

  17. arXiv:2301.10309  [pdf, other]

    cs.LG cs.AI cs.CL

    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction

    Authors: Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, Orhan Firat

    Abstract: Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may yield several translation options in another language without any extra context. Only one translation could be acceptable however, depending on the translator's preferences…

    Submitted 24 January, 2023; originally announced January 2023.

  18. arXiv:2210.00193  [pdf, other]

    cs.CL

    FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

    Authors: Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, Noah Constant

    Abstract: We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distr…

    Submitted 3 October, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

    Comments: Published in TACL Vol. 11 (2023)

  19. arXiv:2209.11379  [pdf, other]

    cs.LG cs.AI

    Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

    Authors: Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer

    Abstract: Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empiric…

    Submitted 22 September, 2022; originally announced September 2022.

  20. arXiv:2205.03983  [pdf, other]

    cs.CL cs.AI cs.LG

    Building Machine Translation Systems for the Next Thousand Languages

    Authors: Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes

    Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing…

    Submitted 6 July, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

    Comments: V2: updated with some details from 24-language Google Translate launch in May 2022 V3: spelling corrections, additional acknowledgements

  21. arXiv:2204.02311  [pdf, other]

    cs.CL

    PaLM: Scaling Language Modeling with Pathways

    Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin , et al. (42 additional authors not shown)

    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Tran…

    Submitted 5 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

  22. arXiv:2203.10752  [pdf, other]

    cs.CL

    XTREME-S: Evaluating Cross-lingual Speech Representations

    Authors: Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan Van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

    Abstract: We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as w…

    Submitted 13 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"

  23. arXiv:2203.07627  [pdf, other]

    cs.CL cs.AI

    Multilingual Mix: Example Interpolation Improves Multilingual Neural Machine Translation

    Authors: Yong Cheng, Ankur Bapna, Orhan Firat, Yuan Cao, Pidong Wang, Wolfgang Macherey

    Abstract: Multilingual neural machine translation models are trained to maximize the likelihood of a mix of examples drawn from multiple language pairs. The dominant inductive bias applied to these models is a shared vocabulary and a shared set of parameters across languages; the inputs and labels corresponding to examples drawn from different language pairs might still reside in distinct sub-spaces. In thi…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  24. arXiv:2202.11822  [pdf, other]

    cs.CL

    Using natural language prompts for machine translation

    Authors: Xavier Garcia, Orhan Firat

    Abstract: We explore the use of natural language prompts for controlling various aspects of the outputs generated by machine translation models. We demonstrate that natural language prompts allow us to influence properties like formality or specific dialect of the output. We show that using language names to control the output language of multilingual translation models enables positive transfer for unseen…

    Submitted 23 February, 2022; originally announced February 2022.

  25. arXiv:2202.01994  [pdf, other]

    cs.LG cs.CL

    Data Scaling Laws in NMT: The Effect of Noise and Architecture

    Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

    Abstract: In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand…

    Submitted 4 February, 2022; originally announced February 2022.

  26. arXiv:2202.00528  [pdf, other]

    cs.CL cs.LG

    Examining Scaling and Transfer of Language Model Architectures for Machine Translation

    Authors: Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

    Abstract: Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigati…

    Submitted 16 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

  27. arXiv:2201.03110  [pdf, other]

    cs.CL cs.LG

    Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning

    Authors: Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, Xavier Garcia

    Abstract: Achieving universal translation between all human language pairs is the holy-grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-Eng…

    Submitted 13 January, 2022; v1 submitted 9 January, 2022; originally announced January 2022.

  28. arXiv:2112.06905  [pdf, other]

    cs.CL

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

    Authors: Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu , et al. (2 additional authors not shown)

    Abstract: Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GL…

    Submitted 1 August, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

    Comments: Accepted to ICML 2022

  29. arXiv:2110.04369  [pdf, other]

    cs.LG cs.AI

    A Loss Curvature Perspective on Training Instability in Deep Learning

    Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat

    Abstract: In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 20 pages, 16 figures

  30. arXiv:2110.03742  [pdf, other]

    cs.CL cs.LG

    Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

    Authors: Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, Orhan Firat

    Abstract: Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence,…

    Submitted 24 September, 2021; originally announced October 2021.

    Comments: EMNLP Findings 2021

  31. arXiv:2109.10341  [pdf, other]

    cs.CL cs.LG

    Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents

    Authors: Biao Zhang, Ankur Bapna, Melvin Johnson, Ali Dabirmoghaddam, Naveen Arivazhagan, Orhan Firat

    Abstract: Document-level neural machine translation (DocNMT) achieves coherent translations by incorporating cross-sentence context. However, for most language pairs there's a shortage of parallel documents, although parallel sentences are readily available. In this paper, we study whether and how contextual modeling in DocNMT is transferable via multilingual modeling. We focus on the scenario of zero-shot…

    Submitted 17 May, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

    Comments: ACL2022

  32. arXiv:2109.09193  [pdf, other]

    cs.CL cs.LG

    Towards Zero-Label Language Learning

    Authors: Zirui Wang, Adams Wei Yu, Orhan Firat, Yuan Cao

    Abstract: This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. At the core of our framework is a novel approach for better leveraging the powerful pretrained language models. Specifically, inspired by the recent success of few-shot inference on GPT-3, we present a traini…

    Submitted 19 September, 2021; originally announced September 2021.

  33. arXiv:2109.07740  [pdf, other]

    cs.LG cs.AI cs.CL

    Scaling Laws for Neural Machine Translation

    Authors: Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry

    Abstract: We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accu…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 31 pages, 23 figures

  34. arXiv:2109.06262  [pdf, other]

    cs.CL

    Evaluating Multiway Multilingual NMT in the Turkic Languages

    Authors: Jamshidbek Mirzakhalov, Anoop Babu, Aigiz Kunafin, Ahsan Wahab, Behzod Moydinboyev, Sardana Ivanova, Mokhiyakhon Uzokova, Shaxnoza Pulatova, Duygu Ataman, Julia Kreutzer, Francis Tyers, Orhan Firat, John Licato, Sriram Chellappan

    Abstract: Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: 9 pages, 3 figures, 7 tables. To be presented at WMT 2021

  35. arXiv:2109.04593  [pdf, other]

    cs.CL cs.LG

    A Large-Scale Study of Machine Translation in the Turkic Languages

    Authors: Jamshidbek Mirzakhalov, Anoop Babu, Duygu Ataman, Sherzod Kariev, Francis Tyers, Otabek Abduraufov, Mammad Hajili, Sardana Ivanova, Abror Khaytbaev, Antonio Laverghetta Jr., Behzodbek Moydinboyev, Esra Onal, Shaxnoza Pulatova, Ahsan Wahab, Orhan Firat, Sriram Chellappan

    Abstract: Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: 9 pages, 1 figure, 8 tables. Main proceedings of EMNLP 2021

  36. arXiv:2107.14749  [pdf, other]

    cs.CL

    Towards Universality in Multilingual Text Rewriting

    Authors: Xavier Garcia, Noah Constant, Mandy Guo, Orhan Firat

    Abstract: In this work, we take the first steps towards building a universal rewriter: a model capable of rewriting text in any language to exhibit a wide variety of attributes, including styles and languages, while preserving as much of the original semantics as possible. In addition to obtaining state-of-the-art results on unsupervised translation, we also demonstrate the ability to do zero-shot sentiment…

    Submitted 30 July, 2021; originally announced July 2021.

  37. arXiv:2104.07412  [pdf, other]

    cs.CL cs.AI

    XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation

    Authors: Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, Melvin Johnson

    Abstract: Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This…

    Submitted 7 October, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021 camera-ready

  38. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller , et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  39. arXiv:2103.06799  [pdf, other]

    cs.CL

    Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution

    Authors: Xavier Garcia, Noah Constant, Ankur P. Parikh, Orhan Firat

    Abstract: We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models, paving the way towards efficient continual learning for multilingual machine translation. Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts, incurs only minor degradation on the translation performance for the origin…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: Accepted at NAACL 2021

  40. arXiv:2102.13549  [pdf, ps, other]

    cs.CL

    Gradient-guided Loss Masking for Neural Machine Translation

    Authors: Xinyi Wang, Ankur Bapna, Melvin Johnson, Orhan Firat

    Abstract: To mitigate the negative effect of low quality training data on the performance of neural machine translation models, most existing strategies focus on filtering out harmful data before training starts. In this paper, we explore strategies that dynamically optimize data usage during the training process using the model's gradients on a small set of clean data. At each training step, our algorithm…

    Submitted 26 February, 2021; originally announced February 2021.

  41. arXiv:2010.12652  [pdf, other]

    cs.CL

    Rapid Domain Adaptation for Machine Translation with Monolingual Data

    Authors: Mahdis Mahdieh, Mia Xu Chen, Yuan Cao, Orhan Firat

    Abstract: One challenge of machine translation is how to quickly adapt to unseen domains in face of surging events like COVID-19, in which case timely and accurate translation of in-domain information into multiple languages is critical but little parallel data is available yet. In this paper, we propose an approach that enables rapid domain adaptation from the perspective of unsupervised translation. Our p…

    Submitted 23 October, 2020; originally announced October 2020.

  42. arXiv:2010.10648  [pdf, other]

    cs.CL cs.CV cs.LG

    Towards End-to-End In-Image Neural Machine Translation

    Authors: Elman Mansimov, Mitchell Stern, Mia Chen, Orhan Firat, Jakob Uszkoreit, Puneet Jain

    Abstract: In this paper, we offer a preliminary investigation into the task of in-image machine translation: transforming an image containing text in one language into an image containing the same text in another language. We propose an end-to-end neural model for this task inspired by recent approaches to neural machine translation, and demonstrate promising initial results based purely on pixel-level supe…

    Submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted as an oral presentation at EMNLP, NLP Beyond Text workshop, 2020

  43. arXiv:2010.10239  [pdf, other]

    cs.CL cs.LG

    Complete Multilingual Neural Machine Translation

    Authors: Markus Freitag, Orhan Firat

    Abstract: Multilingual Neural Machine Translation (MNMT) models are commonly trained on a joint set of bilingual corpora which is acutely English-centric (i.e. English either as the source or target language). While direct data between two languages that are non-English is explicitly available at times, its use is not common. In this paper, we first take a step back and look at the commonly used bilingual c…

    Submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted at WMT 2020

  44. arXiv:2010.07972  [pdf, other]

    cs.CL cs.AI

    Explicit Alignment Objectives for Multilingual Bidirectional Encoders

    Authors: Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Siddhant, Graham Neubig

    Abstract: Pre-trained cross-lingual encoders such as mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020) have proven to be impressively effective at enabling transfer-learning of NLP systems from high-resource languages to low-resource languages. This success comes despite the fact that there is no explicit objective to align the contextual embeddings of words/sentences with similar meanings across…

    Submitted 11 April, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

    Comments: published at NAACL 2021

  45. arXiv:2010.05874  [pdf, other]

    cs.CL cs.LG

    Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

    Authors: Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao

    Abstract: Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this pape…

    Submitted 12 October, 2020; originally announced October 2020.

  46. arXiv:2009.11201  [pdf, other]

    cs.CL

    Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages

    Authors: Xavier Garcia, Aditya Siddhant, Orhan Firat, Ankur P. Parikh

    Abstract: Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems p…

    Submitted 12 March, 2021; v1 submitted 23 September, 2020; originally announced September 2020.

    Comments: Accepted to NAACL 2021

  47. arXiv:2006.16668  [pdf, other]

    cs.CL cs.LG stat.ML

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Authors: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

    Abstract: Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices.…

    Submitted 30 June, 2020; originally announced June 2020.

  48. arXiv:2005.04816  [pdf, other]

    cs.CL cs.LG

    Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

    Authors: Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, Yonghui Wu

    Abstract: Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervis…

    Submitted 10 May, 2020; originally announced May 2020.

    Journal ref: ACL 2020

  49. arXiv:2003.11080  [pdf, other]

    cs.CL cs.LG

    XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

    Authors: Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

    Abstract: Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tas…

    Submitted 4 September, 2020; v1 submitted 24 March, 2020; originally announced March 2020.

    Comments: In Proceedings of the 37th International Conference on Machine Learning (ICML). July 2020

  50. arXiv:2002.07233  [pdf, other]

    cs.LG stat.ML

    On the Discrepancy between Density Estimation and Sequence Generation

    Authors: Jason Lee, Dustin Tran, Orhan Firat, Kyunghyun Cho

    Abstract: Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y|x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the…

    Submitted 17 February, 2020; originally announced February 2020.