-
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Authors:
Neha Sengupta,
Sunil Kumar Sahu,
Bokang Jia,
Satheesh Katipomu,
Haonan Li,
Fajri Koto,
William Marshall,
Gurpreet Gosal,
Cynthia Liu,
Zhiming Chen,
Osama Mohammed Afzal,
Samta Kamboj,
Onkar Pandit,
Rahul Pal,
Lalit Pradhan,
Zain Muhammad Mujahid,
Massa Baali,
Xudong Han,
Sondos Mahmoud Bsharat,
Alham Fikri Aji,
Zhiqiang Shen,
Zhengzhong Liu,
Natalia Vassilieva,
Joel Hestness,
Andy Hock
, et al. (7 additional authors not shown)
Abstract:
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning…
▽ More
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat
△ Less
Submitted 29 September, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
Authors:
Nolan Dey,
Gurpreet Gosal,
Zhiming,
Chen,
Hemant Khachane,
William Marshall,
Ribhu Pathria,
Marvin Tom,
Joel Hestness
Abstract:
We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-…
▽ More
We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly-available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings including how Maximal Update Parameterization ($μ$P) can further improve large model scaling, improving accuracy and hyperparameter predictability at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
Authors:
Vithursan Thangarasa,
Abhay Gupta,
William Marshall,
Tianda Li,
Kevin Leong,
Dennis DeCoste,
Sean Lie,
Shreyas Saxena
Abstract:
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Sca…
▽ More
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also lead to highly prohibitive computational costs. Pre-training LLMs often require orders of magnitude more FLOPs than fine-tuning and the model capacity often remains the same between the two phases. To achieve training efficiency w.r.t training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
△ Less
Submitted 29 July, 2023; v1 submitted 18 March, 2023;
originally announced March 2023.
-
Windowed Prophet Inequalities
Authors:
William Marshall,
Nolan Miranda,
Albert Zuo
Abstract:
The prophet inequalities problem has received significant study over the past decades and has several applications such as to online auctions. In this paper, we study two variants of the i.i.d. prophet inequalities problem, namely the windowed prophet inequalities problem and the batched prophet inequalities problem. For the windowed prophet inequalities problem, we show that for window size…
▽ More
The prophet inequalities problem has received significant study over the past decades and has several applications such as to online auctions. In this paper, we study two variants of the i.i.d. prophet inequalities problem, namely the windowed prophet inequalities problem and the batched prophet inequalities problem. For the windowed prophet inequalities problem, we show that for window size $o(n)$, the optimal competitive ratio is $α\approx 0.745$, the same as in the non-windowed case. In the case where the window size is $n/k$ for some constant $k$, we show that $α_k < WIN_{n/k} \le α_k + o_k(1)$ where $WIN_{n/k}$ is the optimal competitive ratio for the window size $n/k$ prophet inequalities problem and $α_k$ is the optimal competitive ratio for the $k$ sample i.i.d. prophet inequalities problem. Finally, we prove an equivalence between the batched prophet inequalities problem and the i.i.d. prophet inequalities problem.
△ Less
Submitted 27 November, 2020;
originally announced November 2020.
-
PyPhi: A toolbox for integrated information theory
Authors:
William G. P. Mayner,
William Marshall,
Larissa Albantakis,
Graham Findlay,
Robert Marchman,
Giulio Tononi
Abstract:
Integrated information theory provides a mathematical framework to fully characterize the cause-effect structure of a physical system. Here, we introduce PyPhi, a Python software package that implements this framework for causal analysis and unfolds the full cause-effect structure of discrete dynamical systems of binary elements. The software allows users to easily study these structures, serves a…
▽ More
Integrated information theory provides a mathematical framework to fully characterize the cause-effect structure of a physical system. Here, we introduce PyPhi, a Python software package that implements this framework for causal analysis and unfolds the full cause-effect structure of discrete dynamical systems of binary elements. The software allows users to easily study these structures, serves as an up-to-date reference implementation of the formalisms of integrated information theory, and has been applied in research on complexity, emergence, and certain biological questions. We first provide an overview of the main algorithm and demonstrate PyPhi's functionality in the course of analyzing an example system, and then describe details of the algorithm's design and implementation.
PyPhi can be installed with Python's package manager via the command 'pip install pyphi' on Linux and macOS systems equipped with Python 3.4 or higher. PyPhi is open-source and licensed under the GPLv3; the source code is hosted on GitHub at https://github.com/wmayner/pyphi . Comprehensive and continually-updated documentation is available at https://pyphi.readthedocs.io/ . The pyphi-users mailing list can be joined at https://groups.google.com/forum/#!forum/pyphi-users . A web-based graphical interface to the software is available at http://integratedinformationtheory.org/calculate.html .
△ Less
Submitted 27 June, 2018; v1 submitted 27 December, 2017;
originally announced December 2017.
-
What caused what? A quantitative account of actual causation using dynamical causal networks
Authors:
Larissa Albantakis,
William Marshall,
Erik Hoel,
Giulio Tononi
Abstract:
Actual causation is concerned with the question "what caused what?" Consider a transition between two states within a system of interacting elements, such as an artificial neural network, or a biological brain circuit. Which combination of synapses caused the neuron to fire? Which image features caused the classifier to misinterpret the picture? Even detailed knowledge of the system's causal netwo…
▽ More
Actual causation is concerned with the question "what caused what?" Consider a transition between two states within a system of interacting elements, such as an artificial neural network, or a biological brain circuit. Which combination of synapses caused the neuron to fire? Which image features caused the classifier to misinterpret the picture? Even detailed knowledge of the system's causal network, its elements, their states, connectivity, and dynamics does not automatically provide a straightforward answer to the "what caused what?" question. Counterfactual accounts of actual causation based on graphical models, paired with system interventions, have demonstrated initial success in addressing specific problem cases in line with intuitive causal judgments. Here, we start from a set of basic requirements for causation (realization, composition, information, integration, and exclusion) and develop a rigorous, quantitative account of actual causation that is generally applicable to discrete dynamical systems. We present a formal framework to evaluate these causal requirements that is based on system interventions and partitions, and considers all counterfactuals of a state transition. This framework is used to provide a complete causal account of the transition by identifying and quantifying the strength of all actual causes and effects linking the two consecutive system states. Finally, we examine several exemplary cases and paradoxes of causation and show that they can be illuminated by the proposed framework for quantifying actual causation.
△ Less
Submitted 9 January, 2019; v1 submitted 22 August, 2017;
originally announced August 2017.