
Mini-Giants: “Small” Language Models and Open Source Win-Win

Zhengping Zhou (zpzhou@cs.stanford.edu)
Lezhi Li (lli2@gsd.harvard.edu)
Xinxi Chen (xc336@cornell.edu)
Andy Li (andy@RLAI.institute)

arXiv:2307.08189v1 [cs.CL] 17 Jul 2023

Abstract

ChatGPT is phenomenal. However, it is prohibitively expensive to train and refine such giant models. Fortunately, small language models are flourishing and becoming more and more competent. We call them "mini-giants". We argue that the open source community, like Kaggle, and mini-giants will win-win in many ways: technically, ethically and socially. In this article, we present a brief yet rich background, discuss how to attain small language models, present a comparative study of small language models with a brief discussion of evaluation methods, discuss the application scenarios where small language models are most needed in the real world, and conclude with a discussion and outlook.

1 Introduction

Large language models (LMs), like ChatGPT and GPT-4, have taken us by storm. People compare this moment to the moment of the computer, the moment of the operating system, the moment of the Internet, or the moment of the iPhone. It is considered by many a paradigm shift in NLP and deep learning.

Large language models are large: OpenAI GPT-3 has 175B parameters, Google PaLM has 540B, and rumor has it that GPT-4 is as large as 8 × 220B. For most small/medium companies and independent researchers, it is prohibitively expensive to train or update such giant models. In addition, the huge energy consumption of language model training poses a serious concern for environmental sustainability (Verdecchia et al., 2023).

Recent studies show that network size is not the sole determinant of model performance (Hoffmann et al., 2022). And thanks to the efforts of the ML open source community as well as private AI companies, we have recently seen more and more "small" LMs created out of these larger models. With their network parameter sizes of around or below 10B, and performance comparable to or better than ChatGPT / GPT-4, these "small" LMs are indeed "mini-giants".

In this article, we survey the state of the art for these small language models. We show that, compared to their large counterparts, small language/foundation models offer particularly promising opportunities for various industries (including open source ML research and Kaggle competitions) to not only utilize but also actively participate in the creation/adaptation of modern language models and AI in general. We center our arguments around the three key advantages of small models: adaptability, controllability, and affordability.

First of all, smaller models offer better adaptability by being more manageable to modify and fine-tune. In Section 3, we present various strategies for creating these small models through optimized fine-tuning techniques. This is important because in most industries (or even in a Kaggle competition), innovation typically arises from the ability to incorporate domain-specific data into the language model or to adjust the model's structure to accommodate unique requirements. Relying solely on prompt engineering often falls short. Therefore, smaller language models bring great benefits to these industries, offering the much-needed flexibility for adaptation, allowing them to fully leverage the power of AI and thus catalyzing innovation within them.

Second, smaller models can run on local infrastructure without resorting to GPU-rich third parties, improving the model's controllability by ensuring model users' autonomous data governance and result monitoring. In Section 5, we discuss real-world scenarios where small language models fill in the gaps when their large counterparts are unacceptable due to privacy concerns. In Section 4, we also look into strategies for customized instruction following and other pioneering research directions for small models, underpinning the relevance of smaller language
[Figure 1 image: models arranged by training-data quality (pretrain only / GPT-3.5 synthetic / human-curated) against release time (Jun '20 to May '23); models shown include GPT-3, LLaMA, Pythia, InstructGPT, ChatGPT, Alpaca, Vicuna, GPT4All, Dolly 1.0/2.0, Koala, StableVicuna, Guanaco, and Open Assistant.]

Figure 1: An evolution tree of recently released instruction-following small LMs. The color of the text boxes indicates the openness of the license under which the models are released: red stands for proprietary licenses, yellow stands for non-commercial licenses, and green stands for licenses permissive for commercial use.

models in ensuring compliance and mitigating the risk of misinformation. Understanding and managing the way a model operates, the data it accesses, and the outputs it produces form the cornerstone of responsible AI usage.

Another crucial aspect of the superiority of small language models is affordability. Take an average Kaggle competitor as an example: the demanding nature of a Kaggle competition requires the competitor to iterate on modeling solutions, oftentimes by integrating a variety of data sources and trying different architectures. This necessitates transparent model components and a fast iteration pace, which is at odds with the resource requirements that super large language models impose. Having access to fast and inexpensive training/inference options means that competitors do not have to face the trade-off between being constrained in their innovation space and moving away from language model solutions entirely. As another example, elaborated in Section 5, privacy-sensitive sectors such as finance and healthcare face a more pressing challenge of choosing between regulation risks and the prohibitive cost of training massive models in-house. Small language models provide the opportunity for them to conform with regulations while not missing out on the power of the latest AI technologies.

Outline

In the following sections, we first present a brief yet rich background. Next, we discuss how to attain small foundation models, including parameter reduction and efficient training/fine-tuning techniques. Then we present a comparative study of "small" foundation models and a brief discussion of evaluation methods. After that, we discuss the application scenarios where small foundation models are most needed in the real world. We conclude with discussions and an outlook.

2 A brief yet rich background

The Giants are fast ChatGPT set a record for the fastest-growing user base: one million users in five days, and 100 million monthly active users in January 2023, two months after launching.

Radford et al. (2018) introduce generative pre-training for LMs, which could be regarded as "GPT-1". Radford et al. (2019) introduce GPT-2, an unsupervised multitask learning LM. Brown et al. (2020) introduce GPT-3, a few-shot learning LM, popularizing the concept of in-context learning. OpenAI (2022) introduces ChatGPT and OpenAI (2023) introduces GPT-4.

There are many LMs released in recent years: Google BERT, Bidirectional Encoder Representations from Transformers (Devlin et al., 2019); Google T5, Text-To-Text Transfer Transformer (Raffel et al., 2020); Google LaMDA, Language Model for Dialogue Applications (Thoppilan et al., 2022); Google PaLM, Pathways Language Model (Chowdhery et al., 2022); Deepmind Sparrow (Glaese et al., 2022); Anthropic Claude (Bai et al., 2022a); Deepmind Chinchilla (Hoffmann et al., 2022); Nvidia Megatron-Turing NLG (Smith et al., 2022); Deepmind Gopher (Rae et al., 2022); HuggingFace BLOOM (BigScience Workshop et al., 2023); and Meta LLaMA, Large Language Model Meta AI (Touvron et al., 2023).

Language models as experts Besides the general-purpose LMs above, there are many specialized models for various applications; Table 1 shows a sample of them.
Model | Application | Reference
AlphaFold | Protein folding | Tunyasuvunakool et al. (2021)
Codex | Coding | Chen et al. (2023)
AlphaCode | Coding | Li et al. (2022)
RT-1 | Robotics | Brohan et al. (2022)
BiomedGPT | Biomedical | Zhang et al. (2023a)
Clinical Camel | Clinical | Toma et al. (2023)
BloombergGPT | Finance | Wu et al. (2023b)
FinGPT | Finance | Yang et al. (2023)
Med-PaLM 2 | Medical | Singhal et al. (2023)
MusicLM | Music | Agostinelli et al. (2023)
AudioGPT | Audio | Huang et al. (2023)

Table 1: A (small) sample of specialized LMs
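As a concrete illustration of the in-context learning popularized by GPT-3 (mentioned in the background above), a few-shot prompt is simply a string that stacks a handful of worked examples before the query. The sketch below is a generic, hypothetical template for a sentiment task, not taken from any of the surveyed papers:

```python
# Build a few-shot prompt for in-context learning: the model sees k solved
# examples and is asked to complete the last, unsolved one.
def build_few_shot_prompt(examples, query, instruction="Classify the sentiment."):
    lines = [instruction, ""]
    for text, label in examples:        # k worked examples
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")      # the actual task
    lines.append("Sentiment:")          # the model completes from here
    return "\n".join(lines)

demo = [("I loved this movie.", "positive"),
        ("The food was awful.", "negative")]
prompt = build_few_shot_prompt(demo, "What a fantastic day!")
print(prompt)
```

No gradient update is involved: the "learning" happens entirely in the model's forward pass over this string, which is why prompting is often described as the user interface for LMs.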

Language and functional competence Mahowald et al. (2023) study the language competence vs. thought competence of LMs and show impressive but imperfect formal linguistic competence, i.e., "knowledge of rules and patterns of a given language", yet failures on many tests requiring functional linguistic competence, i.e., "a host of cognitive abilities required for language understanding and use in the real world".

We can then leverage LMs' competence as a good model of language, e.g., by prompt engineering. We can also manage to improve the functional competence, e.g., factuality, safety, and planning. With the capacity of in-context learning (Brown et al., 2020), prompting is a natural and popular way to utilize LMs. Prompting is the user interface for LMs, and can be formed with advanced methods like search and coding, e.g., Tree of Thoughts (ToT) (Yao et al., 2023), AdaPlanner (Sun et al., 2023), and Code as Policies (Liang et al., 2023a). Fine-tuning can improve LMs further. Parameter-efficient approaches make fine-tuning large LMs feasible considering the cost (Hu et al., 2021; Ding et al., 2023; Ruder et al., 2022). Augmenting LMs with tools can achieve various functionalities.

To approach artificial general intelligence (AGI) from language models, Mahowald et al. (2023) suggest that, "instead of or in addition to scaling up the size of the models, more promising solutions will come in the form of modular architectures . . . , like the human brain, integrate language processing with additional systems that carry out perception, reasoning, and planning". The authors believe that "a model that succeeds at real-world language use would include – in addition to the core language component – a successful problem solver, a grounded experiencer, a situation modeler, a pragmatic reasoner, and a goal setter".

Augmented LMs with tools A natural way to harness the language competence of LMs is by utilizing tools like a search engine, a vector database, a code interpreter, or a solver to handle tasks, e.g., LangChain¹, HuggingGPT (Shen et al., 2023), Visual ChatGPT (Wu et al., 2023a), TaskMatrix.AI (Liang et al., 2023b), RCI (Kim et al., 2023), LLM+P (Liu et al., 2023a), ChemCrow (Bran et al., 2023), etc. See Mialon et al. (2023) for a survey of augmented LMs.

Domain expertise is still required; e.g., the ChemCrow (Bran et al., 2023) authors mention that "However, it is important to emphasize that potential risks may arise for non-experts who lack the chemical reasoning to evaluate results or the proper lab training, as conducting experiments still necessitates thorough laboratory experience." And the director of the movie trailer mentions that "For those who believe that AI will do everything for you: No!" and "I'll always prefer to put my own heart & soul in."²

Mini-Giants are coming Following the leakage of LLaMA (Touvron et al., 2023), many "small" LMs have appeared in the open source community, with neural network parameter sizes of around 10B or smaller, e.g., Alpaca (Taori et al., 2023), Dolly (Conover et al., 2023), Koala (Geng et al., 2023), Vicuna (Chiang et al., 2023), StableLM (Stability AI, 2023a), ChatGLM (Du et al., 2020; Zeng et al., 2023), Guanaco (Dettmers et al., 2023), Pythia (Biderman et al., 2023), GPT4All³, Open Assistant⁴, and ColossalChat (You, 2023). See Kim (2023) for a list of open-sourced fine-tuned LMs. In Section 4, we will discuss and com-

¹ https://langchain.com
² https://twitter.com/ChristianF369/status/1651607149804498946
³ https://github.com/nomic-ai/gpt4all
⁴ https://github.com/LAION-AI/Open-Assistant
pare these mini-giants in detail.

Discussions & debates abound There are all sorts of discussions & debates, e.g., discussions about AI alignment with human values from Russell (2019); Mitchell (2020); Christian (2021). Table 2 lists a few representative examples.

3 How to make large foundation models "small"

Since the advent of ultra-capable large foundation models like ChatGPT and Stable Diffusion, numerous efforts have been devoted to addressing the primary challenges to their widespread utilization: their humongous parameter sizes and the sheer time and compute resources needed to fine-tune them. Within two years, the research and open source communities have arrived at several strategies to cope with this issue, which we discuss in this section. We classify these strategies into two groups: ones that directly reduce the parameter sizes, and ones that make fine-tuning large models more efficient.

3.1 Foundation models with reduced parameters

Chinchilla (Hoffmann et al., 2022) is the first influential study on the computational efficiency of modern large language models. It put forward the argument that, given a compute budget, the best model is attained not by a larger parameter size, but by more training data tokens. Based on this principle, the authors produced the Chinchilla 70B model, which outperforms prior large models four times as large, with the same amount of compute.

LLaMA (Touvron et al., 2023) further reduces the parameters and releases a series of models ranging from 7B to 65B parameters, following the Chinchilla computation rule. Notably, the paper used only publicly available datasets as the training corpus and proved comparable performance to closed-source counterparts. This, as commented by Harris et al., started a revolution of open source LLMs. Along with parameter reduction, another contribution by the authors is an efficient implementation of multi-headed attention layers through the open source xformers library, which optimizes memory consumption in training.

3.2 Efficient fine-tuning strategies for foundation models

Compared with building even more compact models, the majority of research work by the ML community in the direction of "smaller" foundation models is around making them easier to fine-tune. Here we list several key strategies to achieve this.

Adapter (Houlsby et al., 2019) is a strategy to add NN layers after existing layers (usually transformer blocks) in pretrained foundation models, so that they can be adapted to custom tasks without changing the weights of the existing layers. The paper proposes an adapter module with two linear layers plus a non-linear activation in between. The first layer projects the hidden state to a lower-dimensional space, and the second layer projects it back to the original dimension. A newer paper (Lin et al., 2020) recommended only one linear layer plus an additional LayerNorm as the Adapter module. Adapter achieves near state-of-the-art performance while adding only a small number of parameters per task: on GLUE, the added parameters accounted for 3.6% of the original model.

Prefix fine-tuning (Li and Liang, 2021) Unlike the Adapter architecture, which focuses on modifying model behavior via model params, prefix fine-tuning seeks to train a few params that are used as input prefixes for each custom subtask. The authors comment that the method is inspired by prompting: similar to prepending a few sentences before a generation task, prefix-tuning prepends a sequence of trained vectors to the input, except that the prefix vectors do not have to correspond to any real tokens. Compared to full fine-tuning, prefix fine-tuning achieves comparable or better performance with just 0.1% added parameters.

LoRA (Hu et al., 2021) marks substantial progress in parameter-efficient fine-tuning. Performance-wise, it is more efficient than previous methods like Adapter and prefix fine-tuning. LoRA proposes that we add a low-rank, trainable matrix in parallel to the frozen, pretrained model weights. The activation will be the sum of these two matrices. Formally:

h = W₀x + ΔWx = W₀x + BAx

where B and A are much "thinner" (i.e., low-rank) trainable matrices compared to W₀ (the frozen pretrained matrix). The use of low-rank matrices reduces trainable parameters by as much as
Issue to discuss | Reference
The dangers of stochastic parrots | Bender et al. (2021)
Limitation of neural networks | Delétang et al. (2023)
Limitation of autoregressive models | Lin et al. (2021)
Lack of causality | Jin et al. (2023)
Lack of compositionality | Dziri et al. (2023)
Lack of recursion | Zhang et al. (2023b)
Limitation of scaling laws | Deshpande et al. (2023)
Limitation of scaling laws | McKenzie et al. (2023)
Model collapse | Shumailov et al. (2023)
Artificial general intelligence (AGI) | Marcus (2023)
Evaluation of AI | Burnell et al. (2023)
Distortion of human beliefs | Kidd and Birhane (2023)
Social norms | Browning and LeCun (2023)
Risks and benefits | Goldman (2023)
Existential risk | Bengio (2023)
Court hearing due to hallucination | Novak (2023)
Risk of further concentration of wealth | Chiang (2023)
Eight things to know | Bowman (2023)

Table 2: Discussions and debates of LMs
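The LoRA update h = W₀x + BAx from Section 3.2 can be illustrated with a minimal pure-Python sketch. The dimensions and values below are toy numbers chosen for illustration only; real models use hidden sizes in the thousands:

```python
# Toy LoRA sketch: h = W0 @ x + B @ (A @ x), where W0 (d x d) is frozen
# and only B (d x r) and A (r x d) are trained, with rank r << d.

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

d, r = 4, 1                                   # toy dimensions
W0 = [[1 if i == j else 0 for j in range(d)] for i in range(d)]  # frozen (identity here)
A  = [[0.5] * d]                              # r x d, trainable
B  = [[1.0] for _ in range(d)]                # d x r, trainable

x = [1.0, 2.0, 3.0, 4.0]
delta = matvec(B, matvec(A, x))               # BAx, the low-rank update
h = [w + dw for w, dw in zip(matvec(W0, x), delta)]

frozen_params    = d * d                      # 16
trainable_params = d * r + r * d              # 8 here; shrinks relative to d*d as d grows
print(h, trainable_params / frozen_params)
```

At the toy size the trainable fraction is still large, but for a 4096-wide layer with rank 8 the same formula gives roughly 0.4% of the frozen weight count, which is the source of LoRA's savings.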

10,000 times compared to a full fine-tune of GPT-3 175B. The article suggests that LoRA can be used next to any model weights, not just transformer layers. The authors claim that LoRA is superior to Adapters in that it doesn't introduce additional inference latency, and better than prefix fine-tuning in that it doesn't reduce the available sequence length like the latter does. Furthermore, since this architectural modification is orthogonal to the ideas of Adapter and prefix fine-tuning, LoRA can be used in conjunction with them for even better results.

QLoRA (Dettmers et al., 2023) As an improvement over LoRA, QLoRA proposes optimization methods via quantized low-rank fine-tuning. Innovations of QLoRA include a 4-bit data type, NormalFloat4, which optimizes information efficiency for normally distributed data (e.g., weights) based on information theory. Apart from that, the paper uses Paged Optimizers (partial optimizer state stored on CPU rather than GPU) to manage memory spikes, such as when processing mini-batches with long sequence lengths. Experiment results show that fine-tuning using QLoRA reaches 99.3% of the performance of ChatGPT, and only requires training for 24 hours on one GPU.

ControlNet (Zhang and Agrawala, 2023) is proposed as a method to efficiently fine-tune image generation models (diffusion models) on user-defined tasks. Because image generation models in general have a larger design space in terms of user interaction than language models, we list this method here to inspire readers to consider more complex scenarios of controlling / customizing a large foundation model's outputs.

ControlNet copies the weights of the original model to a frozen copy (like all methods mentioned above). The trainable branch consists of an exact copy of the frozen copy, as well as two convolution layers called "zero convolutions", one before and one after the trainable copy. In the fine-tuning forward path, the activation from the trainable copy is combined with that of the frozen copy by zero convolution. The so-called zero convolution is just a 1×1 convolution layer whose weights and biases are both initialized to zero. The results of using ControlNet show that on some tasks, ControlNets trained on a personal computer achieve results comparable to commercial models trained on terabytes of GPU memory and thousands of GPU hours.

4 A brief survey of "small" instruction-following LMs

Over the past few months, we have seen small LMs flourish. See Figure 1 for an evolution tree. This is a very fast progressing field, and it is challenging to even keep up with the latest progress. Quoting Tunguz (2023), "Trying to get ahead in AI these days feels like wrestling a rabid 5,000 lbs hippo covered in baby oil".

4.1 Closed-source milestones

GPT-3 (Brown et al., 2020) gained public attention when it was released in 2020. As reported by the New York Times, it "generates tweets, pens poetry,
MM/YY | Model | Institute | # parameters | Training hardware cost | Training data size | L | I | TC | TD
06/20 | GPT-3 | OpenAI | 175B | 3.64k PT-days | 300B tokens | P | P | P | ✓
02/23 | LLaMA-7B | Meta | 7B | 82k GPU-hours | 1.4T tokens | NC | ✓ | ✗ | ✓
02/23 | LLaMA-13B | Meta | 13B | 135k GPU-hours | 1.4T tokens | NC | ✓ | ✗ | ✓
04/23 | Pythia-7B | Eleuther AI | 7B | 33.5k GPU-hours | 300B tokens | C | ✓ | ✓ | ✓
04/23 | Pythia-12B | Eleuther AI | 12B | 72k GPU-hours | 300B tokens | C | ✓ | ✓ | ✓

Table 3: Comparison of recent base LMs. In the Openness section, L stands for License, I stands for Inference, TC stands for Training Code, and TD stands for Training Data. In the License column, P stands for Proprietary, NC stands for Non-Commercial, and C stands for permissive for Commercial use.

MM/YY | Model | Institute | Backbone | # parameters | Training hardware cost | L | I | TC | TD
01/22 | InstructGPT | OpenAI | GPT-3 | 1.3B | N/A | P | P | P | P
11/22 | ChatGPT | OpenAI | GPT-3 | N/A | N/A | P | P | P | P
03/23 | Alpaca-7B | Stanford | LLaMA-7B | 7B | < $100 | NC | ✓ | ✓ | ✓
03/23 | GPT4All-Lora | Nomic AI | LLaMA-7B | 7B | $100 | NC | ✓ | ✓ | ✓
03/23 | ChatGLM-6B | Tsinghua | GLM | 6B | N/A | NC | ✓ | ✓ | ✗
03/23 | Vicuna-7B/13B | LMSYS | LLaMA-7B/13B | 7B/13B | $140/$300 | NC | ✓ | ✓ | ✓
03/23 | Dolly-6B | Databricks | GPT-J-6B | 6B | < $30 | NC | ✓ | ✓ | ✓
04/23 | OASST-12B | LAION AI | Pythia-12B | 12B | N/A | C | ✓ | ✓ | ✓
04/23 | Koala-13B | Berkeley | LLaMA-13B | 13B | < $100 | NC | ✓ | ✓ | ✓
04/23 | Dolly-v2-12B | Databricks | Pythia-12B | 12B | N/A | C | ✓ | ✓ | ✓
04/23 | StableVicuna-13B | Stability AI | Vicuna-13B | 13B | N/A | NC | ✓ | ✓ | ✓
05/23 | Guanaco-7B/13B | UW | LLaMA-7B/13B | 7B/13B | < 12 GPU-hours | NC | ✓ | ✓ | ✓

Table 4: Comparison of recent instruction-following small LMs. The abbreviations of the column names follow Table 3.
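Part of why the fine-tuning costs in Table 4 stay so low is the 4-bit quantization introduced by QLoRA (Section 3.2). The idea can be illustrated with a simplified absmax quantizer; note this is a generic round-to-nearest scheme for illustration, not QLoRA's actual NormalFloat4 data type:

```python
# Simplified 4-bit absmax quantization: map weights onto 16 signed integer
# levels (-8..7), storing the integer codes plus one scale factor per block.
# This is generic round-to-nearest, NOT QLoRA's NormalFloat4 data type.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7   # map the largest weight to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.12, -0.70, 0.33, 0.04]
codes, scale = quantize_4bit(w)
w_hat = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(codes, max_err)
```

Each weight now costs 4 bits instead of 16 or 32, at the price of a reconstruction error bounded by half the scale step; QLoRA then trains full-precision LoRA matrices on top of the frozen quantized weights.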

summarizes emails, answers trivia questions, translates languages and even writes its own computer programs" (Markoff, 2020). It shows that decent few-shot performance can be achieved without gradient updates, and the unprecedented model scale (175B parameters) is a key ingredient for success.

InstructGPT Although GPT-3 is already powerful, Ouyang et al. (2022) point out that the model output may not align well with human intent and may contain harmful content. For example, when prompted to generate a story, the LM should generate a story instead of rambling around the prompt itself. This necessitated an extra step called model alignment, and the desired model behavior is called instruction-following. In InstructGPT, this is achieved by applying the reinforcement learning from human feedback (RLHF) (Christiano et al., 2017) technique on top of a GPT-3 backbone. Despite having 100x fewer parameters, InstructGPT outperforms the unaligned GPT-3 model in human evaluation, giving rise to the phenomenal success of ChatGPT ten months later.

ChatGPT (OpenAI, 2022) brings AIGC to the attention of the general public. It uses the same technique as InstructGPT, but extends it by incorporating dialogue data into the supervised fine-tuning and RLHF stages. It acquired 1 million users in just 5 days and revolutionized the way people interact with modern AIs. As a proprietary product, although the web UI is free, the underlying model can only be accessed via a paid API.

4.2 Open-source backbone LMs

LLaMA Despite the recent success of GPT-3 and ChatGPT, training and deploying LLMs remains a major challenge for the open source community due to the high training infrastructure cost. For instance, the GPT-3 training is estimated to cost millions of dollars. Touvron et al. (2023) propose LLaMA, an open source LLM pretrained with public data, available at several sizes. Remarkably, the 13B LLaMA model benefited from large-scale pretraining data (1.4T tokens), and outperforms the 175B GPT-3 on most benchmarks. It soon became a highly influential milestone in the open source world, serving as a powerful yet lightweight backbone for a wide range of subsequent instruction-following small LMs. The non-commercial bespoke license under which it is released limits its usage to research purposes only.
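The RLHF recipe behind InstructGPT and ChatGPT first trains a reward model on human preference pairs before any reinforcement learning. A minimal sketch of the standard pairwise reward-model objective, -log σ(r_chosen - r_rejected), is shown below with toy scalar scores (not outputs of the actual models):

```python
import math

# Pairwise reward-model loss used in RLHF: for each human comparison,
# penalize the reward model when the "rejected" answer scores above
# the "chosen" one.
def reward_model_loss(pairs):
    # pairs: list of (r_chosen, r_rejected) scalar reward scores
    total = 0.0
    for r_chosen, r_rejected in pairs:
        sigma = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
        total += -math.log(sigma)
    return total / len(pairs)

good = [(2.0, -1.0), (1.5, 0.0)]   # chosen answers score higher -> small loss
bad  = [(-1.0, 2.0), (0.0, 1.5)]   # preferences inverted -> large loss
assert reward_model_loss(good) < reward_model_loss(bad)
print(reward_model_loss(good), reward_model_loss(bad))
```

The trained reward model then supplies the scalar signal that the policy LM is optimized against (with a KL penalty toward the supervised model, a detail omitted here).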
Pythia (Biderman et al., 2023) Published two months later than LLaMA, Pythia releases a suite of 16 LLMs ranging from 70M to 12B parameters. Trained with 300B tokens from the Pile (Gao et al., 2020), it consumed a similar amount of data as GPT-3 but around four times less than LLaMA (see Table 3 for a comparison). Released under the Apache 2.0 license, Pythia is free for commercial use, making it an appealing backbone for many subsequent instruction-following small LMs (e.g., Open Assistant (Köpf et al., 2023), Dolly 2.0 (Conover et al., 2023)).

4.3 Small LMs trained with GPT synthetic data

Since the release of LLaMA, open-source instruction fine-tuned small LMs have emerged at a rapid pace. Viewing LLaMA as an open-source counterpart of GPT-3, these small LMs can be seen as the open-source counterparts of InstructGPT or ChatGPT. Most of them can be fine-tuned under a feasible budget (the training hardware cost can be capped under several hundred dollars).

A major challenge is to obtain high-quality instruction-following data, a key ingredient in the model alignment stage. At an early stage, the open-source community tackled this challenge by using GPT-3.5 (OpenAI, 2022) to synthesize the response to a given prompt. This imposes a non-commercial license on the fine-tuned model.

Alpaca (Taori et al., 2023) is the first newborn in this family. It fine-tunes LLaMA-7B with 52k instruction-following examples generated using the self-instruct method, which leverages GPT-3.5 to synthesize prompt-response pairs from a manually created seed set. According to human evaluation, it achieves similar performance to GPT-3.5 on a small data sample.

GPT4All (Anand et al., 2023) fine-tunes LLaMA-7B with 437k prompt-response pairs. The instructions are collected from unified_chip2 and Stack Overflow Questions, while the responses are generated by GPT-3.5. The model is fine-tuned using the LoRA (Hu et al., 2021) algorithm. Evaluated using the ground truth perplexity on the Self-Instruct (Wang et al., 2023) human evaluation data, GPT4All stochastically outperforms Alpaca.

Vicuna (Chiang et al., 2023) fine-tunes LLaMA-13B with 70k user-shared conversations with ChatGPT (from ShareGPT.com). Compared to Alpaca, it accounts for multi-turn conversation in training, and makes several optimizations to cut the training cost. Vicuna uses GPT-4 as an automatic chatbot judge, based on which it outperforms LLaMA and Alpaca, while achieving more than 90% of the quality of ChatGPT. A more rigorous analysis validating this evaluation approach is later presented in the Guanaco work (Dettmers et al., 2023).

Koala (Geng et al., 2023) is another instruction fine-tuned LLaMA model, with 13B parameters. It is a concurrent effort with Vicuna, released at a similar time. Like Vicuna, it is fine-tuned on ChatGPT-distilled data, with a focus on the dialogue scenario. In human evaluation, Koala achieves comparable or superior results compared to Alpaca.

4.4 Small LMs trained with human-curated data

Dolly 1.0 (Conover et al., 2023) trains a two-year-old GPT-J-6B backbone using the same data as Alpaca, showcasing that the instruction-following capability does not necessarily require a state-of-the-art backbone model as long as the data quality is decent. Dolly 2.0, released one month later, upgrades to the newly released Pythia-12B (Biderman et al., 2023) backbone and is instruction fine-tuned using a newly crowd-sourced dataset, databricks-dolly-15k, which contains 15k human-generated prompt-response pairs. Notably, it is the first open-source instruction-following small LM that permits commercial use.

Open Assistant (Köpf et al., 2023) uses LLaMA-13B and Pythia-12B as the backbones, allowing it to release chatbots under either non-commercial or commercial licenses. It also releases the OpenAssistant Conversations (oasst1) dataset, which contains 66k human-generated conversations accompanied by quality ratings. It also includes human preferences for the model responses, which enables RLHF training. After fine-tuning on this dataset, Open Assistant achieves a 48.3% vs. 51.7% preference rate as compared to ChatGPT. As a high-quality human-generated dataset free of GPT-synthesized content, oasst1 is widely used in follow-up works.

StableVicuna After the release of the oasst1 dataset, Stability AI (2023b) proposes StableVicuna, "the AI world's first open-source RLHF LLM chatbot". It is fine-tuned on the Vicuna-13B model using a mix of the prompt-response datasets from Open Assistant, GPT4All, and Alpaca. The model
is further optimized using RLHF with human pref- approach for LLM low-bit weight-only quantiza-
erence data from Open Assistant, HH-RLHF (Bai tion", exploiting the observation that "protecting
et al., 2022b), and SHP (Ethayarajh et al., 2022). only 1% of salient weights can greatly reduce quan-
By the time StableVicuna is released, it outper- tization error".
forms other similarly sized open-source chatbots
on a number of question-answering benchmarks. Performance improvement Liu and Low (2023)
propose a fine-tuned LLaMA-based model Goat to
Guanaco (Dettmers et al., 2023) introduces an outperform GPT-4 on arithmetic tasks, due to con-
efficient fine-tuning approach called QLoRA. As sistent tokenization of numbers by LLaMA. The au-
a by product, the chatbot Guanaco-65B fine-tuned thors decompose challenging tasks like multi-digit
on top of LLaMA achieves state-of-the-art results multiplication and division into learnable tasks and
in human evaluation. It also releases the 7B/13B leverage basic arithmetic principles. The authors
versions which are of a similar scale as previously show that Goat-7B can be trained with LoRA on a
mentioned small LMs. The fine-tuning dataset is a 24GB VRAM GPU.
mix of oasst1 (Köpf et al., 2023) and some other public datasets.

4.5 Community trends and research directions

In addition to the trained models shown above, we would like to point out a few research trends around the topic of making small language models more efficient and performant. We discuss studies on accelerated training for large language models, performance improvement strategies, the scaling rules of large models, as well as evaluation frameworks.

Acceleration and optimization Hewitt et al. (2023) propose Backpack, a new network architecture that takes performance, interpretability, and control into consideration. In Backpack, each word in a vocabulary is associated with multiple learned non-contextual sense vectors, and a word in a sequence is represented as a context-dependent, non-negative linear combination of its associated sense vectors. The authors show that a 170M-parameter Backpack LM on OpenWebText has a loss comparable to that of a 124M-parameter GPT-2 small, and that Backpack sense vectors outperform the word embeddings of a 6B-parameter Transformer LM on lexical similarity evaluations.

Liu et al. (2023b) propose Sophia (Second-order Clipped Stochastic Optimization), an optimizer that uses a lightweight estimate of the diagonal Hessian as the pre-conditioner to improve on the popular, state-of-the-art optimizer Adam. Sophia needs half the number of steps, total compute, and wall-clock time of Adam when pre-training GPT-2 models of sizes from 125M to 770M. The authors also prove theoretical properties of Sophia.

Lin et al. (2023) propose Activation-aware Weight Quantization (AWQ), "a hardware-friendly approach for LLM low-bit weight-only quantization".

Performance improvement Patil et al. (2023) propose Gorilla, a finetuned LLaMA-based model that surpasses GPT-4 at writing API calls. With a document retriever, Gorilla adapts to document changes like user updates and version changes and mitigates hallucination. The authors also introduce APIBench, a dataset including HuggingFace, TorchHub, and TensorHub APIs.

Study of the scaling law Eldan and Li (2023) show that LMs with <10M parameters and one Transformer block can generate fluent and consistent stories of several paragraphs with close to perfect grammar.

Gunasekar et al. (2023) introduce phi-1 and show good coding performance with 1.3B parameters and 7B training tokens, using a selection of "textbook quality" data.

Deshpande et al. (2023) study downscaling effects with a shrunk language, showing the benefits of pre-training for models of as few as 1.25M parameters and that compute-optimal models break the power law. McKenzie et al. (2023) provide 11 datasets for empirical analysis of inverse scaling laws and discuss the importance of data and objectives for training LMs. Zhang et al. (2023c) propose NeQA, a dataset containing questions with negation, on which models exhibit inverse scaling, U-shaped scaling, or positive scaling. Before this, the popular view followed scaling laws: the overall cross-entropy loss of an LM improves with the increased scale of model, dataset, and compute for training (Kaplan et al., 2020), and model and data should be scaled equally for compute-optimal training (Hoffmann et al., 2022).

Evaluation for instruction-following LMs Fairly assessing the performance of instruction-following LMs poses a challenging task, given the extensive variety of tasks they must handle.
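Many of the evaluations cited in this survey reduce to pairwise preference judgments: show the same prompt to two models and let a judge (a human rater, or a strong LM such as GPT-4) pick the better answer, then report a win rate. A minimal sketch of that loop follows; the helper names (`win_rate`, `toy_judge`) are ours for illustration, not from any cited paper, and the toy judge is a stand-in for a real rater:

```python
# Minimal sketch of pairwise-preference evaluation between two chat models.
# The judge is a stand-in here; in practice it is a human rater or a strong
# LM prompted to choose the better of two answers.

def win_rate(answers_a, answers_b, judge):
    """Fraction of prompts on which model A's answer is preferred."""
    wins = sum(1 for a, b in zip(answers_a, answers_b) if judge(a, b) == "A")
    return wins / len(answers_a)

# Toy judge that prefers the longer answer, for illustration only.
def toy_judge(a, b):
    return "A" if len(a) >= len(b) else "B"

print(win_rate(["a long answer", "hi"], ["short", "a longer answer"], toy_judge))
```

The interesting design questions, discussed below, are who the judge is and how many prompts are compared: a handful of human raters over a few hundred comparisons, or an LM proxy over a fixed question set.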
Such tasks include question answering, mathematics problem solving, coding and debugging, translation, and more. Furthermore, assessing the quality of chatbot responses is highly subjective in nature.

Most works in Sections 4.3 and 4.4 are evaluated by a few human evaluators on a small data sample. For instance, Alpaca (Taori et al., 2023) is evaluated by five students on around two hundred comparisons against text-davinci-003. Koala (Geng et al., 2023) is evaluated by 100+ people on 180 test queries. Open Assistant (Köpf et al., 2023) is evaluated using 7,042 manual comparisons on a sample of 22 prompts.

On the other side, Vicuna (Chiang et al., 2023) employs GPT-4 as a proxy evaluator across 80 questions. This approach gains further support from Guanaco (Dettmers et al., 2023), wherein both GPT-4 and humans are used to evaluate 953 user queries. The comparison demonstrates that GPT-4 evaluations serve as a "cheap and reasonable" substitute for human evaluation.

Evaluation of LMs in general, not just the instruction-following ones, continues to be a significant challenge and an active area of research. We delve deeper into this topic in Section 4.6.

4.6 Evaluation

Evaluation feedback is valuable for researchers and engineers to improve learning algorithms. Evaluation and benchmarks for natural language processing, in particular for language models and interactive applications, have been enjoying steady progress. However, evaluation remains challenging for research and development.

Burnell et al. (2023) present guidelines for robust evaluation practices with more granular reporting, in particular, in-depth performance breakdowns beyond aggregate metrics and instance-by-instance evaluation results.

Gehrmann et al. (2022) survey obstacles in the evaluation of text generation and propose to evaluate a model with multiple datasets via multiple metrics and to document human evaluation well. The authors propose the following best practices and implementations: make informed evaluation choices and document them, measure specific generation effects, analyze and address issues in the used dataset(s), evaluate in a comparable setting, run a well-documented human evaluation, produce robust human evaluation results, document results in model cards, and release model outputs and annotations.

Srivastava et al. (2022) propose the Beyond the Imitation Game benchmark (BIG-bench) with more than 200 tasks.

Liang et al. (2022) propose Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs, with 1) a taxonomy of the LM evaluation design space w.r.t. scenarios and metrics, 2) broad coverage of 16 core scenarios with 7 metrics, i.e., accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, together with 7 targeted evaluations of skills and risks and 21 new scenarios, and 3) an evaluation of 30 existing models.

Lee et al. (2022) propose Human-AI Language-based Interaction Evaluation (HALIE), which goes beyond non-interactive evaluation by considering targets (full process and final output), perspectives (first-person and third-party), and criteria (preference and quality).

Pythia (Biderman et al., 2023) is a suite of 16 LMs with sizes from 70M to 12B parameters and public access to checkpoints for each model, to analyze the development and evolution of LMs over the course of training.

Shumailov et al. (2023) discuss the issue of model collapse due to training with data generated by LMs and show the importance of genuine human data for LMs.

5 Applying “Mini-Giants” to the real world

"Mini-giants" are uniquely positioned to solve two important issues unaddressed by larger language models: privacy protection and local computation. We examine the application of these smaller models in real-world scenarios, using the therapeutic chatbot Woebot as an example. Cognitive Behavioral Therapy (CBT) in Woebot took several years to move from popular casual use to being closer to clinically ready.

Before delving into the discussion, let us clarify the definition of small language models. Recall that by today's standard, small LMs are models with parameter sizes of around 10B or lower and with performance comparable to or better than ChatGPT / GPT-4. However, this is a definition based on today's technological capabilities. With the development of hardware and other optimization software, there will certainly be "mini-giants" with many more parameters in the future. Therefore, to future-proof our discussion of applications, we use a more extensible definition.
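A useful sanity check for what counts as "small" is memory arithmetic: storing N parameters at b bits each takes N x b / 8 bytes, which largely determines whether a model fits on a single GPU. A rough sketch follows; the 24 GB budget and 7B size are illustrative, and only weight storage is counted (activations, KV cache, and optimizer state are ignored):

```python
# Back-of-the-envelope check of whether a model's weights fit on one GPU.
# Weights only: activations, KV cache, and optimizer state are ignored.

def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Gigabytes needed to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9

GPU_BUDGET_GB = 24  # e.g., a single consumer card

for bits in (32, 16, 8, 4):
    need = weight_gb(7e9, bits)  # a 7B-parameter "mini-giant"
    verdict = "fits" if need <= GPU_BUDGET_GB else "does not fit"
    print(f"{bits:2d}-bit: {need:5.1f} GB -> {verdict}")
```

At 16 bits a 7B model already fits in 24 GB, and at 4 bits its weights need only about 3.5 GB, which is why quantized fine-tuning such as QLoRA (Dettmers et al., 2023) and quantization schemes such as AWQ (Lin et al., 2023) are central to this setting.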
A "mini-giant", under this extended definition, is a language model that can be trained, modified, and used with affordable resources, for example by an open-source developer with a single GPU today.

Compared to their larger counterparts, "mini-giants" offer two advantages: privacy protection and computation efficiency. Users wishing to utilize language models have two primary choices: they can either use APIs provided by organizations like OpenAI, or build their own "mini-giants". If they choose the former, their proprietary data is expected to pass through a third party's servers and be logged, which would be unacceptable to sensitive industries such as financial or health care institutes. "Mini-giants", on the other hand, permit keeping user data local, potentially on a single GPU; for example, "Alpaca-Lora" can run locally on affordable hardware like a Raspberry Pi. In terms of computation efficiency, industries like autonomous driving may face high network latency when connecting to remote data centers, so it is crucial that the language model can function independently.

To demonstrate the advantages of "mini-giants", we examine Cognitive Behavioral Therapy (CBT), an effective technique for treating clinical depression. Moving CBT from casual to clinical use is a demanding process involving extensive clinical trials. Woebot, an AI chatbot, incorporates CBT into daily use, providing around-the-clock mental health support and anxiety reduction. The company Woebot was founded by Alison Darcy, a psychology student who worked as a software engineer and then joined Stanford as a postdoctoral researcher in clinical psychology in 2017. Since its establishment, it has received endorsement from AI pioneers such as Andrew Ng, who joined the board of directors in 2017. The chatbot is a popular app with a 4.7 rating out of 5 and more than 5,900 reviews as of July 2023, and it exchanged millions of messages with users every week in 2021 (Steven Loeb, 2021).

However, despite great user reviews, it took more than two years for the company to go through the clinical trials process and get closer to being endorsed by mental health doctors. Woebot first posted its clinical trials recruitment notice on ClinicalTrials.gov in 2019 and designed a process to recruit 101 participants to evaluate whether the chatbot can help with alcohol use disorders, among other conditions. It took around 5 months to complete the study in 2020, and the results were first posted in August 2022 (Woebot Health, 2022). In 2023, Woebot announced the enrollment of the first patient in a pivotal clinical trial to evaluate whether it can help women with postpartum depression (Woebot Health, 2023). Their paper published in Expert Review of Medical Devices (Darcy et al., 2022) documented the clinical trial process.

The reader might ask why it takes such a complicated experimentation process to adopt a new technology in clinical trials and go through the U.S. Food & Drug Administration (FDA) process. The answer is simple: if your family and friends were going to see a doctor for mental health help, what evidence would you need to decide that a chatbot is as trustworthy as a doctor?

In short, "mini-giants", a.k.a. "small" language models, have unique advantages in privacy protection and computation efficiency. However, their successful integration into specific domains like healthcare requires adherence to industry standards, a frequently long process involving more than just technological considerations.

6 Discussion and outlook

As the capability of large foundation models and AI becomes increasingly well-known to the general public, the demand for AI democracy becomes an issue of societal fairness and equity. In our opinion, the open source community and "small" language models mark one step towards facilitating AI democracy, making it easier for everyone to control, adapt, interpret, and afford the power of AI.

• Adaptability: For the open source communities including Kaggle, the ability to innovate comes from the capability to use the model in ways that are best suited to domain-specific scenarios. Prompt engineering alone is not enough. Thanks to the methods mentioned in Section 3, fine-tuning even complex model architectures can mostly be achieved on a single or a few GPUs. Without this, the role of ML researchers without an unimaginable amount of resources risks being diminished to that of prompt engineers.

• Controllability: Being able to choose where to run the model, what data is seen by the model, and what model outputs are used relies heavily on the model being easy enough to run on local infrastructure, and on model components being transparent and interpretable. Section 4 listed a wide range of options to select from.
These options, for research and/or business use, leverage the power of large foundation models and at the same time keep data local. Moreover, with smaller models, users have a better chance of tuning them with instruction-following strategies, to further reduce misinformation and ensure compliance requirements for model outputs. This increases the chance of successful AI application in compliance-demanding domains.

• Affordability: Having access to smaller models and cheaper training / fine-tuning options is the only way that privacy-sensitive industries and applications can avoid the trade-off between giving up the right of autonomous data governance and squandering unreasonable amounts of funds on training gigantic models in-house. As mentioned in Section 5, the affordable option of building domain-specific "small" language models enables industries like finance and healthcare to leverage AI without risking leaking sensitive data to unwarranted third parties. In this sense, the lowered costs brought about by these "small" models can prevent the privilege of using AI from falling into the hands of a few exclusive entities.

To sum up, being users of new achievements like GPT-4 is great. Being builders and/or owners of innovations is even better. As technology optimists, the authors believe that it is only through the ability to understand and leverage AI that society as a whole can mitigate the potential AI risks. With a well-designed paradigm, the open source community and small language models can increase the chance for all to benefit from, and to contribute to, the power of AI.

Acknowledgments

The authors would like to thank Yiyao Liu and Qibin Chen for offering constructive feedback and valuable insights. The authors used ChatGPT (OpenAI, 2022) to edit several sentences in the essay with the following prompt: Revise to more concise, formal, and fluent, following the style of an academic research paper: [Insert sentence].

References

Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. 2023. MusicLM: Generating music from text. arXiv.

Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022b. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In ACM conference on fairness, accountability, and transparency.

Yoshua Bengio. 2023. How rogue AIs may arise. https://yoshuabengio.org/2023/05/22/how-rogue-ais-may-arise/.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv.

BigScience Workshop et al. 2023. BLOOM: A 176B-parameter open-access multilingual language model. arXiv.

Samuel R. Bowman. 2023. Eight things to know about large language models. arXiv.

Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. 2023. ChemCrow: Augmenting large-language models with chemistry tools. arXiv.

Anthony Brohan et al. 2022. RT-1: Robotics transformer for real-world control at scale. arXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.

Jacob Browning and Yann LeCun. 2023. AI chatbots don't care about your social norms. https://www.noemamag.com/ai-chatbots-dont-care-about-your-social-norms/.

Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and Jose Hernandez-Orallo. 2023. Rethink reporting of evaluation results in AI. Science, 380(6641):136–138.

Mark Chen et al. 2023. Evaluating large language models trained on code. arXiv.

Ted Chiang. 2023. Will A.I. become the new McKinsey? https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Aakanksha Chowdhery et al. 2022. PaLM: Scaling language modeling with pathways. arXiv.

Brian Christian. 2021. The Alignment Problem: Machine Learning and Human Values. WW Norton.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.

Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free Dolly: Introducing the world's first truly open instruction-tuned LLM. https://tinyurl.com/3v9jss39.

Alison Darcy, Aaron Beaudette, Emil Chiauzzi, Jade Daniels, Kim Goodwin, Timothy Y. Mariano, Paul Wicks, and Athena Robinson. 2022. Anatomy of a Woebot® (WB001): agent-guided CBT for women with postpartum depression. Expert Review of Medical Devices, 19(4):287–301. PMID: 35748029.

Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A. Ortega. 2023. Neural networks and the Chomsky hierarchy. In ICLR.

Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, and Anna Rumshisky. 2023. Honey, I shrunk the language: Language model behavior at reduced scale. In ACL.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv.

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. 2023. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2020. GLM: General language model pretraining with autoregressive blank infilling. In ACL.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. 2023. Faith and fate: Limits of transformers on compositionality. arXiv.

Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English? arXiv.

Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling.

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2022. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. arXiv.

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. Blog post.
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv.

Sharon Goldman. 2023. Top AI researcher dismisses AI 'extinction' fears, challenges 'hero scientist' narrative. https://tinyurl.com/bdd772p5.

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks are all you need. arXiv.

Derrick Harris, Matt Bornstein, and Guido Appenzeller. AI canon. https://a16z.com/2023/05/25/ai-canon/. Accessed: 2023-07-02.

John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. 2023. Backpack language models. In ACL.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models. arXiv.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv.

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe. 2023. AudioGPT: Understanding and generating speech, music, sound, and talking head. arXiv.

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2023. Can large language models infer causation from correlation? arXiv.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv.

Celeste Kidd and Abeba Birhane. 2023. How AI can distort human beliefs. Science, 380(6651):1222–1223.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv.

Sung Kim. 2023. List of open sourced fine-tuned large language models (LLM). https://tinyurl.com/ykf57jd6.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. 2023. OpenAssistant Conversations – democratizing large language model alignment.

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, and Percy Liang. 2022. Evaluating human-language model interaction. arXiv.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In ACL.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023a. Code as policies: Language model programs for embodied control. arXiv.

Percy Liang et al. 2022. Holistic evaluation of language models. arXiv.

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. 2023b. TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs. arXiv.
Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, and Jason Eisner. 2021. Limitations of autoregressive models and their alternatives. In NAACL.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv.

Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2020. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829.

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023a. LLM+P: Empowering large language models with optimal planning proficiency. arXiv.

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. 2023b. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv.

Tiedong Liu and Bryan Kian Hsiang Low. 2023. Goat: Fine-tuned LLaMA outperforms GPT-4 on arithmetic tasks. arXiv.

Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. arXiv.

Gary Marcus. 2023. The sparks of AGI? or the end of science? https://cacm.acm.org/blogs/blog-cacm/271354-the-sparks-of-agi-or-the-end-of-science/fulltext.

John Markoff. 2020. The minds behind the AI 'arms race'. The New York Times.

Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R. Bowman, and Ethan Perez. 2023. Inverse scaling: When bigger isn't better. arXiv.

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented language models: a survey. arXiv.

Melanie Mitchell. 2020. Artificial Intelligence: A Guide for Thinking Humans. Picador.

Matt Novak. 2023. Lawyer uses ChatGPT in federal court and it goes horribly wrong. https://tinyurl.com/5n7uk84m.

OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.

OpenAI. 2023. GPT-4. https://openai.com/research/gpt-4.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. arXiv.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. arXiv.

Jack W. Rae et al. 2022. Scaling language models: Methods, analysis & insights from training Gopher. arXiv.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. JMLR, 21(140):1–67.

Sebastian Ruder, Jonas Pfeiffer, and Ivan Vulic. 2022. Modular and parameter-efficient fine-tuning for NLP models. In EMNLP: Tutorial Abstracts.

Stuart Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. arXiv.

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan. 2023. Towards expert-level medical question answering with large language models. arXiv.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv.

Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv.

Stability AI. 2023a. StableLM: Stability AI language models. https://github.com/stability-AI/stableLM/.

Stability AI. 2023b. StableVicuna: Open-Source RLHF Chatbot. https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot. Accessed on July 4, 2023.

Steven Loeb. 2021. Woebot CEO Michael Evers on AI in mental health, and how to get a chatbot to bond with a human. https://vator.tv/news/2021-07-30-woebot-ceo-michael-evers-on-ai-in-mental-health-and-how-to-get-a-chatbot-to-bond-with-a-human. Accessed on July 1, 2023.

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive planning from feedback with language models. arXiv.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Romal Thoppilan et al. 2022. LaMDA: Language models for dialog applications. arXiv.

Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. 2023. Clinical Camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv.

Bojan Tunguz. 2023. Tweet by Bojan Tunguz. https://twitter.com/tunguz/status/1673760614576189441?s=20. [Accessed July 1,

Michael Figurnov, Olaf Ronneberger, Russ Bates, Simon A. A. Kohl, Anna Potapenko, Andrew J. Ballard, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Ellen Clancy, David Reiman, Stig Petersen, Andrew W. Senior, Koray Kavukcuoglu, Ewan Birney, Pushmeet Kohli, John Jumper, and Demis Hassabis. 2021. Highly accurate protein structure prediction for the human proteome. Nature, 596(7873):590–596.

Roberto Verdecchia, June Sallou, and Luis Cruz. 2023. A systematic review of green AI. Data Mining Knowledge Discovery.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning language model with self generated instructions. In ACL.

Woebot Health. 2022. Woebot for Substance Use Disorders. https://classic.clinicaltrials.gov/ct2/show/study/NCT04096001. Accessed on July 1, 2023.

Woebot Health. 2023. Woebot Health Enrolls First Patient in Pivotal Clinical Trial of WB001 for Postpartum Depression. https://woebothealth.com/woebot-health-enrolls-first-patient-in-pivotal-clinical- Accessed on July 1, 2023.

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023a. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023b. BloombergGPT: A large language model for finance. arXiv.

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-source financial large language models. arXiv.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.

Yang You. 2023. ColossalChat: An open-source solution for cloning ChatGPT with a complete RLHF pipeline. https://bit.ly/42ZTwW4.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng
2023]. Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-
130B: An open bilingual pre-trained model. In ICLR.
Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu,
Tim Green, Michal Zielinski, Augustin Žídek, Alex Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan
Bridgland, Andrew Cowie, Clemens Meyer, Agata Adhikarla, Sunyang Fu, Xun Chen, Chen Chen,
Laydon, Sameer Velankar, Gerard J. Kleywegt, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davi-
Alex Bateman, Richard Evans, Alexander Pritzel, son, Quanzheng Li, Yong Chen, Hongfang Liu, and
Lichao Sun. 2023a. BiomedGPT: A unified and gen-
eralist biomedical generative pre-trained transformer
for vision, language, and multimodal tasks. arXiv.
Lvmin Zhang and Maneesh Agrawala. 2023. Adding
conditional control to text-to-image diffusion models.
arXiv.
Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman,
Maxim Raginsky, and Talia Ringer. 2023b. Can trans-
formers learn to solve problems recursively? arXiv.

Yuhui Zhang, Michihiro Yasunaga, Zhengping Zhou,


Jeff Z. HaoChen, James Zou, Percy Liang, and Ser-
ena Yeung. 2023c. Beyond positive scaling: How
negation impacts scaling trends of language models.
In ACL.
