Mini Giant
els are flourishing and becoming more and more competent. We call them "mini-giants". We argue that open source communities like Kaggle and mini-giants will win-win in many ways, technically, ethically and socially. In this article, we present a brief yet rich background, discuss how to attain small language models, present a comparative study of small language models and a brief discussion of evaluation methods, discuss the application scenarios where small language models are most needed in the real world, and conclude with discussion and outlook.

1 Introduction

Large language models (LMs), like ChatGPT and GPT-4, have taken us by storm. People compare this to the moment of the computer, the moment of the operating system, the moment of the Internet, or the moment of the iPhone. It is considered by many a paradigm shift in NLP and deep learning.

Large language models are large: OpenAI GPT-3 has 175B parameters, Google PaLM has 540B, and rumor has it that GPT-4 is as large as 8 × 220B. For most small/medium companies and independent researchers, it is prohibitively expensive to train or update such giant models. In addition, the huge energy consumption of language model training poses a serious concern for environmental sustainability (Verdecchia et al., 2023).

Recent studies show that network size is not the sole determinant of model performance (Hoffmann et al., 2022). And thanks to the efforts of the ML open source community as well as private AI companies, we’ve recently seen more and more "small" LMs created out of these larger models. With their network parameter sizes of around or

for these small language models. We show that compared to their large counterparts, small language/foundation models offer particularly promising opportunities for various industries (including open source ML research and Kaggle competitions) to not only utilize but also actively participate in the creation/adaptation of modern language models and AI in general. We center our arguments around the 3 key advantages of small models: adaptability, controllability, and affordability.

First of all, smaller models offer better adaptability by being more manageable to modify and fine-tune. In Section 3, we present various strategies of creating these small models through optimized fine-tuning techniques. This is important because in most industries (or even in a Kaggle competition), innovation typically arises from the ability to incorporate domain-specific data into the language model or to adjust the model’s structure to accommodate their unique requirements. Relying solely on prompt engineering often falls short. Therefore, smaller language models bring great benefits to these industries, offering the much-needed flexibility for adaptation, allowing them to fully leverage the power of AI and thus catalyzing innovation within them.

Second, smaller models can run on local infrastructure without resorting to GPU-rich third parties, improving the model’s controllability by ensuring model users’ autonomous data governance and result monitoring. In Section 5, we discuss real world scenarios where small language models fill in the gaps when their large counterparts are unacceptable due to privacy concerns. In Section 4, we also look into strategies for customized instruction following and other pioneer research directions for small models, underpinning the relevance of smaller language
[Figure 1 placeholder. Vertical axis: data quality (pretrain only; GPT-3.5 synthetic; human-curated). Horizontal axis: time (Jun’20, Jan’22, Nov’22, Feb’23, Mar’23, Apr’23, May’23). Models shown: GPT-3, LLaMA, Pythia, GPT4All, Dolly 1.0, Koala, Alpaca, Vicuna, Dolly 2.0, StableVicuna, Guanaco, Open Assistant, InstructGPT, ChatGPT.]
Figure 1: An evolution tree of recently released instruction-following small LMs. The color of the text boxes indicates the openness of the license under which the models are released: red stands for proprietary licenses, yellow stands for non-commercial licenses, and green stands for licenses permissive for commercial use.
models in ensuring compliance and mitigating the risk of misinformation. Understanding and managing the way a model operates, the data it accesses, and the outputs it produces form the cornerstone of responsible AI usage.

Another crucial advantage of small language models is affordability. Take an average Kaggle competitor as an example. The demanding nature of a Kaggle competition requires the competitor to iterate on modeling solutions, often by integrating a variety of data sources and trying different architectures. This necessitates transparent model components and a fast iteration pace, which is at odds with the resource requirements that super large language models impose. Having access to fast and inexpensive training/inference options means that they will not have to face the trade-off between being constrained in their innovation space and moving away from language model solutions entirely. As another example, which is elaborated in Section 5, privacy-sensitive sectors such as finance and healthcare face a more pressing challenge of choosing between regulation risks and the prohibitive cost of training massive models in-house. Small language models provide the opportunity for them to conform with regulations while not missing out on the power of the latest AI technologies.

Outline

In the following sections, we first present a brief yet rich background. Next, we discuss how to attain small foundation models, including parameter reduction and efficient training/fine-tuning techniques. Then we present a comparative study of “small” foundation models and a brief discussion of evaluation methods. After that, we discuss the application scenarios where small foundation models are most needed in the real world. We conclude with discussions and an outlook.

2 A brief yet rich background

The Giants are fast. ChatGPT set a record for the fastest-growing user base: one million users in five days, and 100 million monthly active users in January 2023, two months after launching.

Radford et al. (2018) introduce generative pre-training for LMs, which could be regarded as “GPT-1”. Radford et al. (2019) introduce GPT-2, an unsupervised multitask learning LM. Brown et al. (2020) introduce GPT-3, a few-shot learning LM, popularizing the concept of in-context learning. OpenAI (2022) introduces ChatGPT and OpenAI (2023) introduces GPT-4.

There are many LMs released in recent years: Google BERT, Bidirectional Encoder Representations from Transformers (Devlin et al., 2019), Google T5, Text-To-Text Transfer Transformer (Raffel et al., 2020), Google LaMDA, Language Model for Dialogue Applications (Thoppilan et al., 2022), Google PaLM, Pathways Language Model (Chowdhery et al., 2022), DeepMind Sparrow (Glaese et al., 2022), Anthropic Claude (Bai et al., 2022a), DeepMind Chinchilla (Hoffmann et al., 2022), Nvidia Megatron-Turing NLG (Smith et al., 2022), DeepMind Gopher (Rae et al., 2022), HuggingFace BLOOM (BigScience Workshop et al., 2023), and Meta LLaMA, Large Language Model Meta AI (Touvron et al., 2023).

Language models as experts. Besides general purpose LMs as above, there are many specialized models for various applications; Table 1 shows a sample of them.
Model            Application      Reference
AlphaFold        Protein folding  Tunyasuvunakool et al. (2021)
Codex            Coding           Chen et al. (2023)
AlphaCode        Coding           Li et al. (2022)
RT-1             Robotics         Brohan et al. (2022)
BiomedGPT        Biomedical       Zhang et al. (2023a)
Clinical Camel   Clinical         Toma et al. (2023)
BloombergGPT     Finance          Wu et al. (2023b)
FinGPT           Finance          Yang et al. (2023)
Med-PaLM 2       Medical          Singhal et al. (2023)
MusicLM          Music            Agostinelli et al. (2023)
AudioGPT         Audio            Huang et al. (2023)
Language and functional competence. Mahowald et al. (2023) study language competence vs thought competence of LMs and show impressive but imperfect formal linguistic competence, i.e., “knowledge of rules and patterns of a given language”, yet failures on many tests requiring functional linguistic competence, i.e., “a host of cognitive abilities required for language understanding and use in the real world”.

Then we can leverage LMs’ competence as a good model of language, e.g., by prompt engineering. We can also manage to improve the functional competence, e.g., factuality, safety, and planning. With the capacity of in-context learning (Brown et al., 2020), prompting is a natural and popular way to utilize LMs. Prompting is the user interface for LMs, and can be formed with advanced methods like search and coding, e.g., Tree of Thoughts (ToT) (Yao et al., 2023), AdaPlanner (Sun et al., 2023), Code as Policies (Liang et al., 2023a). Fine-tuning can improve LMs further. A parameter efficient approach makes fine-tuning large LMs feasible considering the cost (Hu et al., 2021; Ding et al., 2023; Ruder et al., 2022). Augmenting LMs with tools can achieve various functionalities.

To approach artificial general intelligence (AGI) from language models, Mahowald et al. (2023) suggest that, “instead of or in addition to scaling up the size of the models, more promising solutions will come in the form of modular architectures . . . , like the human brain, integrate language processing with additional systems that carry out perception, reasoning, and planning”. The authors believe that “a model that succeeds at real-world language use would include – in addition to the core language component – a successful problem solver, a grounded experiencer, a situation modeler, a pragmatic reasoner, and a goal setter”.

Augmented LMs with tools. A natural way to harness the language competence of LMs is by utilizing tools like a search engine, a vector database, a code interpreter, or a solver to handle tasks, e.g., LangChain (https://langchain.com), HuggingGPT (Shen et al., 2023), Visual ChatGPT (Wu et al., 2023a), TaskMatrix.AI (Liang et al., 2023b), RCI (Kim et al., 2023), LLM+P (Liu et al., 2023a), ChemCrow (Bran et al., 2023), etc. See Mialon et al. (2023) for a survey about augmented LMs.

Domain expertise is still required, e.g., the ChemCrow authors (Bran et al., 2023) mention that “However, it is important to emphasize that potential risks may arise for non-experts who lack the chemical reasoning to evaluate results or the proper lab training, as conducting experiments still necessitates thorough laboratory experience.” and the director of the movie trailer mentions that “For those who believe that AI will do everything for you: No!” and “I’ll always prefer to put my own heart & soul in.” (https://twitter.com/ChristianF369/status/1651607149804498946)

Mini-Giants are coming. Following the leakage of LLaMA (Touvron et al., 2023), many “small” LMs appear in the open source community, with neural network parameter sizes of around 10B or smaller, e.g., Alpaca (Taori et al., 2023), Dolly (Conover et al., 2023), Koala (Geng et al., 2023), Vicuna (Chiang et al., 2023), StableLM (Stability AI, 2023a), ChatGLM (Du et al., 2020; Zeng et al., 2023), Guanaco (Dettmers et al., 2023), Pythia (Biderman et al., 2023), GPT4All (https://github.com/nomic-ai/gpt4all), OpenAssistant (https://github.com/LAION-AI/Open-Assistant), ColossalChat (You, 2023).

See Kim (2023) for a list of open sourced fine-tuned LMs. In Section 4, we will discuss and com-
pare these mini-giants in detail.

Discussions & debates abound. There are all sorts of discussions & debates, e.g., discussions about AI alignment with human values from Russell (2019); Mitchell (2020); Christian (2021). Table 2 lists a few representative examples.

Issue to discuss                          Reference
The dangers of stochastic parrots         Bender et al. (2021)
Limitation of neural networks             Delétang et al. (2023)
Limitation of autoregressive models       Lin et al. (2021)
Lack of causality                         Jin et al. (2023)
Lack of compositionality                  Dziri et al. (2023)
Lack of recursion                         Zhang et al. (2023b)
Limitation of scaling laws                Deshpande et al. (2023)
Limitation of scaling laws                McKenzie et al. (2023)
Model collapse                            Shumailov et al. (2023)
Artificial general intelligence (AGI)     Marcus (2023)
Evaluation of AI                          Burnell et al. (2023)
Distortion of human beliefs               Kidd and Birhane (2023)
Social norms                              Browning and LeCun (2023)
Risks and benefits                        Goldman (2023)
Existential risk                          Bengio (2023)
Court hearing due to hallucination        Novak (2023)
Risk of further concentration of wealth   Chiang (2023)
Eight things to know                      Bowman (2023)

3 How to make large foundation models "small"

Since the advent of ultra-capable large foundation models like ChatGPT and Stable Diffusion, numerous efforts have been devoted to addressing the primary challenges to their widespread utilization: their humongous parameter sizes and the sheer time and compute resources needed to fine-tune them. Within 2 years, the research and open source community has arrived at several strategies to cope with this issue, which we discuss in this section.

We classify these strategies into 2 groups: ones that directly reduce the parameter sizes, and ones that make fine-tuning large models more efficient.

3.1 Foundation models with reduced parameters

Chinchilla (Hoffmann et al., 2022) is the first influential study on the computational efficiency of modern large language models. It put forward the argument that, given a compute budget, the best model is attained not simply by a larger parameter size, but by training a smaller model on more data tokens. Based on this principle, the authors produced the Chinchilla 70B model, which outperforms prior models 4 times as large, with the same amount of compute.

LLaMA (Touvron et al., 2023) further reduces the parameters and released a series of models ranging from 7B to 65B parameters, following the Chinchilla computation rule. Notably, the paper used only publicly available datasets as its training corpus and demonstrated performance comparable to closed source counterparts. This, as commented by Harris et al., started a revolution of open source LLMs. Along with parameter reduction, another contribution by the authors is an efficient implementation of multi-head attention layers through the open source xformers library, which optimizes memory consumption in training.

3.2 Efficient fine-tuning strategies for foundation models

Compared with building even more compact models, the majority of research work by the ML community in the direction of "smaller" foundation models is around making them easier to fine-tune. Here we list several key strategies to achieve this.

Adapter (Houlsby et al., 2019) is a strategy to add NN layers after existing layers (usually transformer blocks) in pretrained foundation models, so that they can be adapted to custom tasks without changing the weights of existing layers. The paper proposes an adapter module with two linear layers and a non-linear activation in between. The first layer projects the hidden state to a lower-dimensional space, and the second layer projects it back to the original dimension. A newer paper (Lin et al., 2020) recommended only one linear layer plus an additional LayerNorm as an Adapter module. Adapter achieves near state-of-the-art performance while adding only a small number of parameters per task: on GLUE, the added parameters accounted for 3.6% of the original model.

Prefix fine-tuning (Li and Liang, 2021). Unlike the Adapter architecture, which focuses on modifying model behavior via model params, Prefix fine-tuning seeks to train a few params that are used as input prefixes for each custom subtask. The authors commented that the method is inspired by prompting: similar to prepending a few sentences before a generation task, Prefix-tuning prepends a sequence of trained vectors to the input, except that the prefix vectors do not have to correspond to any real tokens. Compared to full fine-tuning, prefix fine-tuning achieves comparable or better performance with just 0.1% added parameters.

LoRA (Hu et al., 2021) marks substantial progress in parameter efficient fine-tuning. Performance-wise, it is more efficient than previous methods like Adapter and Prefix fine-tuning. LoRA proposes that we add a low rank, trainable matrix in parallel to the frozen, pretrained model weights. The activation will be the sum of the outputs of these two matrices. Formally:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

where $B$ and $A$ are much "thinner" (i.e. low rank), trainable matrices compared to $W_0$ (the frozen pretrained matrix). The use of low rank matrices reduces the number of trainable parameters
by up to 10,000 times, compared to a full fine-tune of GPT-3 175B. The article suggests that LoRA can be used next to any model weights, not just transformer layers. The authors claim that LoRA is superior to Adapters in that it doesn’t introduce additional inference latency, and that it is better than Prefix fine-tuning in that it doesn’t reduce the available sequence length like the latter does. Furthermore, since this architectural modification is orthogonal to the ideas of Adapter and Prefix fine-tuning, LoRA can be used in conjunction with them for even better results.

QLoRA (Dettmers et al., 2023). As an improvement of LoRA, QLoRA proposes optimization methods via quantized low rank fine-tuning. Innovations of QLoRA include a 4-bit data type, NormalFloat4, which optimizes information efficiency for normally distributed data (e.g. weights) based on information theory. Apart from that, the paper uses Paged Optimizers (partial optimizer state stored on CPU rather than GPU) to manage memory spikes, such as when processing mini batches with long sequence lengths. Experiment results show that fine-tuning using QLoRA reaches 99.3% of the performance of ChatGPT on the Vicuna benchmark, and only requires training for 24 hours on one GPU.

ControlNet (Zhang and Agrawala, 2023) is proposed as a method to efficiently fine-tune image generation models (diffusion models) on user-defined tasks. Because image generation models in general have a larger design space in terms of user interaction than language models, we list this method here to inspire the readers to consider more complex scenarios of controlling/customizing large foundation models’ outputs.

ControlNet keeps the weights of the original model in a frozen copy (like all methods mentioned above). The trainable branch consists of an exact copy of the frozen model, as well as two convolution layers called "zero convolutions", one before and one after the trainable copy. In the fine-tuning forward path, the activation from the trainable copy is combined with that of the frozen copy through a zero convolution. The so-called zero convolution is just a 1x1 convolution layer that is initialized with both weights and biases set to zero. The results show that in some tasks, ControlNets trained on a personal computer achieve results comparable to commercial models trained on terabytes of GPU memory and thousands of GPU hours.

4 A brief survey of “small” instruction-following LMs

Over the past few months, we have seen small LMs flourish. See Figure 1 for an evolution tree. This is a very fast progressing field, and it is challenging to even keep up with the latest progress. Quoting Tunguz (2023), “Trying to get ahead in AI these days feels like wrestling a rabid 5,000 lbs hippo covered in baby oil”.

4.1 Closed-source milestones

GPT-3 (Brown et al., 2020) gained public attention when it was released in 2020. As reported by the New York Times, it “generates tweets, pens poetry,
summarizes emails, answers trivia questions, translates languages and even writes its own computer programs” (Markoff, 2020). It shows that decent few-shot performance can be achieved without gradient updates, and that the unprecedented model scale (175B parameters) is a key ingredient for success.

Basic info                             Scale                                                        Openness
Time (MM/YY)  Model       Institute    # parameters  Training hardware cost  Training data size    L   I   TC  TD
06/20         GPT-3       OpenAI       175B          3.64k PF-days           300B tokens           P   P   P   ✓
02/23         LLaMA-7B    Meta         7B            82k GPU-hours           1.4T tokens           NC  ✓   ✗   ✓
02/23         LLaMA-13B   Meta         13B           135k GPU-hours          1.4T tokens           NC  ✓   ✗   ✓
04/23         Pythia-7B   Eleuther AI  7B            33.5k GPU-hours         300B tokens           C   ✓   ✓   ✓
04/23         Pythia-12B  Eleuther AI  12B           72k GPU-hours           300B tokens           C   ✓   ✓   ✓

Table 3: Comparison of recent base LMs. In the Openness section, L stands for License, I stands for Inference, TC stands for Training Codes, and TD stands for Training Data. In the License column, P stands for Proprietary, NC stands for Non-Commercial, and C stands for permissive for Commercial use.

Table 4: Comparison of recent instruction-following small LMs. The abbreviations of the column names follow Table 3.

by incorporating dialogue data into the supervised fine-tuning and the RLHF stage. It acquired 1 million users in just 5 days and revolutionized the way people interact with modern AIs. As a proprietary product, although the web UI is free, the underlying model can only be accessed via a paid API.
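To make the Adapter description in Section 3.2 concrete, the following is a minimal pure-Python sketch of a bottleneck adapter: a down-projection, a non-linearity, and an up-projection. The ReLU activation, the residual connection around the module, and the helper names (`linear`, `adapter`, `adapter_param_count`) are assumptions made for illustration and are not taken from the text above.

```python
def linear(W, b, x):
    """Affine map: one output per row of W, plus the matching bias."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(v):
    """Element-wise ReLU (assumed non-linearity for this sketch)."""
    return [max(0.0, t) for t in v]


def adapter(x, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: project d -> m, apply ReLU, project m -> d.

    The result is added back to the input (assumed residual connection),
    so the frozen layer's output passes through unchanged when W_up = 0.
    """
    z = relu(linear(W_down, b_down, x))   # hidden state in the small space
    delta = linear(W_up, b_up, z)         # back to the original dimension
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d, m):
    """Weights and biases of both projections: (d*m + m) + (m*d + d)."""
    return 2 * d * m + d + m


# Toy example with hidden size d = 4 and bottleneck size m = 2.
x = [0.5, -1.0, 2.0, 0.0]
W_down = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
b_down = [0.0, 0.0]
W_up = [[0.0, 0.0] for _ in range(4)]     # zero up-projection
b_up = [0.0] * 4
y = adapter(x, W_down, b_down, W_up, b_up)
```

Only the adapter's own weights would be trained, which is why the added parameter count stays small: with hypothetical sizes d = 768 and m = 64, `adapter_param_count` gives 99,136 parameters per adapter.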
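The LoRA update from Section 3.2, h = W0 x + B A x, can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the toy sizes and the helper names (`matvec`, `matmul`, `lora_forward`, `merge_weights`, `param_counts`) are ours, and B is initialized to zero so that the low-rank update starts at zero.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrices P (n x k) and Q (k x m)."""
    q_cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in q_cols] for row in P]


def lora_forward(W0, B, A, x):
    """h = W0 x + B (A x). W0 stays frozen; only B and A would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def merge_weights(W0, B, A):
    """Fold the low-rank update into a single matrix W0 + B A for inference."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]


def param_counts(d_out, d_in, r):
    """Return (parameters in W0, trainable parameters in B and A)."""
    return d_out * d_in, r * (d_in + d_out)


rng = random.Random(0)
d_out, d_in, r = 6, 8, 2  # toy sizes, chosen only for illustration
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r
x = [rng.gauss(0, 1) for _ in range(d_in)]

h_init = lora_forward(W0, B, A, x)  # equals the frozen model's output at init
```

Because `merge_weights` folds B A into W0 after training, inference uses one matrix of the original shape, which is consistent with the claim that LoRA adds no inference latency. For a single hypothetical 4096 x 4096 weight matrix with rank r = 8, `param_counts` gives 16,777,216 frozen parameters against 65,536 trainable ones.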
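The zero convolution used by ControlNet (Section 3.2) can be illustrated with a small pure-Python sketch. It shows only the output-side combination of the two branches; the nested-list feature-map layout and the names `conv1x1`, `zero_conv_params`, and `combine` are assumptions made for illustration, not the original implementation.

```python
def conv1x1(x, W, b):
    """1x1 convolution over a feature map.

    x: nested lists indexed [channel][row][col]
    W: weights indexed [out_channel][in_channel]
    b: one bias per output channel
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [b[o] + sum(W[o][c] * x[c][i][j] for c in range(len(x)))
             for j in range(width)]
            for i in range(height)
        ]
        for o in range(len(W))
    ]


def zero_conv_params(c_in, c_out):
    """Weights and biases of a zero convolution at initialization."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


def combine(frozen_out, trainable_out, W, b):
    """Add the zero-convolved trainable activation to the frozen activation."""
    branch = conv1x1(trainable_out, W, b)
    return [
        [[f + t for f, t in zip(f_row, t_row)] for f_row, t_row in zip(f_ch, t_ch)]
        for f_ch, t_ch in zip(frozen_out, branch)
    ]


# Two channels, 2x2 spatial size (toy values).
frozen_out = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.5], [0.0, 2.5]]]
trainable_out = [[[0.3, -0.7], [1.2, 0.0]], [[5.0, -2.0], [0.1, 0.9]]]
W, b = zero_conv_params(c_in=2, c_out=2)
out = combine(frozen_out, trainable_out, W, b)
```

With all weights and biases at zero, the trainable branch contributes nothing at the start of fine-tuning, so `out` equals `frozen_out` and the pretrained model's behavior is preserved until the zero convolution's parameters move away from zero.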
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. https://tinyurl.com/3v9jss39.

Alison Darcy, Aaron Beaudette, Emil Chiauzzi, Jade Daniels, Kim Goodwin, Timothy Y. Mariano, Paul Wicks, and Athena Robinson. 2022. Anatomy of a Woebot® (WB001): agent guided CBT for women with postpartum depression. Expert Review of Medical Devices, 19(4):287–301. PMID: 35748029.

Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and

Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling.

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2022. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. arXiv.

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. Blog post.
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv.

Sharon Goldman. 2023. Top AI researcher dismisses AI ‘extinction’ fears, challenges ‘hero scientist’ narrative. https://tinyurl.com/bdd772p5.

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks are all you need. arXiv.

Derrick Harris, Matt Bornstein, and Guido Appenzeller. AI canon. https://a16z.com/2023/05/25/ai-canon/. Accessed: 2023-07-02.

John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. 2023. Backpack language models. In ACL.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models. arXiv.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv.

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe. 2023. AudioGPT: Understanding and generating speech, music, sound, and talking head. arXiv.

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2023. Can large language models infer causation from correlation? arXiv.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv.

Celeste Kidd and Abeba Birhane. 2023. How AI can distort human beliefs. Science, 380(6651):1222–1223.

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. arXiv.

Sung Kim. 2023. List of open sourced fine-tuned large language models (LLM). https://tinyurl.com/ykf57jd6.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. 2023. OpenAssistant conversations – democratizing large language model alignment.

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, and Percy Liang. 2022. Evaluating human-language model interaction. arXiv.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In ACL.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023a. Code as policies: Language model programs for embodied control. arXiv.

Percy Liang et al. 2022. Holistic evaluation of language models. arXiv.

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. 2023b. TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs. arXiv.
Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, and Jason Eisner. 2021. Limitations of autoregressive models and their alternatives. In NAACL.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv.

Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2020. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829.

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023a. LLM+P: Empowering large language models with optimal planning proficiency. arXiv.

Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. 2023b. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv.

Tiedong Liu and Bryan Kian Hsiang Low. 2023. Goat: Fine-tuned LLaMA outperforms GPT-4 on arithmetic tasks. arXiv.

Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. arXiv.

Gary Marcus. 2023. The sparks of AGI? or the end of science? https://cacm.acm.org/blogs/blog-cacm/271354-the-sparks-of-agi-or-the-end-of-science/fulltext.

John Markoff. 2020. The minds behind the AI ’arms race’. The New York Times.

Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R. Bowman, and Ethan Perez. 2023. Inverse scaling: When bigger isn’t better. arXiv.

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented language models: a survey. arXiv.

Melanie Mitchell. 2020. Artificial Intelligence: A Guide for Thinking Humans. Picador.

Matt Novak. 2023. Lawyer uses ChatGPT in federal court and it goes horribly wrong. https://tinyurl.com/5n7uk84m.

OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.

OpenAI. 2023. GPT-4. https://openai.com/research/gpt-4.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. arXiv.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. arXiv.

Jack W. Rae et al. 2022. Scaling language models: Methods, analysis & insights from training Gopher. arXiv.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. JMLR, 21(140):1–67.

Sebastian Ruder, Jonas Pfeiffer, and Ivan Vulic. 2022. Modular and parameter-efficient fine-tuning for NLP models. In EMNLP: Tutorial Abstracts.

Stuart Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. arXiv.

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan. 2023. Towards expert-level medical question answering with large language models. arXiv.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv.
Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv.
Stability AI. 2023a. StableLM: Stability AI language models. https://github.com/stability-AI/stableLM/.
Stability AI. 2023b. StableVicuna: Open-Source RLHF Chatbot. https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot. Accessed on July 4, 2023.
Steven Loeb. 2021. Woebot CEO Michael Evers on AI in mental health, and how to get a chatbot to bond with a human. https://vator.tv/news/2021-07-30-woebot-ceo-michael-evers-on-ai-in-mental-health-and-how-to-get-a-chatbot-to-bond-with-a-hu Accessed on July 1, 2023.
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive planning from feedback with language models. arXiv.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Romal Thoppilan et al. 2022. LaMDA: Language models for dialog applications. arXiv.
Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang. 2023. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv.
Bojan Tunguz. 2023. Tweet by Bojan Tunguz. https://twitter.com/tunguz/status/1673760614576189441?s=20. [Accessed July 1, 2023].
Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu, Tim Green, Michal Zielinski, Augustin Žídek, Alex Bridgland, Andrew Cowie, Clemens Meyer, Agata Laydon, Sameer Velankar, Gerard J. Kleywegt, Alex Bateman, Richard Evans, Alexander Pritzel, Michael Figurnov, Olaf Ronneberger, Russ Bates, Simon A. A. Kohl, Anna Potapenko, Andrew J. Ballard, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Ellen Clancy, David Reiman, Stig Petersen, Andrew W. Senior, Koray Kavukcuoglu, Ewan Birney, Pushmeet Kohli, John Jumper, and Demis Hassabis. 2021. Highly accurate protein structure prediction for the human proteome. Nature, 596(7873):590–596.
Roberto Verdecchia, June Sallou, and Luis Cruz. 2023. A systematic review of green AI. Data Mining Knowledge Discovery.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning language model with self generated instructions. In ACL.
Woebot Health. 2022. Woebot for Substance Use Disorders. https://classic.clinicaltrials.gov/ct2/show/study/NCT04096001. Accessed on July 1, 2023.
Woebot Health. 2023. Woebot Health Enrolls First Patient in Pivotal Clinical Trial of WB001 for Postpartum Depression. https://woebothealth.com/woebot-health-enrolls-first-patient-in-pivotal-clinical- Accessed on July 1, 2023.
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023a. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv.
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023b. BloombergGPT: A large language model for finance. arXiv.
Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-source financial large language models. arXiv.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv.
Yang You. 2023. ColossalChat: An open-source solution for cloning ChatGPT with a complete RLHF pipeline. https://bit.ly/42ZTwW4.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An open bilingual pre-trained model. In ICLR.
Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun. 2023a. BiomedGPT: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv.
Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. arXiv.
Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, and Talia Ringer. 2023b. Can transformers learn to solve problems recursively? arXiv.