Mini Giant
[...] models are flourishing and becoming more and more competent. We call them "mini-giants". We argue that open source communities like Kaggle and mini-giants will achieve a win-win in many ways, technically, ethically and socially. In this article, we present a brief yet rich background, discuss how to attain small language models, present a comparative study of small language models and a brief discussion of evaluation methods, discuss the application scenarios where small language models are most needed in the real world, and conclude with discussion and outlook.

1 Introduction

Large language models (LMs), like ChatGPT and GPT-4, have taken us by storm. People compare it to the moment of the computer, the moment of the operating system, the moment of the Internet, or the moment of the iPhone. It is considered by many a paradigm shift in NLP and deep learning.

Large language models are large: OpenAI GPT-3 has 175B parameters, Google PaLM has 540B, and rumor has it that GPT-4 is as large as 8 × 220B. For most small/medium companies and independent researchers, it is prohibitively expensive to train or update such giant models. In addition, the huge energy consumption of language model training poses a serious concern for environmental sustainability (Verdecchia et al., 2023).

Recent studies show that network size is not the sole determinant of model performance (Hoffmann et al., 2022). And thanks to the efforts from the ML open source community as well as private AI companies, we’ve recently seen more and more "small" LMs created out of these larger models. With their network parameter sizes of around or [...] for these small language models. We show that compared to their large counterparts, small language/foundation models offer particularly promising opportunities for various industries (including open source ML research and Kaggle competitions) to not only utilize but also to actively participate in the creation/adaptation of modern language models and AI in general. We center our arguments around the 3 key advantages of small models: adaptability, controllability, and affordability.

First of all, smaller models offer better adaptability by being more manageable to modify and fine-tune. In Section 3, we present various strategies of creating these small models through optimized fine-tuning techniques. This is important because in most industries (or even in a Kaggle competition), innovation typically arises from the ability to incorporate domain-specific data into the language model or to adjust the model’s structure to accommodate their unique requirements. Relying solely on prompt engineering often falls short. Therefore, smaller language models bring forward great benefits to these industries, offering the much-needed flexibility for adaptation, allowing them to fully leverage the power of AI and thus catalyzing innovation within them.

Second, smaller models can run on local infrastructure without resorting to GPU-rich third parties, improving the model’s controllability by ensuring model users’ autonomous data governance and result monitoring. In Section 5, we discuss real world scenarios where small language models fill in the gaps when their large counterparts are unacceptable due to privacy concerns. In Section 4, we also look into strategies for customized instruction following and other pioneer research directions for small models, underpinning the relevance of smaller language models in ensuring compliance and mitigating the risk of misinformation. Understanding and managing the way a model operates, the data it accesses, and the outputs it produces form the cornerstone of responsible AI usage.

[Figure 1 image not recoverable. Axes: data quality (synthetic, human-curated) vs. time (Jun’20 to May’23). Models shown: GPT-3, InstructGPT, ChatGPT, LLaMA, Pythia, Alpaca, Vicuna, GPT4All, Dolly 1.0, Dolly 2.0, Koala, StableVicuna, Guanaco, Open Assistant.]
Figure 1: An evolution tree of recently released instruction-following small LMs. The color of the text boxes indicates the openness of the license under which the models are released: red stands for proprietary licenses, yellow stands for non-commercial licenses, and green stands for licenses permissive for commercial use.

Another crucial aspect of the superiority of small language models is affordability. Take an average Kaggle competitor as an example. The demanding nature of a Kaggle competition requires the competitor to iterate on the modeling solutions, oftentimes by integrating a variety of data sources and trying different architectures. This necessitates transparent model components and a fast iteration pace, which is at odds with the resource requirements that super large language models impose. Having access to fast and inexpensive training/inferencing options means that they will not have to face the trade-off between being constrained in their innovation space and moving away from language model solutions entirely. As another example, which is elaborated in Section 5, privacy-sensitive sectors such as finance and healthcare face a more pressing challenge of choosing between regulation risks and the prohibitive cost of training massive models in-house. Small language models provide the opportunity for them to conform with regulations while not missing out on the power of the latest AI technologies.

Outline

In the following sections, we first present a brief yet rich background. Next, we discuss how to attain small foundation models, including parameter reduction and efficient training/fine-tuning techniques. Then we present a comparative study of “small” foundation models and a brief discussion of evaluation methods. After that, we discuss the application scenarios where small foundation models are most needed in the real world. We conclude with discussions and an outlook.

2 A brief yet rich background

The Giants are fast. ChatGPT set a record for fastest-growing user base: one million users in five days, and 100 million monthly active users in January 2023, two months after launching.

Radford et al. (2018) introduce generative pre-training for LMs, which could be regarded as “GPT-1”. Radford et al. (2019) introduce GPT-2, an unsupervised multitask learning LM. Brown et al. (2020) introduce GPT-3, a few-shot learning LM, popularizing the concept of in-context learning. OpenAI (2022) introduces ChatGPT and OpenAI (2023) introduces GPT-4.

There are many LMs released in recent years: Google BERT, Bidirectional Encoder Representations from Transformers (Devlin et al., 2019); Google T5, Text-To-Text Transfer Transformer (Raffel et al., 2020); Google LaMDA, Language Model for Dialogue Applications (Thoppilan et al., 2022); Google PaLM, Pathways Language Model (Chowdhery et al., 2022); DeepMind Sparrow (Glaese et al., 2022); Anthropic Claude (Bai et al., 2022a); DeepMind Chinchilla (Hoffmann et al., 2022); NVIDIA Megatron-Turing NLG (Smith et al., 2022); DeepMind Gopher (Rae et al., 2022); HuggingFace BLOOM (BigScience Workshop et al., 2023); and Meta LLaMA, Large Language Model Meta AI (Touvron et al., 2023).

Language models as experts. Besides the general purpose LMs above, there are many specialized models for various applications; Table 1 shows a sample of them.
Model            Application      Reference
AlphaFold        Protein folding  Tunyasuvunakool et al. (2021)
Codex            Coding           Chen et al. (2023)
AlphaCode        Coding           Li et al. (2022)
RT-1             Robotics         Brohan et al. (2022)
BiomedGPT        Biomedical       Zhang et al. (2023a)
Clinical Camel   Clinical         Toma et al. (2023)
BloombergGPT     Finance          Wu et al. (2023b)
FinGPT           Finance          Yang et al. (2023)
Med-PaLM 2       Medical          Singhal et al. (2023)
MusicLM          Music            Agostinelli et al. (2023)
AudioGPT         Audio            Huang et al. (2023)
Language and functional competence. Mahowald et al. (2023) study language competence vs thought competence of LMs and show impressive but imperfect formal linguistic competence, i.e., “knowledge of rules and patterns of a given language”, yet failures on many tests requiring functional linguistic competence, i.e., “a host of cognitive abilities required for language understanding and use in the real world”.

Then we can leverage LMs’ competence as a good model of language, e.g., by prompt engineering. We can also manage to improve the functional competence, e.g., factuality, safety, and planning. With the capacity of in-context learning (Brown et al., 2020), prompting is a natural and popular way to utilize LMs. Prompting is the user interface for LMs, and can be formed with advanced methods like search and coding, e.g., Tree of Thoughts (ToT) (Yao et al., 2023), AdaPlanner (Sun et al., 2023), Code as Policies (Liang et al., 2023a). Fine-tuning can improve LMs further. A parameter efficient approach makes fine-tuning large LMs feasible considering the cost (Hu et al., 2021; Ding et al., 2023; Ruder et al., 2022). Augmenting LMs with tools can achieve various functionalities.

To approach artificial general intelligence (AGI) from language models, Mahowald et al. (2023) suggest that, “instead of or in addition to scaling up the size of the models, more promising solutions will come in the form of modular architectures . . . , like the human brain, integrate language processing with additional systems that carry out perception, reasoning, and planning”. The authors believe that “a model that succeeds at real-world language use would include – in addition to the core language component – a successful problem solver, a grounded experiencer, a situation modeler, a pragmatic reasoner, and a goal setter”.

Augmented LMs with tools. A natural way to harness the language competence of LMs is by utilizing tools like a search engine, a vector database, a code interpreter, or a solver to handle tasks, e.g., LangChain, HuggingGPT (Shen et al., 2023), Visual ChatGPT (Wu et al., 2023a), TaskMatrix.AI (Liang et al., 2023b), RCI (Kim et al., 2023), LLM+P (Liu et al., 2023a), ChemCrow (Bran et al., 2023), etc. See Mialon et al. (2023) for a survey about augmented LMs.

Domain expertise is still required, e.g., the ChemCrow (Bran et al., 2023) authors mention that “However, it is important to emphasize that potential risks may arise for non-experts who lack the chemical reasoning to evaluate results or the proper lab training, as conducting experiments still necessitates thorough laboratory experience.” and the director of the movie trailer mentions that “For those who believe that AI will do everything for you: No!” and “I’ll always prefer to put my own heart & soul in.”
[Footnote 2, truncated in the source: ...1651607149804498946]

Mini-Giants are coming. Following the leakage of LLaMA (Touvron et al., 2023), many “small” LMs appear in the open source community, with neural network parameter sizes of around 10B or smaller, e.g., Alpaca (Taori et al., 2023), Dolly (Conover et al., 2023), Koala (Geng et al., 2023), Vicuna (Chiang et al., 2023), StableLM (Stability AI, 2023a), ChatGLM (Du et al., 2020; Zeng et al., 2023), Guanaco (Dettmers et al., 2023), Pythia (Biderman et al., 2023), GPT4All, Open-Assistant, ColossalChat (You, 2023). See Kim (2023) for a list of open sourced fine-tuned LMs. In Section 4, we will discuss and compare these mini-giants in detail.

Discussions & debates abound. There are all sorts of discussions & debates, e.g., discussions about AI alignment with human values from Russell (2019); Mitchell (2020); Christian (2021). Table 2 lists a few representative examples.

Issue to discuss                          Reference
The dangers of stochastic parrots         Bender et al. (2021)
Limitation of neural networks             Delétang et al. (2023)
Limitation of autoregressive models       Lin et al. (2021)
Lack of causality                         Jin et al. (2023)
Lack of compositionality                  Dziri et al. (2023)
Lack of recursion                         Zhang et al. (2023b)
Limitation of scaling laws                Deshpande et al. (2023)
Limitation of scaling laws                McKenzie et al. (2023)
Model collapse                            Shumailov et al. (2023)
Artificial general intelligence (AGI)     Marcus (2023)
Evaluation of AI                          Burnell et al. (2023)
Distortion of human beliefs               Kidd and Birhane (2023)
Social norms                              Browning and LeCun (2023)
Risks and benefits                        Goldman (2023)
Existential risk                          Bengio (2023)
Court hearing due to hallucination        Novak (2023)
Risk of further concentration of wealth   Chiang (2023)
Eight things to know                      Bowman (2023)

3 How to make large foundation models "small"

Since the advent of ultra-capable large foundation models like ChatGPT and StableDiffusion, numerous efforts have been devoted to addressing the primary challenges for their wide-spread utilization: their humongous parameter sizes and the sheer time and compute resources needed to fine-tune them. Within 2 years, the research and open source community have arrived at several strategies to cope with this issue, which we will discuss in this section.

We classify these strategies into 2 groups: ones that directly reduce the parameter sizes, and ones that make fine-tuning large models more efficient.

3.1 Foundation models with reduced parameters

Chinchilla (Hoffmann et al., 2022) is the first influential study on computational efficiency of modern large language models. It put forward the argument that given a compute budget, the best model is attained not by larger parameter size, but by more training data tokens. Based on this principle, the authors produced the Chinchilla 70B model, which outperforms prior models 4 times as large with the same amount of compute.

LLaMA (Touvron et al., 2023) further reduces the parameters and released a series of models ranging from 7B to 65B parameters, following the Chinchilla computation rule. Notably, the paper used only publicly available datasets as training corpus and demonstrated performance comparable to that of closed source counterparts. This, as commented by Harris et al., started a revolution of open source LLMs. Along with parameter reduction, another contribution by the authors is an efficient implementation of multi-headed attention layers through the open source xformers library, which optimizes the memory consumption in training.

3.2 Efficient fine-tuning strategies for foundation models

Compared with building even more compact models, the majority of research work by the ML community in the direction of "smaller" foundation models is around making them easier to fine-tune. Here we list several key strategies to achieve this.

Adapter (Houlsby et al., 2019) is a strategy to add NN layers after existing layers (usually transformer blocks) in pretrained foundation models, so that they can be adapted to custom tasks without changing the weights of existing layers. The paper proposes an adapter module with two linear layers plus a non-linear activation in between. The first layer projects the hidden state to a lower-dimensional space, and the second layer projects it back to the original dimension. A newer paper (Lin et al., 2020) recommended only one linear layer plus an additional LayerNorm as an Adapter module. Adapter achieves near state-of-the-art performance while adding only a small number of parameters per task: on GLUE, the added parameters accounted for 3.6% of the original model.

Prefix fine-tuning (Li and Liang, 2021). Unlike the Adapter architecture, which focuses on modifying model behavior via model params, Prefix fine-tuning seeks to train a few params that are used as input prefixes for each custom sub task. The authors commented that the method is inspired by prompting: similar to prepending a few sentences before a generation task, Prefix fine-tuning prepends a sequence of trained vectors to the input, except that the prefix vectors do not have to correspond to any real tokens. Compared to full fine-tuning, Prefix fine-tuning achieves comparable or better performance with just 0.1% added parameters.

LoRA (Hu et al., 2021) marks substantial progress in parameter efficient fine-tuning. Performance-wise, it is more efficient than previous methods like Adapter and Prefix fine-tuning. LoRA proposes that we add a low rank, trainable matrix in parallel to the frozen, pretrained model weights. The output activation is the sum of the outputs of these two branches. Formally:

h = W_0 x + ∆W x = W_0 x + BAx

where B and A are much "thinner" (i.e. low rank), trainable matrices compared to W_0 (the frozen pretrained matrix). The use of low rank matrices reduces trainable parameters to as much as
by 10,000 times relative to a full fine-tune of GPT-3 175B. The article suggests that LoRA can be used next to any model weights, not just transformer layers. The authors claim that LoRA is superior to Adapters in that it doesn’t introduce additional inference latency, and that it is better than Prefix fine-tuning in that it doesn’t reduce the available sequence length like the latter does. Furthermore, since this architectural modification is orthogonal to the ideas of Adapter and Prefix fine-tuning, LoRA can be used in conjunction with them for even better results.

QLoRA (Dettmers et al., 2023). As an improvement of LoRA, QLoRA proposes optimization methods via quantized low rank fine-tuning. Innovations of QLoRA include a 4-bit data type, NormalFloat4, which optimizes information efficiency for normally distributed data (e.g. weights) based on information theory. Apart from that, the paper uses Paged Optimizers (partial optimizer state stored on CPU rather than GPU) to manage memory spikes, such as when processing mini batches with long sequence lengths. Experiment results show that fine-tuning using QLoRA reaches 99.3% of the performance of ChatGPT (on the Vicuna benchmark), and only requires training for 24 hours on one GPU.

ControlNet (Zhang and Agrawala, 2023) is proposed as a method to efficiently fine-tune image generation models (diffusion models) on user-defined tasks. Because image generation models in general have a larger design space in terms of user interaction than language models, we list this method here to inspire the readers to consider more complex scenarios of controlling / customizing large foundation models’ outputs.

ControlNet keeps the weights of the original model frozen (like all methods mentioned above). The trainable branch consists of an exact copy of the frozen model, as well as two convolution layers called "zero convolutions", placed before and after the trainable copy. In the fine-tuning forward path, the activation from the trainable copy is combined with that of the frozen copy through the zero convolution. The so-called zero convolution is just a 1x1 convolution layer that is initialized with both weights and biases being zeros. The results show that in some tasks, ControlNets trained on a personal computer achieve results comparable to those of commercial models trained on terabytes of GPU memory and thousands of GPU hours.

4 A brief survey of “small” instruction-following LMs

Over the past few months, we have seen small LMs flourish. See Figure 1 for an evolution tree. This is a very fast progressing field, and it is challenging to even keep up with the latest progress. Quoting Tunguz (2023), “Trying to get ahead in AI these days feels like wrestling a rabid 5,000 lbs hippo covered in baby oil”.

4.1 Closed-source milestones

GPT-3 (Brown et al., 2020) gained public attention when it was released in 2020. As reported by the New York Times, it “generates tweets, pens poetry, summarizes emails, answers trivia questions, translates languages and even writes its own computer programs” (Markoff, 2020). It shows that decent few-shot performance can be achieved without gradient update, and the unprecedented model scale (175B parameters) is a key ingredient for success.

Basic info                             Scale                                                     Openness
Time (MM/YY)  Model       Institute    # parameters  Training hardware cost  Training data size  L   I   TC  TD
06/20         GPT-3       OpenAI       175B          3.64k PT-days           300B tokens         P   P   P   ✓
02/23         LLaMA-7B    Meta         7B            82k GPU-hours           1.4T tokens         NC  ✓   ✗   ✓
02/23         LLaMA-13B   Meta         13B           135k GPU-hours          1.4T tokens         NC  ✓   ✗   ✓
04/23         Pythia-7B   Eleuther AI  7B            33.5k GPU-hours         300B tokens         C   ✓   ✓   ✓
04/23         Pythia-12B  Eleuther AI  12B           72k GPU-hours           300B tokens         C   ✓   ✓   ✓
Table 3: Comparison of recent base LMs. In the Openness section, L stands for License, I stands for Inference, TC stands for Training Codes, and TD stands for Training Data. In the License column, P stands for Proprietary, NC stands for Non-Commercial, and C stands for permissive for Commercial use.

Table 4: Comparison of recent instruction-following small LMs. The abbreviations of the column names follow Table 3.

[...] by incorporating dialogue data into the supervised fine-tuning and the RLHF stage. It acquired 1 million users in just 5 days and revolutionized the way people interact with modern AIs. As a proprietary product, although the web UI is free, the underlying model can only be accessed via a paid API.
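
To make the LoRA update from Section 3.2 concrete, the following is a minimal pure-Python sketch of h = W_0 x + BAx. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `lora_forward`, `trainable_fraction`) and the toy dimensions are hypothetical, biases and any scaling factor are omitted, and the initialisation (small random A, zero B) follows the convention described in Hu et al. (2021).

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x):
    """Compute h = W0 x + B (A x). W0 stays frozen; only A and B would be trained."""
    base = matvec(W0, x)               # frozen pretrained branch
    update = matvec(B, matvec(A, x))   # low-rank branch, rank r = len(A)
    return [b + u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, r):
    """LoRA parameters (B: d_out x r, A: r x d_in) relative to a full d_out x d_in matrix."""
    return (d_out * r + r * d_in) / (d_out * d_in)


# Toy sizes, chosen only for illustration (hypothetical values).
d_in, d_out, r = 8, 6, 2
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                              # zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

h = lora_forward(W0, A, B, x)
# Because B starts at zero, the adapted layer initially equals the frozen layer.
assert h == matvec(W0, x)
```

For a 4096 × 4096 weight matrix and rank 8, `trainable_fraction(4096, 4096, 8)` evaluates to roughly 0.4% of the full matrix, which is where the parameter savings come from.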
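
The bottleneck Adapter module described in Section 3.2 (project down, apply a non-linearity, project back up) can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the reference implementation: the function names and sizes are hypothetical, ReLU is an assumed choice of non-linearity, and the skip connection around the module follows the design in Houlsby et al. (2019).

```python
def linear(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: project d -> m, apply ReLU, project m -> d, add skip connection."""
    z = [max(0.0, u) for u in linear(W_down, b_down, h)]
    delta = linear(W_up, b_up, z)
    return [h_i + d_i for h_i, d_i in zip(h, delta)]


def adapter_param_count(d, m):
    """Weights and biases of both projections: (d*m + m) + (m*d + d)."""
    return 2 * d * m + d + m


# Toy example (hypothetical sizes): hidden size d = 4, bottleneck size m = 2.
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
b_down = [0.0, 0.0]
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [0.0, 0.0],
        [0.0, 0.0]]
b_up = [0.0, 0.0, 0.0, 0.0]

out = adapter(h, W_down, b_down, W_up, b_up)
```

Only the adapter's own weights would be trained per task; with an assumed hidden size of 768 and bottleneck size of 64, `adapter_param_count(768, 64)` gives 99,136 parameters per module, small relative to the frozen layers it is attached to.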
