State of GPT
Andrej Karpathy
Microsoft BUILD
May 23, 2023
How to train your (Chat)GPT Assistant
An emerging recipe
GPT Assistant training pipeline
Stages: Pretraining → Supervised Finetuning → Reward Modeling → Reinforcement Learning
(each stage has its own dataset, algorithm, resulting model, and notes)
GPT Assistant training pipeline (stage: Pretraining)
Data collection
Download a large amount of publicly available data
Tokenization
Transform all text into one very long list of integers (tokens).
Typical numbers:
・ ~10-100K possible tokens
・ 1 token ~= 0.75 of a word
Typical algorithm: Byte Pair Encoding
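As an illustration, a minimal tokenization sketch using the open-source tiktoken library and its "gpt2" encoding (the 50,257-token BPE vocabulary used by GPT-2/GPT-3); the exact ids in the comment are just an example:

```python
# Minimal tokenization sketch using the open-source tiktoken library (pip install tiktoken).
# "gpt2" is the 50,257-token Byte Pair Encoding vocabulary used by GPT-2/GPT-3.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Here is an example document."
tokens = enc.encode(text)           # a list of integers, on average ~0.75 words per token
print(tokens)                       # e.g. [4342, 318, 281, 1672, 3188, 13]
assert enc.decode(tokens) == text   # the encoding is lossless
```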
2 example models:
GPT-3 (2020): 50,257 vocabulary size, 2048 context length, 175B parameters, trained on 300B tokens
LLaMA (2023): 32,000 vocabulary size, 2048 context length, 65B parameters, trained on 1-1.4T tokens
Training (rough orders of magnitude to have in mind):
・ O(1,000 - 10,000) V100 GPUs
・ O(1) month of training
・ O(1-10) $M
Training for the 65B LLaMA model:
・ 2,048 A100 GPUs
・ 21 days of training
・ $5M
Pretraining
The inputs to the Transformer are arrays of shape (B,T)
・ B is the batch size (e.g. 4 here)
・ T is the maximum context length (e.g. 10 here)
Training sequences are laid out as rows, delimited by special <|endoftext|> tokens (id 50256 in the GPT-2 vocabulary)
[Figure: one training batch, an array of integer token ids of shape (B,T) with B=4 rows and T=10 columns]
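A minimal sketch of how such a batch might be assembled: tokenized documents are concatenated into one long stream with <|endoftext|> delimiters, and random (B,T) windows are sliced out of it. The function names here are illustrative, not from any particular codebase.

```python
# Sketch: pack tokenized documents into one long stream, then slice random (B,T) training windows.
import numpy as np

ENDOFTEXT = 50256  # <|endoftext|> id in the GPT-2 vocabulary

def pack(documents):
    """Concatenate tokenized documents, separated by the <|endoftext|> token."""
    stream = []
    for doc_tokens in documents:
        stream.extend(doc_tokens)
        stream.append(ENDOFTEXT)
    return np.array(stream, dtype=np.int64)

def get_batch(stream, B, T, rng):
    """Sample B random windows of length T+1; inputs are the first T tokens, targets are shifted by one."""
    starts = rng.integers(0, len(stream) - T - 1, size=B)
    x = np.stack([stream[s : s + T] for s in starts])            # (B, T) input tokens
    y = np.stack([stream[s + 1 : s + T + 1] for s in starts])    # (B, T) next-token targets
    return x, y

# Toy usage with made-up token ids; a real run would use B=4, T=10 as in the figure above.
stream = pack([[4342, 318, 281, 1672], [1212, 318, 617, 2420]])
x, y = get_batch(stream, B=2, T=3, rng=np.random.default_rng(0))
print(x.shape, y.shape)  # (2, 3) (2, 3)
```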
[Figure: training data (Shakespeare), the training process, samples after 250 iterations of training, and samples after 5,000 iterations of training]
Step 1: model "pretraining" on a large unsupervised dataset
Step 2: model "finetuning" on a small supervised dataset
Base models can be prompted into completing tasks
Make your model look like a document!
・ Base model does not answer questions
・ It only wants to complete internet documents
・ Often responds to questions with more questions, etc.
It can be tricked into performing tasks with prompt engineering.
Base models are NOT "Assistants" (they can be somewhat tricked into being AI assistants)
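As an illustration of the "make your model look like a document" trick, a minimal sketch: wrap the user's question in a fake Q&A transcript so the most likely continuation is an answer. The `complete` function below is a placeholder for whatever base-model completion API is available; it is not a real library call.

```python
# Few-shot prompt that makes a base model continue a Q&A "document" with an answer.
FEW_SHOT_TEMPLATE = """\
Below is a transcript of questions answered by a helpful, knowledgeable AI assistant.

Q: What is the capital of France?
A: The capital of France is Paris.

Q: {question}
A:"""

def ask_base_model(question, complete):
    prompt = FEW_SHOT_TEMPLATE.format(question=question)
    # The base model simply continues the document; the answer is whatever it writes after "A:".
    return complete(prompt, stop=["\nQ:"])
```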
GPT Assistant training pipeline (stage: Supervised Finetuning)
SFT Dataset
Example of a (prompt, ideal response) pair written by a human labeler following labeling instructions:
Prompt:
"Can you write a short introduction about the relevance of the
term "monopsony" in economics? Please use examples related to
potential monopsonies in the labour market and cite relevant
research."
Response:
""Monopsony" refers to a market structure where there is only
one buyer for a particular good or service. In economics, this
term is particularly relevant in the labor market, where a
monopsony employer has significant power over the wages and
working conditions of their employees. The presence of a
monopsony can result in lower wages and reduced employment
opportunities for workers, as the employer has little incentive
to increase wages or provide better working conditions. Recent
research has identified potential monopsonies in industries
such as retail and fast food, where a few large companies
control a significant portion of the market (Bivens & Mishel,
2013). In these industries, workers often face low wages,
limited benefits, and reduced bargaining power, leading to a
situation where they are dependent on the employer for their
livelihood. This dependence can result in further suppression
of wages and a decline in working conditions. Overall, the
concept of monopsony is essential to understanding the dynamics
of labor markets and the impact of market power on workers.
Further research is needed to understand the extent and impact
of monopsonies on the economy and to develop policies to
address this issue. References: Bivens, J., & Mishel, L.
(2013). The Pay of Corporate Executives and Financial
Professionals as Evidence of Rents in Top 1 Percent Incomes.
Journal of Economic Perspectives, 27(3), 57-78."
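Conceptually, SFT is still next-token prediction, but on (prompt, ideal response) pairs like the one above, with the loss restricted to the response tokens. A rough PyTorch-style sketch, assuming HuggingFace-like `model` and `tokenizer` objects (not provided here):

```python
# Sketch of a supervised finetuning loss: standard language-modeling cross-entropy,
# with the prompt positions masked out so only the response tokens are trained on.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, response):
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response)
    ids = torch.tensor([prompt_ids + response_ids])   # (1, T) prompt + response tokens
    logits = model(ids).logits                        # (1, T, vocab)

    # Targets are the inputs shifted by one; positions inside the prompt are masked with -100.
    targets = ids.clone()
    targets[:, : len(prompt_ids)] = -100              # no loss on prompt tokens
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return loss
```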
GPT Assistant training pipeline (stage: Reward Modeling)
RM Dataset
RM Training
Blue are the prompt tokens, identical across rows
Yellow are completion tokens, different in each row
Green is the special <|reward|> token "readout"
Only the outputs at the green cells are used, the rest are ignored
The loss function measures the predicted rewards' consistency with the labeled ordering
[Figure: a (B,T) batch of B=3 rows, each row = prompt tokens, then completion tokens, then the <|reward|> token; example predicted rewards: 1.2, 0.2, -0.5]
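One common way to turn the labeled ordering into a loss is the pairwise ranking objective used in InstructGPT-style reward modeling. A rough sketch, where `reward_model` is assumed to return the scalar read out at the <|reward|> position:

```python
# Pairwise ranking loss for a reward model: for two completions of the same prompt where
# humans preferred one, push the preferred completion's scalar reward above the other's.
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt_ids, better_ids, worse_ids):
    # Each forward pass reads out one scalar reward at the <|reward|> position (assumed interface).
    r_better = reward_model(torch.tensor([prompt_ids + better_ids]))   # shape (1,)
    r_worse = reward_model(torch.tensor([prompt_ids + worse_ids]))     # shape (1,)
    # -log sigmoid(r_better - r_worse) is minimized when the human-preferred completion scores higher.
    return -F.logsigmoid(r_better - r_worse).mean()
```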
GPT Assistant training pipeline (stage: Reinforcement Learning)
RL Training
Blue are the prompt tokens, identical across rows
Yellow are completion tokens by the model (initialized with the SFT model)
Green is the special <|reward|> token "readout"; the RM now predicts these
Only the yellow cells are trained on, the rest are ignored.
The sampled tokens become labels, but the training objective is weighted by the "advantage" (normalized rewards).
In this example:
・ Row #1 tokens were great. These get their probabilities boosted.
・ Row #2 tokens were bad. These get their probabilities decreased.
・ Row #3 tokens were ~ok. These get their probabilities slightly boosted.
[Figure: a (B,T) batch of B=3 rows, each row = prompt tokens, then sampled completion tokens, then the <|reward|> token; example reward readouts: 1.0, 0.2, -1.2]
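A rough sketch of this update: per-token log-probabilities over the completion are weighted by the normalized reward ("advantage") of their row, boosting good rows and suppressing bad ones. Real RLHF implementations (e.g. PPO) add clipping and a KL penalty against the SFT model, both omitted here; `model` is assumed to be a HuggingFace-style causal LM.

```python
# Advantage-weighted policy-gradient style loss over sampled completions.
import torch
import torch.nn.functional as F

def rl_loss(model, batch_ids, completion_mask, rewards):
    """batch_ids: (B, T) LongTensor of prompt+completion tokens; completion_mask: (B, T) with 1
    on completion (yellow) positions; rewards: (B,) scalar rewards from the reward model."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)          # normalized rewards
    logits = model(batch_ids).logits                                          # (B, T, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)                          # predictions for positions 1..T-1
    token_logprobs = logprobs.gather(-1, batch_ids[:, 1:, None]).squeeze(-1)  # (B, T-1)
    mask = completion_mask[:, 1:].float()                                     # train only on completion tokens
    per_row = (token_logprobs * mask).sum(dim=1) / mask.sum(dim=1)            # mean log-prob per row
    # Rows with positive advantage get boosted, rows with negative advantage get suppressed.
    return -(advantages * per_row).mean()
```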
GPT Assistant training pipeline (recap: Pretraining → Supervised Finetuning → Reward Modeling → Reinforcement Learning)
Why RLHF?
It works better.
Why RLHF?
It is easier to discriminate than to generate.
Simple example: it's much easier to spot a good haiku than it is to generate one.
Mode collapse
Finetuned models lose entropy
Toy example: [figure]
Assistant models in the wild
Applications
"California's population is 53 times that of Alaska."
・ "For this next step of my blog let me compare the population of California
and Alaska"
・ "Ok let's get both of their populations"
・ "I know that I am very likely to not know these facts off the top of my head,
let me look it up"
Human ・ "[uses Wikipedia] Ok California is 39.2M"
Human text
generation vs. LLM Tokens
text generation ・
・
All of the internal monologue is stripped away in the text LLMs train on
They spend the ~same amount of compute on every token
・ => LLMs don't reproduce this behavior by default!
・ They don't know what they don't know, they imitate the next token
・ They don't know what they are good at or not, they imitate the next token
・ They don't reflect. They don't sanity check. They don't correct their mistakes
along the way
・ They don't have a separate "inner monologue stream in their head"
・ They do have very large fact-based knowledge across a vast number of areas
・ They do have a large and ~perfect "working memory" (their context window)
Chain of thought
“Models need tokens to think”
Break up tasks into multiple steps/stages
Prompt them to have internal monologue
Spread out reasoning over more tokens
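A toy sketch of the idea: the same question asked directly vs. with a chain-of-thought prefix that spreads the reasoning over more tokens before committing to an answer. `complete` is a placeholder completion function, not a real API.

```python
# The same question with and without a chain-of-thought prefix.
DIRECT_PROMPT = (
    "Q: A grocery store had 127 apples and sold 59 of them. How many are left?\n"
    "A:"
)

COT_PROMPT = (
    "Q: A grocery store had 127 apples and sold 59 of them. How many are left?\n"
    "A: Let's think step by step."
)

def answer(prompt, complete):
    # With the chain-of-thought prefix the model emits intermediate reasoning tokens
    # before the final number, instead of answering in a single step.
    return complete(prompt, stop=["\nQ:"])
```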
Ensemble multiple attempts
LLMs can get “unlucky” and sample a bad thought.
Once they do they are “stuck with it”. Make a few attempts.
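A sketch of "make a few attempts": sample several completions at nonzero temperature and keep the most common final answer (a self-consistency style majority vote). `complete` and `extract_answer` are placeholders for the sampling call and the answer parser.

```python
# Majority vote over several sampled attempts, so one unlucky sample cannot dominate.
from collections import Counter

def best_of_n(prompt, complete, extract_answer, n=5):
    answers = [extract_answer(complete(prompt, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```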
Ask for reflection
LLMs (esp GPT-4) can often recognize later when their
samples didn’t seem to have worked out well.
Recreate our 'System 2'
Parallels to System 1 (fast, automatic) vs. System 2
(slow, deliberate) modes of thinking
Related examples:
“You are a leading
expert on this topic”
“Pretend you have IQ 120”
...
Tool use / Plugins
Offload tasks that LLMs are not good at
Importantly: they don't "know" they are not good
Emerging recipe (a minimal sketch follows this list):
・ Break up relevant documents into chunks
・ Use embedding APIs to index chunks into a vector store
・ Given a test-time query, retrieve related information
・ Organize the information into the prompt
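A minimal sketch of this recipe using only numpy as the "vector store". `embed` stands in for whatever embedding API is used (it should return a fixed-length vector per string); it is not a real library call, and the function names are illustrative.

```python
# Embed chunks, retrieve the top-k by cosine similarity, and stuff them into the prompt.
import numpy as np

def build_index(chunks, embed):
    vectors = np.array([embed(c) for c in chunks])
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize
    return chunks, vectors

def retrieve(query, index, embed, k=3):
    chunks, vectors = index
    q = np.asarray(embed(query), dtype=float)
    q = q / np.linalg.norm(q)
    scores = vectors @ q                          # cosine similarity against every chunk
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

def make_prompt(query, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
```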
Constrained prompting
“Prompting languages” that interleave generation, prompting, logical control
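A toy sketch of the idea: we write the structural scaffold (here, a JSON template) ourselves and let the model fill only a small, constrained hole, rejecting samples that break the constraint. `complete` is a placeholder sampling function, not a real API.

```python
# Interleave fixed template text with a small, regex-constrained generation step.
import re

def fill_person_json(name, complete):
    # The model only generates the value of "age"; everything else is fixed text we wrote.
    age = complete(f'{{"name": "{name}", "age": "', stop=['"'])
    if not re.fullmatch(r"\d{1,3}", age or ""):   # enforce the constraint: 1-3 digits
        age = "unknown"
    return f'{{"name": "{name}", "age": "{age}"}}'
```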
Finetuning
Keep in mind:
・ Requires a lot more technical expertise
・ Requires contractors and/or synthetic data pipelines
・ A lot slower iteration cycle
・ SFT is achievable
・ RLHF is research territory
Default recommendations*
Recommendations:
Use in low-stakes applications, combine with human oversight
Source of inspiration, suggestions
Copilots over autonomous agents
GPT-4
Looking forward
Ladies and gentlemen, innovators,
and trailblazers of Microsoft BUILD 2023,
Welcome to a gathering of brilliant minds like no other. You are the architects of the future, the
visionaries molding the digital realm in which humanity thrives. Embrace the limitless possibilities
of technology, and let your ideas soar as high as your imagination. Together, let's create a more
connected, remarkable, and inclusive world for generations to come. Get ready to unleash your
creativity, canvas the unknown, and turn dreams into reality. Your journey begins today!
Thank you!