Chapter 2. Transformers

A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the authors’ raw and
unedited content as they write—so you can take advantage of these technologies long before
the official release of these titles.
This will be the third chapter of the final book. Please note that the GitHub repo will be made
active later on.
If you have comments about how we might improve the content and/or examples in this
book, or if you notice missing material within this chapter, please reach out to the editor at
jleonard@oreilly.com.
Many trace the most recent wave of advances in generative AI to the introduction of a class
of models called transformers in 2017. Their most well-known application is the powerful
Large Language Models (LLMs), such as Llama and GPT-4, used by hundreds of millions of people for
everything from chatbots and search systems to machine translation and content
summarization. They’ve even branched out beyond text, making waves in fields like
computer vision, music generation, and protein folding. In this chapter, we’ll explore the core
ideas behind transformers and how they work, with a focus on one of their most common applications: generating text.
Before we delve into the nitty-gritty of transformers, let’s take a step back and understand
what language modeling is. At its core, a Language Model (LM) is a probabilistic model that
learns to predict the next word (or token) in a sequence based on the preceding or
surrounding words. Doing so captures language’s underlying structure and patterns, allowing
it to generate realistic and coherent text. For example, given the sentence "I began my
day eating", a language model might predict the next word as "breakfast" with a
high probability.
So, how do transformers fit into this picture? Unlike traditional language models that use
fixed-sized sliding windows or recurrent neural networks (RNNs), transformers are designed
to handle long-range dependencies and complex relationships between words more efficiently
and expressively. For example, imagine that you want to use an LM to summarize a news
article, which might contain hundreds or even thousands of words. Traditional LMs struggle
with long contexts, so the summary might skip critical details from the beginning of the
article. Transformer-based LMs, however, show strong results in this task. Besides handling long contexts well, transformers bring other advantages in terms of ease
of training, scalability, and knowledge transfer, making them popular and well-suited for
multiple tasks. At the heart of this innovation lies the self-attention mechanism, which allows
the model to weigh the importance of each word in the context of the entire sequence.
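To get a feel for what "weighing the importance of each word" means, here is a tiny, self-contained sketch of scaled dot-product attention with made-up numbers. It illustrates the idea only; it is not the actual GPT-2 implementation.

import torch

# Three "tokens", each represented by a 4-dimensional vector (random toy data)
x = torch.randn(3, 4)

# Queries, keys, and values are linear projections of the inputs
w_q, w_k, w_v = (torch.randn(4, 4) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Each row shows how much one token attends to every token in the sequence
weights = torch.softmax(q @ k.T / (4 ** 0.5), dim=-1)
print(weights)

# The output for each token is a weighted mix of all the value vectors
print(weights @ v)

In a trained transformer, the projection matrices are learned so that these weights capture useful relationships between tokens.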
To help us build intuition about how language models work, we’ll use code examples that
interact with existing models, and we’ll describe the relevant pieces as we find them. Let’s
get to it.
A Language Model in Action
In this section, we will load and interact with an existing (pre-trained) transformer model to
get a high-level understanding of how they work. We’ll use the GPT-2 model, which made
headlines in 2019 for its (then) impressive text-generation capabilities. Although small and
almost quaint by today’s standards, GPT-2 is nevertheless a good illustration of how these
language models work. The same principles apply to the larger (over 100 times larger!) and more recent models that have succeeded it.
Tokenizing Text
Let’s begin our journey to generate some text based on an initial input. For example, given
the phrase "It was a dark and stormy", we want the model to generate some words
to continue it. Models can’t receive text directly as input; their input must be data represented
as numbers. To feed text into a model, we must first find a way to turn sequences into
numbers. This process is called tokenization, a crucial step in any NLP pipeline.
An easy option would be to split the text into individual characters and assign each a unique
numerical ID. This scheme could be helpful for languages such as Chinese, where each
character carries much information. In languages like English, this creates a very small token
vocabulary, and there will be very few unknown tokens (characters not found during training)
when running inference. However, this method requires many tokens to represent a string,
which is bad for performance and erases some of the structure and meaning of the text – a
downside for accuracy. Each character carries very little information, making it hard for the model to learn meaningful representations of the text.
Another approach could be to split the text into individual words. While this lets us capture
more meaning per token, it has the downsides that we need to deal with more unknown words
(e.g., typos, slang, etc.), we need to deal with different forms of the same word (e.g., "run",
"runs", "running", etc.), and we might end up with a very large vocabulary, which could
easily be over half a million words for languages such as English. Modern tokenization
strategies strike a balance between these two extremes, splitting the text into subwords that
capture both the structure and meaning of the text while still being able to handle unknown words.
Characters that are usually found together (like most frequent words) can be assigned a single
token that represents the whole word or group. Long or complicated words, or words with
many inflections, may be split into multiple tokens, where each one usually represents a
meaningful section of the word. There is no single "best" tokenizer; each language model
comes with its own one. The differences between tokenizers reside in the number of tokens they need to encode a given text and in the vocabularies they learn. Let's begin by loading the tokenizer corresponding to GPT-2. Then, we’ll run the input text (also called prompt) through
the tokenizer to encode the string into numbers representing the tokens. We’ll use the
decode() method to convert each ID back into its corresponding token for demonstration
purposes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids

# Decode each token ID back into text to see how the string was split
for t in input_ids[0]:
    print(t, ":", tokenizer.decode(t))

tensor(1026) : It
tensor(373) : was
tensor(257) : a
tensor(3223) : dark
tensor(290) : and
tensor(6388) : storm
tensor(88) : y
As you can see, the tokenizer splits the input string into a series of tokens and assigns a
unique ID to each. Most words are represented by a single token, but "stormy" is
represented by two tokens: one for " storm" (including the space before the word) and one for
the suffix "y". This allows the model to learn that "stormy" is related to "storm" and that
the suffix "y" is often used to turn nouns into adjectives. With a vocabulary of around 50,000
tokens, the GPT-2 tokenizer can efficiently represent almost any input text, and on average it needs only slightly more than one token per English word.
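We can check this for ourselves with a quick, informal measurement on a sample sentence; any English text of your choosing works here:

text = "The quick brown fox jumps over the lazy dog near the riverbank."
n_words = len(text.split())
n_tokens = len(tokenizer(text).input_ids)
print(f"{n_words} words -> {n_tokens} tokens ({n_tokens / n_words:.2f} tokens per word)")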
NOTE
Even though we usually talk about training tokenizers, this has nothing to do with training a model. Tokenizer training is a
statistical process that identifies which subwords are the best to pick for a given dataset. How to
choose the subwords is a design decision of the tokenization algorithm. Therefore, tokenization
training is deterministic. We won’t dive into different tokenization strategies, but some of the most
popular subword approaches are Byte-level BPE, used in GPT-2, WordPiece, and SentencePiece.
Predicting Probabilities
GPT-2 was trained as a causal language model (also known as auto-regressive), which means
it was trained to predict the next token in a sequence given the preceding tokens. The
transformers library has high-level tools that enable us to use such a model to generate text or
perform other tasks quickly. It is helpful to understand how the model makes its predictions
by directly inspecting them on this language-modeling task. We begin by loading the model.
from transformers import AutoModelForCausalLM

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
NOTE
The transformers library supports hundreds of models and their corresponding tokenizers. Rather than learning the name of each tokenizer and model class, we will use AutoTokenizer and AutoModelFor*. For the automatic model, we need to specify the task for which we're using the model, such as the causal language modeling task. When using the automatic classes, transformers will pick an adequate default class based on the configuration of the model. For example, under the hood, AutoModelForCausalLM will load the GPT2LMHeadModel class for the gpt2 checkpoint.
If we feed the tokenized sentence from the previous section through the model, we get a
result back with 50,257 values for each token in the input string:
outputs = gpt2(input_ids)
outputs.logits.shape

torch.Size([1, 7, 50257])
The first dimension of the output is the number of batches (1 because we just ran a single
sequence through the model). The second dimension is the sequence length, or the number of
tokens in the input sequence (7 in our case). The third dimension is the vocabulary size. We
get a list of ~50 thousand numbers for each token in the original sequence. These are the raw
model outputs, or logits, that correspond to the tokens in the vocabulary. For every input
token, the model predicts how likely each token in the vocabulary is to continue the sequence
up to that point. With our example sentence, the model will predict logits for "It", "It
was", "It was a", and so on. Higher logit values mean the model considers the
corresponding token a more likely continuation of the sequence. The following table shows
the input sequences, the most likely token ID, and its corresponding token.
Input sequence             Most likely next token ID   Token
It                         318                         is
It was                     257                         a
...                        ...                         ...
It was a dark and stormy   1755                        night

Logits are the raw output of the model (a list of numbers such as [0.1, 0.2, 0.01, ...]). We can use the logits to select the most likely token to continue the sequence. However, we can also convert them into probabilities, as we'll do shortly.
Let's focus on the logits for the entire input sentence and see how to predict the next word of the sentence. We can find the index of the token with the highest value using the argmax() method:

final_logits = gpt2(input_ids).logits[0, -1]  # Logits for the token that would come next
final_logits.argmax()

tensor(1755)

1755 corresponds to the ID of the token the model considers most likely to follow the input string "It was a dark and stormy". Decoding this token, we can see what the model predicts:

tokenizer.decode(final_logits.argmax())

' night'
So " night" is the most likely token. This makes sense considering the beginning of the sentence we provided as input. The model learns how to pay attention to other tokens using a mechanism called self-attention, the fundamental building block of transformers. Intuitively, self-attention allows the model to identify how much each token contributes to the meaning of the sequence and, ultimately, to the prediction.
NOTE
Transformer models contain many of these attention layers, each one specializing in some aspect of the input. Contrary to heuristic systems, these aspects or features are learned during training instead of being engineered by hand.
Let’s now see which other tokens were potential candidates by selecting the top 10 values:
import torch

top10 = torch.topk(final_logits, 10)
for index in top10.indices:
    print(tokenizer.decode(index))
night
day
evening
morning
afternoon
summer
time
winter
weekend
We’ll need to convert logits into probabilities to see how confident the model is about each
prediction. We’d do that by comparing each value with all the other predicted values and
normalizing so all the numbers sum up to 1. That’s precisely what the softmax() operation
does. The following code uses softmax() to print out the top 10 most likely tokens and their corresponding probabilities:

top10 = torch.topk(final_logits.softmax(dim=0), 10)
for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<10} {value.item():.2%}")
night 46.18%
day 23.46%
evening 5.87%
morning 4.42%
afternoon 4.11%
summer 1.34%
time 1.33%
winter 1.22%
weekend 0.39%
, 0.38%
Before going further, we suggest experimenting with the code above. Here are some ideas for
you to try:
● Change a few words: Try changing the adjectives (e.g., "dark" and "stormy") in the input string and see how the model's predictions change. Is the predicted word still the same?
● Change the input string: Try different input strings and see how the model's predictions change. What happens if you write a grammatically incorrect sentence? How does the model handle it? Look at the probabilities of the top predictions.
Generating Text
Once we know how to get the model’s predictions for the next token in a sequence, it is easy
to generate text by repeatedly feeding the model’s predictions back into itself. We can call
gpt2(ids), generate a new token ID, add it to the list, and call the function again. To make this easier, transformers provides a generate() method that takes care of this loop for us:

output_ids = gpt2.generate(input_ids, max_new_tokens=20)
decoded_text = tokenizer.decode(output_ids[0])
print("Generated text:", decoded_text)

Generated text: It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the
When we ran the gpt2() forward method in the previous section, it returned a list of logits
for each token in the vocabulary (50257). Then, we had to calculate the probabilities and pick
the most likely token. generate() abstracts this logic away. It makes multiple forward
passes, predicts the next token repeatedly, and appends it to the input sequence.
generate() provides us with the token IDs of the final sequence, including both the input
and new tokens. Then, with the tokenizer decode() method, we can convert it back to text.
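To demystify what generate() is doing, here is a rough sketch of the greedy loop it abstracts away (the variable names are ours):

import torch

generated = input_ids
for _ in range(20):
    next_token_logits = gpt2(generated).logits[0, -1]
    next_id = next_token_logits.argmax().reshape(1, 1)
    # Append the chosen token and feed the extended sequence back into the model
    generated = torch.cat([generated, next_id], dim=-1)
print(tokenizer.decode(generated[0]))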
There are many possible strategies to perform generation. The one we just did, picking the most likely token at each step, is called greedy decoding. Although this approach is straightforward, it can produce repetitive and sometimes suboptimal text.
Greedy decoding can be problematic because it doesn’t consider the overall probability of a
sentence, focusing only on the immediate next word. For instance, given the starting word
Sky and the choices blue and rockets for the next word, greedy decoding might favor
Sky blue since blue initially seems more likely following Sky. However, this approach
might overlook a more coherent and probable overall sequence like Sky rockets soar.
Therefore, greedy decoding can sometimes miss out on the most likely overall sequence, settling for locally optimal choices instead.
Rather than committing to a single token at a time, techniques such as beam search explore multiple possible continuations of the sequence. The algorithm keeps the num_beams most likely hypotheses (or beams) during generation and, at the end, chooses the most likely one overall.
beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
If you run this, you'll notice that the output includes many repetitions of the same sequence. There are multiple parameters we can control to get better generations. Let's look at one example, which penalizes the model for repeating itself:

beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=1.2,
    max_new_tokens=38,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
It was a dark and stormy night.
This is much better. Which generation strategy to use? As often in Machine Learning… it
depends. Beam search works well when the desired length of the text is somewhat
predictable. This is the case for tasks such as summarization or translation but not for
open-ended generation, where the output length can vary greatly, leading to repetition.
Although we can penalize the model to avoid repeating itself, doing so can also make it perform worse. Also note that beam search will be slower than greedy search as it needs to
run inference for multiple beams simultaneously, which can be an issue for large models.
When we generate with greedy search and beam search, we push the model toward the most statistically likely text, which humans can, perhaps surprisingly, find dull and unnatural. An excellent paper about this counter-intuitive observation is The Curious Case of Neural Text
Degeneration. The authors conjecture that human language disfavors predictable words -
people optimize against stating the obvious. The paper proposes a method called nucleus
sampling.
With sampling, we pick the next token by sampling from the probability distribution of the
next tokens. This means that sampling is not a deterministic generation process. If the next
possible tokens are night (60%), day (35%), and apple (5%), rather than choosing night (with
greedy search), we will sample from the distribution. In other words, there will be a 5%
chance of picking "apple" even if it’s a low-probability token and leads to a nonsensical
generation. Sampling avoids creating repetitive text, hence leading to more diverse generations.
from transformers import set_seed

# Setting the seed ensures we get the same results every time we run this code
set_seed(70)

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=34,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy day until it broke down the big canvas on my ug
We can manipulate the probability distribution before we sample from it, making it flatter or sharper, with a parameter called temperature. A temperature higher than one will increase the randomness of the distribution, which we can use to encourage the generation of less probable tokens. A temperature between 0 and 1 will reduce the randomness, increasing the probability of the more likely tokens and avoiding predictions that might be too unexpected. A temperature of 0 moves all the probability to the most likely next token, which is equivalent to greedy decoding. Compare the effect of different temperature values in the following generations:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.4,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy night, and I was alone. I was in the middle o
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.001,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the clouds were falling. The wi
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=3.0,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
Well, the first generation is much more coherent than the other two. The second, which uses a very low temperature, is repetitive (similar to greedy decoding). Finally, the third sample, with a very high temperature, is close to random gibberish.
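Under the hood, temperature simply divides the logits before the softmax. Here is a small sketch of that effect, reusing the final_logits tensor from earlier (the helper function is ours, not part of transformers):

import torch

def sample_next_token(logits, temperature=1.0):
    # Lower temperatures sharpen the distribution around the top tokens;
    # higher temperatures flatten it, giving unlikely tokens a better chance.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

for t in [0.4, 1.0, 3.0]:
    samples = [sample_next_token(final_logits, t) for _ in range(1000)]
    share = sum(s == final_logits.argmax().item() for s in samples) / len(samples)
    print(f"temperature={t}: ' night' sampled {share:.0%} of the time")

Running this, you should see the share of " night" drop sharply as the temperature increases.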
One parameter you likely noticed is top_k. What is it? Top-K sampling is a simple sampling
approach in which only the K most likely next tokens are considered. For example, using
top_k=5, the generation method will first filter down to the five most likely tokens and redistribute the probability mass among them before sampling:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_k=10,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
:30 AM. It felt cold and rainy. I didn't know why I was here. There was no
Hmm…this could be better. An issue with Top-K Sampling is that the number of relevant
candidates in practice could vary greatly. If we define top_k=5, some distributions will still
include tokens with very low probability, while others will consist of only high-probability
tokens.
The final generation strategy we'll visit is Top-p sampling (also known as nucleus sampling). Rather than sampling from the K most likely words, we sample from the smallest set of most likely words whose cumulative probability exceeds a given threshold. With top_p=0.94, we first keep the most likely tokens until their cumulative probability reaches 0.94, then redistribute the probability mass among them and sample as usual. Let's see it in action:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_p=0.94,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
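To see what the "nucleus" looks like for our example distribution, we can count how many tokens survive the cumulative-probability filter. This is an illustration using the final_logits tensor from earlier, not how generate() is implemented internally:

import torch

probs = torch.softmax(final_logits, dim=-1)
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
nucleus_size = int((cumulative <= 0.94).sum()) + 1  # +1 so the threshold is actually crossed
print(f"top_p=0.94 keeps {nucleus_size} of {len(probs)} tokens")
print("Most likely candidates:", [tokenizer.decode(i) for i in sorted_ids[:5]])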
Both Top-K and Top-p are commonly used in practice, and they can even be combined to filter out low-probability words while keeping more control over the generation. The downside of these stochastic generation methods is that the generated text is not guaranteed to be coherent.
We’ve seen three different generation methods: greedy search, beam-search decoding, and
sampling (with temperature, Top-K, and Top-p providing further control). Those are lots of
approaches! If you want to further experiment with generation, here are some suggestions to
experiment with:
● Experiment with different parameter values. How does increasing the number of beams impact the quality of your generation? What happens if you reduce or increase the temperature?
● A newer method, contrastive search, can generate long, coherent output while avoiding repetition. It does this by balancing the probability predicted by the model with the similarity of the new token to the preceding context. This can be controlled with the penalty_alpha and top_k parameters of generate().
If all of this sounds too empirical, it’s because it is. Generation is an active area of research,
with new papers coming up with different proposals, such as more sophisticated filtering.
We'll briefly discuss these in the final chapter. No single rule works for all models, so it's always worth experimenting with a few strategies and inspecting the results.
Zero-Shot Generalization
Generating language is a fun and exciting application of transformers, but writing fake articles about unicorns2 is not the reason why they are so popular. To predict the next token well, these models must learn a fair amount about the world. We can take advantage of this to perform various tasks. For example, instead of training a model dedicated to translation, we can prompt a sufficiently powerful LM with an English sentence followed by a line such as "Translation:" and let it complete the text. (I typed this example with GitHub Copilot active, and it helpfully suggested a French sentence starting with "Le chat" as the continuation.) This is a nice illustration of how a language model can perform tasks it was not explicitly trained for. The more powerful the model, the more tasks it can perform without additional training. This flexibility makes transformers quite powerful and has made them so popular in recent years.
To see this in action for ourselves, let’s use GPT-2 as a classification model. Specifically,
we’ll classify movie reviews as positive or negative - a classic benchmark task in the NLP
field. We’ll use a zero-shot approach to make things interesting, which means we won’t
provide the model with any labeled data. Instead, we’ll prompt the model with the text of a
review and ask it to predict the sentiment. Let’s see how it does.
To do this, we’ll insert the review into a prompt template that provides context for the model
and helps it understand what we're asking it to do. After feeding the prompt through the model, we'll look at its prediction for the next token and see which of the two candidate tokens, ' positive' or ' negative', is assigned a higher value.

# Check the token IDs for the words ' positive' and ' negative'
tokenizer.encode(" positive"), tokenizer.encode(" negative")

([3967], [4633])
Once we have the IDs, we can now run inference with the model and see which token has a
higher probability:
def score(review):
    """Predict whether a review is positive or negative.

    We compare the logits the model assigns to the
    tokens ' positive' and ' negative' (note the space before the words).
    """
    # One possible prompt template; the exact wording can be tweaked
    prompt = f"""Question: Is the following review positive or negative about the movie?
Review: {review}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    final_logits = gpt2(input_ids).logits[0, -1]
    if final_logits[3967] > final_logits[4633]:  # ' positive' vs. ' negative'
        print("Positive")
    else:
        print("Negative")
Get the logits for each token in the vocabulary. Note that we're using gpt2() rather than gpt2.generate(), as gpt2() returns the logits for every token in the vocabulary, while generate() returns only the sampled token IDs.
Check if the logit for the ' positive' token is higher than the logit for the ' negative' token.
We can try out this zero-shot classifier on a few fake reviews to see how it does. For example, running score() on an unambiguously negative review, a glowing one, and another negative one prints:

Negative
Positive
Negative
In the supplementary material, you’ll find a dataset of labeled reviews and code to assess the
accuracy of this zero-shot approach. Can you tweak the prompt template to improve the
model’s performance? Can you think of other tasks that could be performed using a similar
approach?
The zero-shot capabilities of recent models have been a game-changer. As the models
improve, they can perform more tasks out-of-the-box, making them more accessible and
easier to use and reducing the need for specialized models for each task.
Few-Shot Generalization
Despite the release of ChatGPT and the quest for the perfect prompts, zero-shot generalization (or prompting) is not the only way to bend powerful language models to new tasks. With few-shot generalization, we provide the language model with a few examples of the task we want it to perform and then ask it to provide similar answers for us. Instead of training the model, we show some examples to influence generation by increasing the probability that the continuation text follows the same pattern. Adding a brief description of what the model should do, e.g., "Translate English to French", will help with higher-quality generations. This time, we'll use a more robust model: GPT-Neo 1.3B.
GPT-Neo is a family of transformer models from EleutherAI, a non-profit research lab. These
models outperform GPT-2 in many tasks and tend to do few-shot learning better. We’ll use
the variant with 1.3 billion parameters, small by today's standards but still quite powerful and about ten times larger than GPT-2, which has just 124 million parameters.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Four example pairs set the pattern; the sentences themselves are just illustrative
prompt = """\
Translate English to Spanish:

English: I do not speak Spanish.
Spanish: No hablo español.

English: See you later!
Spanish: ¡Hasta luego!

English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?

English: What rooms do you have available?
Spanish: ¿Qué habitaciones tiene disponibles?

English: How old are you?
Spanish:"""

inputs = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(
    inputs,
    do_sample=False,
    max_new_tokens=10,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
We state the task we want to achieve and provide four examples to set the context for the model, making this a 4-shot generalization task. Then, we ask the model to generate more text that follows the pattern and provides the requested translation. Some ideas to explore: change the number of examples, try other language pairs, or remove the task description and see how the quality of the generation changes.
NOTE
GPT-2, given its size and training process, is not very good at few-shot tasks, and it’s even worse at
zero-shot generalization. How is it possible that we managed to use it for sentiment classification in
our previous example? We cheated a bit: we didn't look at the text generated by the model; we just checked whether the probability for ' positive' was larger than the probability for ' negative'. Understanding how models work under the hood can unlock powerful applications even with small models.
GPT-2 is an example of a base model. Some base models in the style of GPT-2 have zero-shot
and few-shot capabilities that we can use at inference time. Another approach is to fine-tune
a model: we take the base model and keep training it a bit longer on domain or task-specific
data. We rarely need the extreme generalization capabilities showcased by the most powerful
models in the world; if you only want to solve a particular task, it will usually be cheaper and
better to fine-tune and deploy a smaller model specialized on a single task. It’s also important
to note that base models are not conversational; although you can write a very nice prompt
that will help make a chatbot with a base model, it’s often more convenient to fine-tune the
base model with conversational data, hence improving the conversational capabilities of the resulting model.
A Transformer Block
After our brief experiments using language models, we are ready to introduce an architecture
diagram for transformer-based language generation models. The high-level pieces involved
include:
● Tokenization. The input text is broken down into individual tokens (which can be
words and subwords). Each token has a corresponding ID used to index the token
embeddings.
● Embeddings. Each token ID is mapped to an embedding, a vector that captures (part of) the semantic meaning of each token. You can think of vectors as lists of numbers. During training, a model learns how to map each token to its corresponding embedding. The embedding will always be the same for each token, regardless of where it appears in the input.
● Positional encoding. The model also adds a set of vectors that encode the position of each token in the input sequence. This allows the model to differentiate between tokens based on their position in the sequence, which can be useful as the same token appearing in different places can carry different meanings.
● Transformer blocks. The core of the transformer model is the transformer block, which combines a self-attention layer with a small feed-forward network. The power of transformers comes from stacking multiple blocks, allowing the model to learn increasingly complex and abstract relationships between the input tokens. The blocks output contextual embeddings that capture the relationships between tokens in the input sequence. Unlike the input embeddings, which are fixed for each token, the contextual embeddings are updated at each layer of the transformer model based on the surrounding tokens.
● Prediction head. Finally, a task-dependent layer produces the final output. In the case of text generation, this involves a linear layer that maps the contextual embeddings to the vocabulary space, producing one logit per token in the vocabulary.
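We can see these pieces in the GPT-2 model we loaded earlier by inspecting its submodules (the attribute names below are specific to the Hugging Face GPT-2 implementation):

print(gpt2.transformer.wte)     # Token embeddings: 50257 tokens x 768 dimensions
print(gpt2.transformer.wpe)     # Positional embeddings: 1024 positions x 768 dimensions
print(len(gpt2.transformer.h))  # Number of stacked transformer blocks (12 for the small GPT-2)
print(gpt2.lm_head)             # Linear layer mapping 768 dimensions back to the 50257-token vocabulary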
Diving deep into how self-attention works or the internals of the transformer block is beyond the scope of this book, but the high-level picture above is helpful to grasp how these models work and how they can be applied to various tasks. These same building blocks appear across many tasks and domains, and you'll see them cropping up again and again –not only in the rest of this chapter but throughout the book.
Transformer Models Genealogy
Sequence-To-Sequence Tasks
The GPT-2 model we've been using is a decoder-only model, built from a single stack of transformer blocks that process an input sequence. This is a popular approach today, but the original transformer paper, Attention Is All You Need,3 used a more complicated architecture called the encoder-decoder architecture, which was designed for a machine translation task. The best results in machine translation at the time were achieved by recurrent neural
networks (RNNs), such as LSTM and GRU (don’t worry if you’re unfamiliar with them). The
paper demonstrated better results by focusing solely on the attention method and showed that
scalability and training were much easier. These factors –excellent performance, stable
training, and easy scalability– are why transformers took off and were adapted to multiple domains and tasks.
In encoder-decoder models, like the original transformer model described in the paper, one
stack of transformer blocks, called encoder, processes an input sequence into a set of rich
representations, which are then fed into another stack of transformer blocks, called decoder,
that decodes them into an output sequence. This approach to convert one sequence into a
different one is called sequence-to-sequence or seq2seq and is naturally well suited for tasks such as translation and summarization.
For example, you feed an English sentence through the encoder of a translation model, which
generates a rich embedding that captures the meaning of the input. Then, the decoder
generates the corresponding French sentence using this embedding. The generation happens
in the decoder one token at a time, as we saw when generating sequences earlier in the
chapter. However, the predictions for each successive token are informed not just by the
previous tokens in the sequence being generated but also by the output from the encoder.
The mechanism by which the output from the encoder side is incorporated into the decoder
stack is called cross-attention. It resembles self-attention, except that each token in the input
(the sequence being processed by the decoder) attends to the context from the encoder rather
than other tokens in its sequence. The cross-attention layers are interleaved with self-attention layers, allowing the decoder to use both the context within its own sequence and the information coming from the encoder.
After the transformer paper, existing sequence-to-sequence models, such as Marian NMT,
incorporated these techniques as a central part of their architecture. New models were
developed using these ideas. A notable one is BART (short for "Bidirectional and Auto-Regressive Transformers"), which during pre-training corrupts the input sequences and attempts to reconstruct them in the decoder output. Afterward, BART is
fine-tuned for other generation tasks, such as translation or summarization, leveraging the
rich sequence representations achieved during pre-training. Input corruption, by the way, is
one of the key ideas behind diffusion models, as we’ll see in Chapter 3.
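As a quick taste of an encoder-decoder model in action, here is a sketch using one of the Marian-based translation checkpoints available on the Hugging Face Hub (we picked Helsinki-NLP/opus-mt-en-fr as an example; many other language pairs exist):

from transformers import pipeline

# The pipeline wraps tokenization, the encoder and decoder forward passes, and decoding
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("It was a dark and stormy night."))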
Another popular encoder-decoder model is T5, which frames every NLP problem as a text-to-text task: no new layers or code are required for different tasks, training uses the same hyperparameters, and the same model can be fine-tuned for translation, summarization, classification, and more. You might wonder why one might need an encoder-decoder model for tasks like translation if decoder-only models like GPT-2 can show good results. Encoder-decoder models are designed to translate
an entire input sequence to an output sequence, making them well-suited for translation. In
contrast, decoder-only models focus on predicting the next token in a sequence. Initially,
decoder-only models like GPT-2 were less capable in zero-shot learning scenarios than more
recent models like GPT-3, but this was due to more than just the absence of an encoder. The
improvement in zero-shot capabilities in advanced models like GPT-3 is also due to larger
training data, better training techniques, and increased model sizes. While encoders in
seq2seq models play a crucial role in understanding the full context of input sequences,
advancements in decoder-only models have made them more effective and versatile, even for tasks traditionally handled by encoder-decoder models.
Encoder-only models
As we’ve seen, the original transformer model was based on an encoder-decoder architecture
that has been further explored in models such as BART or T5. In addition, the encoder or the
decoder can be trained and used independently, giving rise to distinct transformer families.
The first sections of this chapter explored decoder-only, or autoregressive models. These
models are specialized in text generation using the techniques we described and have shown impressive capabilities.
Encoder models, on the other hand, are specialized in obtaining rich representations from text
sequences and can be used for tasks such as classification or to prepare semantic embeddings
(usually a vector of a few hundred numbers) for a multitude of documents that can be used in
retrieval systems. The best-known transformer encoder model is probably BERT6, which
introduced the masked language model objective that was later picked up and further
explored by BART.
Causal language modeling predicts the next token given the previous ones - it’s what we did
with GPT-2. The model can only attend to the context on the left of a given token. A different
approach used in encoder models is called masked language modeling (MLM). Masked
language modeling, proposed in the famous BERT paper, pre-trains a model to learn to
"fill in the blanks“. Given an input text, we randomly mask some tokens, and the
model must predict the hidden tokens. Unlike causal language modeling, MLM uses both the
sequence at the masked token’s left and right, hence the B of "bidirectional" in BERT’s
name. This helps create strong representations of the given text. Under the hood, these
from transformers import pipeline

fill_masker = pipeline("fill-mask", model="bert-base-uncased")
fill_masker("...")  # A sentence of your choice containing the [MASK] token

[{'token': 9841,
  'token_str': 'dish',
  ...},
 {'score': 0.1290755718946457,
  'token': 8808,
  'token_str': 'cheese',
  ...},
 {'token': 6501,
  'token_str': 'milk',
  ...},
 {'score': 0.04112089052796364,
  'token': 4392,
  'token_str': 'drink',
  ...},
 {'token': 7852,
  'token_str': 'bread',
  ...}]
What happens under the hood? The encoder receives the input sequence and generates a
contextualized representation for each token. This representation is a vector of numbers that
captures the meaning of the token in the context of the entire sequence. The encoder is
usually followed by a task-specific layer that uses the representations to perform tasks such as classification, named entity recognition, or extractive question answering.

The last few years have seen an explosion in the number of new open and closed language models, such as GPT-4, Mistral, Falcon, Llama 2, Qwen, Yi, Claude, Bloom, PaLM, and hundreds more. Yann LeCun posted a delightful genealogy diagram on Twitter, taken from a survey paper,7 which shows transformers' rich and diverse family tree.
Having access to existing models is quite powerful. In the previous sections, we explored using GPT-2 and GPT-Neo to generate text and perform zero-shot classification. Transformer models have shown state-of-the-art performance across many other language tasks, such as text classification, machine translation, and answering questions based on an input context. Why have they been so successful? Let's go over a few of the key insights.
The first insight is the usage of the attention mechanism, as hinted in the chapter introduction.
Previous NLP methods, such as recurrent neural networks, struggled to handle long
sentences. Attention mechanisms allow the transformers model to attend to long sequences
and learn long-range relationships. In other words, transformers can estimate how relevant each token is to every other token in the sequence, no matter how far apart they are. The second insight is that the transformer architecture has an implementation optimized for parallelization, and research has shown that these models can scale to handle high-complexity and high-scale datasets. Although initially designed for text data, the transformer architecture is flexible enough to support other data types and modalities, as we'll see later in the chapter.
The third key insight is the ability to do pre-training and fine-tuning. Traditional approaches
to a task, such as movie review classification, were limited by the availability of labeled data.
A model would be trained from scratch on a large corpus of labeled examples, attempting to
predict the label from the input text directly. This approach is often referred to as supervised
learning. However, it has a significant drawback: it requires a large amount of labeled data to
train effectively. This is a problem because labeled data is expensive to obtain and
time-consuming to label. There might not even be any available data in many domains.
To address this, researchers began looking for a way to pre-train models on existing data that
could then be fine-tuned (or adjusted) for a specific task. This approach is known as transfer
learning and is the foundation of modern ML in many fields, such as Natural Language
Processing and Computer Vision. Initial works in NLP focused on finding domain-specific
corpora for the language model pre-training phase, but papers such as ULMFiT8 showed that
even pre-training on generic text such as Wikipedia could yield impressive results when the model was later fine-tuned on downstream tasks such as text classification or question answering. This set the stage for the rise of transformers, which turned out to be highly amenable to this approach: pre-training on large amounts of text produces models that can be adapted to a new target task, for which one would require much less labeled data. Before graduating
to NLP, transfer learning had already been very successful with the Convolutional Neural
Networks that form the backbone of modern Computer Vision. In this scenario, one first
trains a large model with a massive amount of labeled images in a classification task.
Through this process, the model learns common features that can be leveraged on a different
but related problem. For example, we can pre-train a model on thousands of classes and then fine-tune it on a smaller, specialized dataset with far fewer labeled images.
With transformers, things are taken further with self-supervised pre-training. We can pre-train
a model on large, unlabeled text data. How? Let’s think about causal models such as GPT.
The model predicts which is the next word. Well, we don’t need any labels to obtain training
data. Given a corpus of text, we can mask the tokens after a sequence and train the model to
learn to predict them. Like in the computer vision case, pre-training gives the model a
meaningful representation of the underlying text. We can then fine-tune the model to perform
another task, such as generating text in the style of our Tweets or a specific domain (e.g.,
your company chat). Given the model has already learned a representation of language,
fine-tuning will require much less data than if we trained from scratch.
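Here is a tiny sketch of why no manual labeling is needed for causal language modeling: the training targets are simply the same tokens shifted one position, using the tokenizer we loaded earlier.

text = "It was a dark and stormy night"
tokens = tokenizer(text).input_ids

# Each token's "label" is just the token that follows it in the text
for current, target in zip(tokens[:-1], tokens[1:]):
    print(f"{tokenizer.decode(current):>10} -> {tokenizer.decode(target)}")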
For many tasks, a rich representation of the input is more important than being able to predict
the next token. For example, if you want to fine-tune a model to predict the sentiment of a
movie review, masked language models would be more powerful. Models such as GPT-2 are
designed to optimize for text generation rather than for building powerful representations of
the text. On the other hand, models such as BERT are ideal for this task. As briefly mentioned
before, the last layer of an encoder model outputs a dense representation of the input
sequence, called embedding. This embedding can then be leveraged by adding a small,
simple network on top of the encoder and fine-tuning the model for the specific task. As a
concrete example, we can add a simple linear layer on top of the BERT encoder output to
predict the sentiment of a document. We can take this approach to tackle a wide range of
tasks:
● Token classification. Identifying whether each word in a sentence refers to an entity such as a person, location, or organization.
● Semantic search. The features generated by the encoder can be handy for building a search system. Given a database of documents, we can compute the semantic embeddings for each one. Then, at inference time, we can compare the input query's embedding with the documents' embeddings, hence identifying the most similar document in the database.9
● And many others, including text similarity, anomaly detection, and named entity linking.
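For example, an encoder that has already been fine-tuned for sentiment analysis (here we use the distilbert-base-uncased-finetuned-sst-2-english checkpoint mentioned in the challenges; any similar classifier works) can be used in a couple of lines:

from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
# Returns something like [{'label': 'NEGATIVE', 'score': ...}]
print(classifier("This movie was a complete waste of my time."))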
This classification model can analyze reviews and do the same as in the zero-shot
classification section. The challenge section of this chapter shows how to evaluate
classification models and compare a zero-shot setup with this fine-tuned model.
Transformers recap
● Encoder-based architectures, such as BERT, DistilBERT, and RoBERTa, are ideal for tasks that require understanding the entire input. These models output contextualized embeddings of the input, which can then be fed to a head for a specific task that relies on the semantic information (such as identifying entities in the text or classifying a document).
● Decoder-based architectures, such as GPT-2, Falcon, and Llama, are ideal for generating new text.
● Encoder-decoder architectures, or seq2seq, such as BART and T5, are great for
tasks that require generating new sentences based on a given input, such as
summarization or translation.
"Wait." - you might say - "I can do all of these tasks with ChatGPT or
Llama“. That’s true - given the vast (and growing) amount of training data, computing, and
training optimizations, the quality of generative models is significantly increasing, and the
zero-shot capabilities have improved considerably compared to a few years ago. Although
decoder-only models provide good results, the current consensus is that, provided the
resources, fine-tuning a model for your specific task and domain will work better than using
an out-of-the-box pre-trained model. For example, if you want to use a GPT model in
real-time in a game to generate character dialogs, it will usually perform better if you first
fine-tune it with similar data. If you want to use a model to extract different entities from
your dataset of chemistry papers, it might make sense first to fine-tune an encoder-based model. As a recap, encoder-decoder models use the encoder to map variable-length input sequences into an embedding, which summarizes the input information.
The decoder part of the model can then leverage the context for performing the generation.
Decoder-only models have gained interest in recent years thanks to their simplicity,
scalability, efficiency, and parallelization. The three types of models are widely used in the
industry depending on the task - no single golden model is used for everything.
With over half a million open models, you might wonder which one to use. Chapter 5 will
help you navigate this landscape, providing guidelines on how to choose the right model for
your task and requirements as well as how to fine-tune a model for your specific needs.
Limitations
At this point, you might wonder what the issues are with transformers. Let’s briefly go over
● Transformers are very large. Research has consistently shown that larger
models perform better. Although that’s quite exciting, it also brings concerns.
First, some of the most powerful models require dozens of millions of U.S. dollars
to train - just in computing power. That means that only a small set of institutions
can train very large base models, limiting the kind of research that institutions
without those resources can do. Second, using such amounts of computing power
can also have ecological implications - those millions of GPU hours are, of
course, powered by lots of electricity. Third, even if some of these models are
open-sourced, running them might require many GPUs. Chapter 5 will explore
some techniques to use these LLMs even if you don't have multiple GPUs at hand, but the hardware requirements of the largest models remain a frequent challenge.
● Sequential processing: If you recall the decoder section, we had to process all the
previous tokens for each new token. That means generating the 10,000th token in
a sequence will take considerably longer than generating the initial one. In
computer science terms, transformers have quadratic time complexity with respect to the input length (self-attention compares every token with every other token, so doubling the sequence length roughly quadruples the attention computation). This means that as the length of the input increases, the time required to process it grows quickly, making it hard to scale to very long documents or use these models in some real-time scenarios. While generation quality is excellent, this cost requires careful consideration and optimization when these models are used in production. That said, there has been a lot of research on making transformers more efficient for extremely long sequences.
● Fixed input size: Transformer models can handle a maximum number of tokens,
which depends on the base model. Some transformers can only handle 512 tokens,
while new techniques allow to scale to hundreds of thousands tokens. The number
of tokens the model can attend is called the context window. This is an essential
thing to look into when picking a pre-trained model. You cannot simply pass an entire book to a transformer and expect it to summarize it.
● Limited interpretability. It is often difficult to explain why a transformer produced a particular prediction, which is a challenge for interpretability.
All of the above are very active research areas - people have been exploring how to train and
run models with less computing power (e.g., QLoRA, which we’ll explore in Chapter 5),
make generation faster (e.g., flash attention and assisted generation), enable unconstrained
input sizes (e.g., RoPE and attention sinks), and interpret the attention mechanisms.
One big concern that requires diving into is the presence of biases in models. If the training
data used to pre-train transformers contains biases, the model can learn and perpetuate them.
This is a broader issue in machine learning but is also relevant to transformers. Let’s revisit
the fill-mask pipeline. Let's say we want to predict the most likely profession for a person. The results are very different if we use the word "man" vs. "woman" in the prompt.
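You can check this yourself with the fill_masker pipeline from before; the exact sentences below are our own, and you can try many variations:

print(fill_masker("This man works as a [MASK]."))
print(fill_masker("This woman works as a [MASK]."))

The two lists of suggested professions tend to differ noticeably, reflecting stereotypes present in the training data.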
Why does this happen? To enable pre-training, researchers usually require large amounts of
data, leading to scraping all the content they can find. This content might be of all kinds of
quality, including toxic content (which can be, to some extent, filtered out). The base model
might end up engraining and perpetuating these biases when being fine-tuned. Similar
concerns exist for conversational models, where the final model might generate toxic or biased content learned during pre-training.
Beyond Text
Transformers have been used for many tasks representing data as text. A clear example is
code generation – rather than training a language model with English data, we can use lots of
code, and, by the same principles we just learned, it will learn how to auto-complete code. Similarly, data that can be written out as text, such as the rows and formulas of a spreadsheet, can be modeled with the same techniques.
As transformer models have been so successful in the text domain, considerable interest has
sparked in other communities to adapt these techniques to other modalities. This has led to
Transformer models being used for tasks such as image recognition, segmentation, and object detection.
Convolutional Neural Networks have been widely used as the go-to state-of-the-art models
for most computer vision techniques. With the introduction of Vision Transformers (ViT)11,
there has been a switch in recent years to explore how to tackle vision tasks with attention
and transformer-based techniques. The original ViT splits an image into a grid of fixed-size patches, which are flattened and treated as tokens, much like the subwords of a sentence; hybrid variants don't discard CNNs entirely and instead use CNN feature maps, divided into tokens, as the input. Either way, the attention mechanism can then learn the relationships between patches in different places of the image.
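Using a pre-trained ViT is just as convenient as using a text model. Here is a sketch with the google/vit-base-patch16-224 checkpoint; the image path is a placeholder for any local image or URL:

from transformers import pipeline

vit_classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(vit_classifier("path/to/your_image.jpg"))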
Unfortunately, ViTs required more data (300 million images!) and compute than CNNs to get
good results. Further work has happened in recent years; for example, DeiT achieved competitive results with much less data thanks to augmentation and regularization techniques common in CNNs. DeiT also uses a distillation approach involving a "teacher" model (a CNN in this case). Other models such as DETR, SegFormer, and Swin Transformer have pushed the field further, supporting tasks such as object detection, semantic segmentation, and more.
As we’ll see in Chapter 9, transformer models can also be used for audio tasks, such as
transcribing audio or generating synthetic speech or music. Under the hood, the same
fundamental principles of pre-training and attention mechanisms persist, but each modality brings its own challenges and architectures. Applying transformers to some other domains is still exploratory, but there are some exciting early results:
● Graphs. Some examples of tasks that involve graph data are predicting the toxicity of molecules or the properties of physical systems.
● 3D data. Some data is inherently three-dimensional, such as LiDAR point clouds in autonomous driving or CT scans for organ segmentation.
● Multimodality. Transformers can also combine multiple types of data (such as text, images, and audio) together. This opens new possibilities, such as multimodal systems where you can speak, write, or provide pictures and have a single model process them. Another example is visual question answering, where a model can answer questions about provided images.
So far, we have relied on the library's built-in generation techniques. To better understand how generation works under the hood, it's time to implement it ourselves. We'll use the generate() method as a reference but implement the logic from scratch. We'll also explore using the generate() method to perform different decoding techniques.

Your goal is to fill in the code of the following function. Rather than use gpt2.generate(), the idea is to iteratively call gpt2(), passing the previous tokens as input. You have to pick the next token from the logits at each step and append it to the running sequence. A possible skeleton (the signature in the supplementary material may differ) is:

def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Generate a continuation of the prompt without calling model.generate()."""
    # Write your code here
Summary
Congratulations! You now have learned to load and use transformers for various tasks! You
also understood how transformers model sequence data such as text and how this property
lets them "learn" valuable representations that we can use to generate or classify new
sequences. As the scale of these models increases, so do their capabilities - to the point where
massive models with hundreds of billions of parameters can now perform many tasks without any task-specific training.
We can pick powerful existing pre-trained models and modify them for specific domains and
use cases thanks to fine-tuning. The trend towards larger and more capable models has
caused a shift in how people use them. Task-specific models are often out-competed by
general-purpose LLMs, and most people now interact with these models via APIs and hosted
solutions or directly via slick chat-based user interfaces. At the same time, thanks to the
release of large and powerful open-access models, such as Llama, there is a strong wave in
the researchers’ and practitioners’ ecosystems aiming to run high-quality models directly in
consumer computers, resulting in privacy-first solutions. This trend extends beyond
inference: novel training approaches that allow individuals to fine-tune these models without
many computational resources have emerged in recent years. Chapter 5 delves into both
Although we covered how transformers work and we’ll dive into their training, diving into
the internals of these models (for example, the math behind attention mechanisms) or how to
pre-train a model from scratch is outside the scope of this book. Luckily for us, there are excellent resources to go further:
● The transformers library documentation and its task guides are a great reference if you want to dive deeper into the internals of fine-tuning these models for specific use cases.
● Hugging Face has a free, open-source course that teaches how to solve different NLP tasks.
If you want to dive more into the GPT family of models, we suggest reviewing the following papers:
● "Improving Language Understanding by Generative Pre-Training", the original GPT paper, published in 2018 by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
● "Language Models are Unsupervised Multitask Learners", the GPT-2 paper, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. It introduced a model with 1.5 billion parameters pre-trained on a large corpus of web text called WebText. The paper also demonstrated that GPT-2 could perform well on various natural language tasks without task-specific training.
● "Language Models Are Few-Shot Learners", the GPT-3 paper, by Tom Brown and others. This paper shows that scaling up language models dramatically improves their ability to perform new language tasks from only a few examples or simple instructions.
Exercises
3. What happens if you use a tokenizer different from the one used with the model?
4. What is the effect of generating with a very low or a very high temperature?
5. How do Top-K and Top-p sampling change the diversity of the generation?
6. What are the differences between encoder-only, decoder-only, and encoder-decoder transformers?
Challenges
9. Use a pre-trained summarization model to generate summaries of a paragraph. How does it compare with the results of using a general-purpose model such as GPT-2 prompted to summarize?
10. In the zero-shot supplementary material, we calculate the confusion matrix for GPT-2 used as a zero-shot classifier. Repeat the evaluation using the distilbert-base-uncased-finetuned-sst-2-english encoder model and compare the results.
11. Let's build a FAQ system! Sentence transformers are powerful models that can determine how similar multiple texts are. While a transformer encoder usually outputs one contextualized embedding per token, sentence transformers output a single embedding for the whole input text, allowing us to determine whether two texts are similar based on the distance between their embeddings. Let's see an example using the sentence_transformers library.
from sentence_transformers import SentenceTransformer, util

sentences = ["...", "..."]  # Two texts to compare

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)
util.pytorch_cos_sim(embedding_1, embedding_2)
tensor([[0.6003]], device='cuda:0')
Write a list of five questions and answers about a topic. Your goal will be to build a system
that, given a new question, can give the user the most likely answer. How can we use
sentence transformers to solve this? The supplemental material contains the solution, but we encourage you to try it yourself first.
References
1. Brown, Tom B., et al. Language Models Are Few-Shot Learners. arXiv, 22 July 2020. arXiv.org, http://arxiv.org/abs/2005.14165
2. Devlin, Jacob, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv.org, http://arxiv.org/abs/1810.04805
3. Dosovitskiy, Alexey, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv.org, http://arxiv.org/abs/2010.11929
5. Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv.org, http://arxiv.org/abs/1904.09751
6. Howard, Jeremy, and Sebastian Ruder. Universal Language Model Fine-Tuning for Text Classification. arXiv.org, http://arxiv.org/abs/1801.06146
10. Raffel, Colin, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv.org, http://arxiv.org/abs/1910.10683
12. Vaswani, Ashish, et al. Attention Is All You Need. arXiv, 1 Aug. 2023. arXiv.org, http://arxiv.org/abs/1706.03762
13. Yang, Jingfeng, et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv.org, http://arxiv.org/abs/2304.13712
1 You can read more about contrastive search in this Hugging Face blog post (https://huggingface.co/blog/introducing-csearch).
2 The first example in the GPT-2 release blog post was famously a news story about unicorns
(https://openai.com/research/better-language-models).
3 Vaswani, Ashish, et al. "Attention Is All You Need"
4 Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension"
5 Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
6 Devlin, Jacob, et al. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding"
7 Yang, Jingfeng, et al. "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond"
8 Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-Tuning for Text Classification"
9 We will build a search system using semantic embeddings in the challenge section of this chapter.
10 DistilBERT is a smaller model that preserves 95% of the original BERT performance while having 40% fewer parameters. RoBERTa is a very powerful BERT-based model trained with more data and an improved training procedure.
11 Dosovitskiy, Alexey, et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"