Chapter 2. Transformers

A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the authors’ raw and
unedited content as they write—so you can take advantage of these technologies long before
the official release of these titles.
This will be the third chapter of the final book. Please note that the GitHub repo will be made
active later on.
If you have comments about how we might improve the content and/or examples in this
book, or if you notice missing material within this chapter, please reach out to the editor at
jleonard@oreilly.com.
Many trace the most recent wave of advances in generative AI to the introduction of a class
of models called transformers in 2017. Their most well-known application is the powerful
Large Language Models (LLMs), such as Llama and GPT-4, used by hundreds of millions of people for
everything from chatbots and search systems to machine translation and content
summarization. They’ve even branched out beyond text, making waves in fields like
computer vision, music generation, and protein folding. In this chapter, we’ll explore the core
ideas behind transformers and how they work, with a focus on one of their most common applications: generating text.
Before we delve into the nitty-gritty of transformers, let’s take a step back and understand
what language modeling is. At its core, a Language Model (LM) is a probabilistic model that
learns to predict the next word (or token) in a sequence based on the preceding or
surrounding words. Doing so captures language’s underlying structure and patterns, allowing
it to generate realistic and coherent text. For example, given the sentence "I began my
day eating", a language model might predict the next word as "breakfast" with a
high probability.
So, how do transformers fit into this picture? Unlike traditional language models that use
fixed-sized sliding windows or recurrent neural networks (RNNs), transformers are designed
to handle long-range dependencies and complex relationships between words more efficiently
and expressively. For example, imagine that you want to use an LM to summarize a news
article, which might contain hundreds or even thousands of words. Traditional LMs struggle
with long contexts, so the summary might skip critical details from the beginning of the
article. Transformer-based LMs, however, show strong results in this task. Besides handling long contexts well, transformers bring other advantages in terms of ease
of training, scalability, and knowledge transfer, making them popular and well-suited for
multiple tasks. At the heart of this innovation lies the self-attention mechanism, which allows
the model to weigh the importance of each word in the context of the entire sequence.
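To get a feel for what "weighing the importance of each word" means, here is a tiny, self-contained sketch of scaled dot-product attention with made-up numbers. It illustrates the idea only; it is not the actual GPT-2 implementation.

import torch

# Three "tokens", each represented by a 4-dimensional vector (random toy data)
x = torch.randn(3, 4)

# Queries, keys, and values are linear projections of the inputs
w_q, w_k, w_v = (torch.randn(4, 4) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Each row shows how much one token attends to every token in the sequence
weights = torch.softmax(q @ k.T / (4 ** 0.5), dim=-1)
print(weights)

# The output for each token is a weighted mix of all the value vectors
print(weights @ v)

In a trained transformer, the projection matrices are learned so that these weights capture useful relationships between tokens.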
To help us build intuition about how language models work, we’ll use code examples that
interact with existing models, and we’ll describe the relevant pieces as we find them. Let’s
get to it.
A Language Model in Action
In this section, we will load and interact with an existing (pre-trained) transformer model to
get a high-level understanding of how they work. We’ll use the GPT-2 model, which made
headlines in 2019 for its (then) impressive text-generation capabilities. Although small and
almost quaint by today’s standards, GPT-2 is nevertheless a good illustration of how these
language models work. The same principles apply to the larger (over 100 times larger!) and more recent models that have succeeded it.
Tokenizing Text
Let’s begin our journey to generate some text based on an initial input. For example, given
the phrase "It was a dark and stormy", we want the model to generate some words
to continue it. Models can’t receive text directly as input; their input must be data represented
as numbers. To feed text into a model, we must first find a way to turn sequences into
numbers. This process is called tokenization, a crucial step in any NLP pipeline.
An easy option would be to split the text into individual characters and assign each a unique
numerical ID. This scheme could be helpful for languages such as Chinese, where each
character carries much information. In languages like English, this creates a very small token
vocabulary, and there will be very few unknown tokens (characters not found during training)
when running inference. However, this method requires many tokens to represent a string,
which is bad for performance and erases some of the structure and meaning of the text – a
downside for accuracy. Each character carries very little information, making it hard for the model to learn meaningful representations of the text.
Another approach could be to split the text into individual words. While this lets us capture
more meaning per token, it has the downsides that we need to deal with more unknown words
(e.g., typos, slang, etc.), we need to deal with different forms of the same word (e.g., "run",
"runs", "running", etc.), and we might end up with a very large vocabulary, which could
easily be over half a million words for languages such as English. Modern tokenization
strategies strike a balance between these two extremes, splitting the text into subwords that
capture both the structure and meaning of the text while still being able to handle unknown words.
Characters that are usually found together (like most frequent words) can be assigned a single
token that represents the whole word or group. Long or complicated words, or words with
many inflections, may be split into multiple tokens, where each one usually represents a
meaningful section of the word. There is no single "best" tokenizer; each language model
comes with its own one. The differences between tokenizers reside in the number of tokens they need to encode a given text and in the vocabularies they learn. Let's begin by loading the tokenizer corresponding to GPT-2. Then, we’ll run the input text (also called prompt) through
the tokenizer to encode the string into numbers representing the tokens. We’ll use the
decode() method to convert each ID back into its corresponding token for demonstration
purposes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids

# Decode each token ID back into text to see how the string was split
for t in input_ids[0]:
    print(t, ":", tokenizer.decode(t))

tensor(1026) : It
tensor(373) : was
tensor(257) : a
tensor(3223) : dark
tensor(290) : and
tensor(6388) : storm
tensor(88) : y
As you can see, the tokenizer splits the input string into a series of tokens and assigns a
unique ID to each. Most words are represented by a single token, but "stormy" is
represented by two tokens: one for " storm" (including the space before the word) and one for
the suffix "y". This allows the model to learn that "stormy" is related to "storm" and that
the suffix "y" is often used to turn nouns into adjectives. With a vocabulary of around 50,000
tokens, the GPT-2 tokenizer can efficiently represent almost any input text, and on average it needs only slightly more than one token per English word.
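We can check this for ourselves with a quick, informal measurement on a sample sentence; any English text of your choosing works here:

text = "The quick brown fox jumps over the lazy dog near the riverbank."
n_words = len(text.split())
n_tokens = len(tokenizer(text).input_ids)
print(f"{n_words} words -> {n_tokens} tokens ({n_tokens / n_words:.2f} tokens per word)")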
NOTE
Even though we usually talk about training tokenizers, this has nothing to do with training a model. Tokenizer training is a
statistical process that identifies which subwords are the best to pick for a given dataset. How to
choose the subwords is a design decision of the tokenization algorithm. Therefore, tokenization
training is deterministic. We won’t dive into different tokenization strategies, but some of the most
popular subword approaches are Byte-level BPE, used in GPT-2, WordPiece, and SentencePiece.
Predicting Probabilities
GPT-2 was trained as a causal language model (also known as auto-regressive), which means
it was trained to predict the next token in a sequence given the preceding tokens. The
transformers library has high-level tools that enable us to use such a model to generate text or
perform other tasks quickly. It is helpful to understand how the model makes its predictions
by directly inspecting them on this language-modeling task. We begin by loading the model.
from transformers import AutoModelForCausalLM

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
NOTE
The transformers library supports hundreds of models and their corresponding tokenizers. Rather than learning the name of each tokenizer and model class, we will use AutoTokenizer and AutoModelFor*. For the automatic model, we need to specify the task for which we're using the model, such as the causal language modeling task. When using the automatic classes, transformers will pick an adequate default class based on the configuration of the model. For example, under the hood, AutoModelForCausalLM will load the GPT2LMHeadModel class for the gpt2 checkpoint.
If we feed the tokenized sentence from the previous section through the model, we get a
result back with 50,257 values for each token in the input string:
outputs = gpt2(input_ids)
outputs.logits.shape

torch.Size([1, 7, 50257])
The first dimension of the output is the number of batches (1 because we just ran a single
sequence through the model). The second dimension is the sequence length, or the number of
tokens in the input sequence (7 in our case). The third dimension is the vocabulary size. We
get a list of ~50 thousand numbers for each token in the original sequence. These are the raw
model outputs, or logits, that correspond to the tokens in the vocabulary. For every input
token, the model predicts how likely each token in the vocabulary is to continue the sequence
up to that point. With our example sentence, the model will predict logits for "It", "It
was", "It was a", and so on. Higher logit values mean the model considers the
corresponding token a more likely continuation of the sequence. The following table shows
the input sequences, the most likely token ID, and its corresponding token.
Input sequence             Most likely next token ID   Token
It                         318                         is
It was                     257                         a
...                        ...                         ...
It was a dark and stormy   1755                        night

Logits are the raw output of the model (a list of numbers such as [0.1, 0.2, 0.01, ...]). We can use the logits to select the most likely token to continue the sequence. However, we can also convert them into probabilities, as we'll do shortly.
Let's focus on the logits for the entire input sentence and see how to predict the next word of the sentence. We can find the index of the token with the highest value using the argmax() method:

final_logits = gpt2(input_ids).logits[0, -1]  # Logits for the token that would come next
final_logits.argmax()

tensor(1755)

1755 corresponds to the ID of the token the model considers most likely to follow the input string "It was a dark and stormy". Decoding this token, we can see what the model predicts:

tokenizer.decode(final_logits.argmax())

' night'
So " night" is the most likely token. This makes sense considering the beginning of the sentence we provided as input. The model learns how to pay attention to other tokens using a mechanism called self-attention, the fundamental building block of transformers. Intuitively, self-attention allows the model to identify how much each token contributes to the meaning of the sequence and, ultimately, to the prediction.
NOTE
Transformer models contain many of these attention layers, each one specializing in some aspect of the input. Contrary to heuristic systems, these aspects or features are learned during training instead of being engineered by hand.
Let’s now see which other tokens were potential candidates by selecting the top 10 values:
import torch

top10 = torch.topk(final_logits, 10)
for index in top10.indices:
    print(tokenizer.decode(index))
night
day
evening
morning
afternoon
summer
time
winter
weekend
We’ll need to convert logits into probabilities to see how confident the model is about each
prediction. We’d do that by comparing each value with all the other predicted values and
normalizing so all the numbers sum up to 1. That’s precisely what the softmax() operation
does. The following code uses softmax() to print out the top 10 most likely tokens and their corresponding probabilities:

top10 = torch.topk(final_logits.softmax(dim=0), 10)
for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<10} {value.item():.2%}")
night 46.18%
day 23.46%
evening 5.87%
morning 4.42%
afternoon 4.11%
summer 1.34%
time 1.33%
winter 1.22%
weekend 0.39%
, 0.38%
Before going further, we suggest experimenting with the code above. Here are some ideas for
you to try:
● Change a few words: Try changing the adjectives (e.g., "dark" and "stormy") in the input string and see how the model's predictions change. Is the predicted word still the same?
● Change the input string: Try different input strings and see how the model's predictions change. What happens if you write a grammatically incorrect sentence? How does the model handle it? Look at the probabilities of the top predictions.
Generating Text
Once we know how to get the model’s predictions for the next token in a sequence, it is easy
to generate text by repeatedly feeding the model’s predictions back into itself. We can call
gpt2(ids), generate a new token ID, add it to the list, and call the function again. To make this easier, transformers provides a generate() method that takes care of this loop for us:

output_ids = gpt2.generate(input_ids, max_new_tokens=20)
decoded_text = tokenizer.decode(output_ids[0])
print("Generated text:", decoded_text)

Generated text: It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the
When we ran the gpt2() forward method in the previous section, it returned a list of logits
for each token in the vocabulary (50257). Then, we had to calculate the probabilities and pick
the most likely token. generate() abstracts this logic away. It makes multiple forward
passes, predicts the next token repeatedly, and appends it to the input sequence.
generate() provides us with the token IDs of the final sequence, including both the input
and new tokens. Then, with the tokenizer decode() method, we can convert it back to text.
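To demystify what generate() is doing, here is a rough sketch of the greedy loop it abstracts away (the variable names are ours):

import torch

generated = input_ids
for _ in range(20):
    next_token_logits = gpt2(generated).logits[0, -1]
    next_id = next_token_logits.argmax().reshape(1, 1)
    # Append the chosen token and feed the extended sequence back into the model
    generated = torch.cat([generated, next_id], dim=-1)
print(tokenizer.decode(generated[0]))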
There are many possible strategies to perform generation. The one we just did, picking the most likely token at each step, is called greedy decoding. Although this approach is straightforward, it can produce repetitive and sometimes suboptimal text.
Greedy decoding can be problematic because it doesn’t consider the overall probability of a
sentence, focusing only on the immediate next word. For instance, given the starting word
Sky and the choices blue and rockets for the next word, greedy decoding might favor
Sky blue since blue initially seems more likely following Sky. However, this approach
might overlook a more coherent and probable overall sequence like Sky rockets soar.
Therefore, greedy decoding can sometimes miss out on the most likely overall sequence, settling for locally optimal choices instead.
Rather than committing to a single token at a time, techniques such as beam search explore multiple possible continuations of the sequence. The algorithm keeps the num_beams most likely hypotheses (or beams) during generation and, at the end, chooses the most likely one overall.
beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
If you run this, you'll notice that the output includes many repetitions of the same sequence. There are multiple parameters we can control to get better generations. Let's look at one example, which penalizes the model for repeating itself:

beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=1.2,
    max_new_tokens=38,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
It was a dark and stormy night.
This is much better. Which generation strategy to use? As often in Machine Learning… it
depends. Beam search works well when the desired length of the text is somewhat
predictable. This is the case for tasks such as summarization or translation but not for
open-ended generation, where the output length can vary greatly, leading to repetition.
Although we can penalize the model to avoid repeating itself, doing so can also make it perform worse. Also note that beam search will be slower than greedy search as it needs to
run inference for multiple beams simultaneously, which can be an issue for large models.
When we generate with greedy search and beam search, we push the model toward the most statistically likely text, which humans can, perhaps surprisingly, find dull and unnatural. An excellent paper about this counter-intuitive observation is The Curious Case of Neural Text
Degeneration. The authors conjecture that human language disfavors predictable words -
people optimize against stating the obvious. The paper proposes a method called nucleus
sampling.
With sampling, we pick the next token by sampling from the probability distribution of the
next tokens. This means that sampling is not a deterministic generation process. If the next
possible tokens are night (60%), day (35%), and apple (5%), rather than choosing night (with
greedy search), we will sample from the distribution. In other words, there will be a 5%
chance of picking "apple" even if it’s a low-probability token and leads to a nonsensical
generation. Sampling avoids creating repetitive text, hence leading to more diverse generations.
from transformers import set_seed

# Setting the seed ensures we get the same results every time we run this code
set_seed(70)

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=34,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy day until it broke down the big canvas on my ug
We can manipulate the probability distribution before we sample from it, making it flatter or sharper, with a parameter called temperature. A temperature higher than one will increase the randomness of the distribution, which we can use to encourage the generation of less probable tokens. A temperature between 0 and 1 will reduce the randomness, increasing the probability of the more likely tokens and avoiding predictions that might be too unexpected. A temperature of 0 moves all the probability to the most likely next token, which is equivalent to greedy decoding. Compare the effect of different temperature values in the following generations:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.4,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy night, and I was alone. I was in the middle o
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.001,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the clouds were falling. The wi
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=3.0,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
Well, the first generation is much more coherent than the other two. The second, which uses a very low temperature, is repetitive (similar to greedy decoding). Finally, the third sample, with a very high temperature, is close to random gibberish.
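Under the hood, temperature simply divides the logits before the softmax. Here is a small sketch of that effect, reusing the final_logits tensor from earlier (the helper function is ours, not part of transformers):

import torch

def sample_next_token(logits, temperature=1.0):
    # Lower temperatures sharpen the distribution around the top tokens;
    # higher temperatures flatten it, giving unlikely tokens a better chance.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

for t in [0.4, 1.0, 3.0]:
    samples = [sample_next_token(final_logits, t) for _ in range(1000)]
    share = sum(s == final_logits.argmax().item() for s in samples) / len(samples)
    print(f"temperature={t}: ' night' sampled {share:.0%} of the time")

Running this, you should see the share of " night" drop sharply as the temperature increases.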
One parameter you likely noticed is top_k. What is it? Top-K sampling is a simple sampling
approach in which only the K most likely next tokens are considered. For example, using
top_k=5, the generation method will first filter down to the five most likely tokens and redistribute the probability mass among them before sampling:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_k=10,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
:30 AM. It felt cold and rainy. I didn't know why I was here. There was no
Hmm…this could be better. An issue with Top-K Sampling is that the number of relevant
candidates in practice could vary greatly. If we define top_k=5, some distributions will still
include tokens with very low probability, while others will consist of only high-probability
tokens.
The final generation strategy we'll visit is Top-p sampling (also known as nucleus sampling). Rather than sampling from the K most likely words, we sample from the smallest set of most likely words whose cumulative probability exceeds a given threshold. With top_p=0.94, we first keep the most likely tokens until their cumulative probability reaches 0.94, then redistribute the probability mass among them and sample as usual. Let's see it in action:
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_p=0.94,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))
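To see what the "nucleus" looks like for our example distribution, we can count how many tokens survive the cumulative-probability filter. This is an illustration using the final_logits tensor from earlier, not how generate() is implemented internally:

import torch

probs = torch.softmax(final_logits, dim=-1)
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
nucleus_size = int((cumulative <= 0.94).sum()) + 1  # +1 so the threshold is actually crossed
print(f"top_p=0.94 keeps {nucleus_size} of {len(probs)} tokens")
print("Most likely candidates:", [tokenizer.decode(i) for i in sorted_ids[:5]])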
Both Top-K and Top-p are commonly used in practice, and they can even be combined to filter out low-probability words while keeping more control over the generation. The downside of these stochastic generation methods is that the generated text is not guaranteed to be coherent.
We’ve seen three different generation methods: greedy search, beam-search decoding, and
sampling (with temperature, Top-K, and Top-p providing further control). Those are lots of
approaches! If you want to further experiment with generation, here are some suggestions to
experiment with:
● Experiment with different parameter values. How does increasing the number of beams impact the quality of your generation? What happens if you reduce or increase the temperature?
● A newer method, contrastive search, can generate long, coherent output while avoiding repetition. It does this by balancing the probability predicted by the model with the similarity of the new token to the preceding context. This can be controlled with the penalty_alpha and top_k parameters of generate().
If all of this sounds too empirical, it’s because it is. Generation is an active area of research,
with new papers coming up with different proposals, such as more sophisticated filtering.
We'll briefly discuss these in the final chapter. No single rule works for all models, so it's always worth experimenting with a few strategies and inspecting the results.
Zero-Shot Generalization
Generating language is a fun and exciting application of transformers, but writing fake articles about unicorns2 is not the reason why they are so popular. To predict the next token well, these models must learn a fair amount about the world. We can take advantage of this to perform various tasks. For example, instead of training a model dedicated to translation, we can prompt a sufficiently powerful LM with an English sentence followed by a line such as "Translation:" and let it complete the text. (I typed this example with GitHub Copilot active, and it helpfully suggested a French sentence starting with "Le chat" as the continuation.) This is a nice illustration of how a language model can perform tasks it was not explicitly trained for. The more powerful the model, the more tasks it can perform without additional training. This flexibility makes transformers quite powerful and has made them so popular in recent years.
To see this in action for ourselves, let’s use GPT-2 as a classification model. Specifically,
we’ll classify movie reviews as positive or negative - a classic benchmark task in the NLP
field. We’ll use a zero-shot approach to make things interesting, which means we won’t
provide the model with any labeled data. Instead, we’ll prompt the model with the text of a
review and ask it to predict the sentiment. Let’s see how it does.
To do this, we’ll insert the review into a prompt template that provides context for the model
and helps it understand what we're asking it to do. After feeding the prompt through the model, we'll look at its prediction for the next token and see which of the two candidate tokens, ' positive' or ' negative', is assigned a higher value.

# Check the token IDs for the words ' positive' and ' negative'
tokenizer.encode(" positive"), tokenizer.encode(" negative")

([3967], [4633])
Once we have the IDs, we can now run inference with the model and see which token has a
higher probability:
def score(review):
    """Predict whether a review is positive or negative.

    We compare the logits the model assigns to the
    tokens ' positive' and ' negative' (note the space before the words).
    """
    # One possible prompt template; the exact wording can be tweaked
    prompt = f"""Question: Is the following review positive or negative about the movie?
Review: {review}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    final_logits = gpt2(input_ids).logits[0, -1]
    if final_logits[3967] > final_logits[4633]:  # ' positive' vs. ' negative'
        print("Positive")
    else:
        print("Negative")
Get the logits for each token in the vocabulary. Note that we're using gpt2() rather than gpt2.generate(), as gpt2() returns the logits for every token in the vocabulary, while generate() returns only the sampled token IDs.
Check if the logit for the ' positive' token is higher than the logit for the ' negative' token.
We can try out this zero-shot classifier on a few fake reviews to see how it does. For example, running score() on an unambiguously negative review, a glowing one, and another negative one prints:

Negative
Positive
Negative
In the supplementary material, you’ll find a dataset of labeled reviews and code to assess the
accuracy of this zero-shot approach. Can you tweak the prompt template to improve the
model’s performance? Can you think of other tasks that could be performed using a similar
approach?
The zero-shot capabilities of recent models have been a game-changer. As the models
improve, they can perform more tasks out-of-the-box, making them more accessible and
easier to use and reducing the need for specialized models for each task.
Few-Shot Generalization
Despite the release of ChatGPT and the quest for the perfect prompts, zero-shot generalization (or prompting) is not the only way to bend powerful language models to new tasks. With few-shot generalization, we provide the language model with a few examples of the task we want it to perform and then ask it to provide similar answers for us. Instead of training the model, we show some examples to influence generation by increasing the probability that the continuation text follows the same pattern. Adding a brief description of what the model should do, e.g., "Translate English to French", will help with higher-quality generations. This time, we'll use a more robust model: GPT-Neo 1.3B.
GPT-Neo is a family of transformer models from EleutherAI, a non-profit research lab. These
models outperform GPT-2 in many tasks and tend to do few-shot learning better. We’ll use
the variant with 1.3 billion parameters, small by today's standards but still quite powerful and about ten times larger than GPT-2, which has just 124 million parameters.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Four example pairs set the pattern; the sentences themselves are just illustrative
prompt = """\
Translate English to Spanish:

English: I do not speak Spanish.
Spanish: No hablo español.

English: See you later!
Spanish: ¡Hasta luego!

English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?

English: What rooms do you have available?
Spanish: ¿Qué habitaciones tiene disponibles?

English: How old are you?
Spanish:"""

inputs = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(
    inputs,
    do_sample=False,
    max_new_tokens=10,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
We state the task we want to achieve and provide four examples to set the context for the model, making this a 4-shot generalization task. Then, we ask the model to generate more text that follows the pattern and provides the requested translation. Some ideas to explore: change the number of examples, try other language pairs, or remove the task description and see how the quality of the generation changes.
NOTE
GPT-2, given its size and training process, is not very good at few-shot tasks, and it’s even worse at
zero-shot generalization. How is it possible that we managed to use it for sentiment classification in
our previous example? We cheated a bit: we didn't look at the text generated by the model; we just checked whether the probability for ' positive' was larger than the probability for ' negative'. Understanding how models work under the hood can unlock powerful applications even with small models.
GPT-2 is an example of a base model. Some base models in the style of GPT-2 have zero-shot
and few-shot capabilities that we can use at inference time. Another approach is to fine-tune
a model: we take the base model and keep training it a bit longer on domain or task-specific
data. We rarely need the extreme generalization capabilities showcased by the most powerful
models in the world; if you only want to solve a particular task, it will usually be cheaper and
better to fine-tune and deploy a smaller model specialized on a single task. It’s also important
to note that base models are not conversational; although you can write a very nice prompt
that will help make a chatbot with a base model, it’s often more convenient to fine-tune the
base model with conversational data, hence improving the conversational capabilities of the resulting model.
A Transformer Block
After our brief experiments using language models, we are ready to introduce an architecture
diagram for transformer-based language generation models. The high-level pieces involved
include:
● Tokenization. The input text is broken down into individual tokens (which can be
words and subwords). Each token has a corresponding ID used to index the token
embeddings.
● Embeddings. Each token ID is mapped to an embedding, a vector that captures (part of) the semantic meaning of each token. You can think of vectors as lists of numbers. During training, a model learns how to map each token to its corresponding embedding. The embedding will always be the same for each token, regardless of where it appears in the input.
● Positional encoding. The model also adds a set of vectors that encode the position of each token in the input sequence. This allows the model to differentiate between tokens based on their position in the sequence, which can be useful as the same token appearing in different places can carry different meanings.
● Transformer blocks. The core of the transformer model is the transformer block, which combines a self-attention layer with a small feed-forward network. The power of transformers comes from stacking multiple blocks, allowing the model to learn increasingly complex and abstract relationships between the input tokens. The blocks output contextual embeddings that capture the relationships between tokens in the input sequence. Unlike the input embeddings, which are fixed for each token, the contextual embeddings are updated at each layer of the transformer model based on the surrounding tokens.
● Prediction head. Finally, a task-dependent layer produces the final output. In the case of text generation, this involves a linear layer that maps the contextual embeddings to the vocabulary space, producing one logit per token in the vocabulary.
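We can see these pieces in the GPT-2 model we loaded earlier by inspecting its submodules (the attribute names below are specific to the Hugging Face GPT-2 implementation):

print(gpt2.transformer.wte)     # Token embeddings: 50257 tokens x 768 dimensions
print(gpt2.transformer.wpe)     # Positional embeddings: 1024 positions x 768 dimensions
print(len(gpt2.transformer.h))  # Number of stacked transformer blocks (12 for the small GPT-2)
print(gpt2.lm_head)             # Linear layer mapping 768 dimensions back to the 50257-token vocabulary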
Diving deep into how self-attention works or the internals of the transformer block is beyond the scope of this book, but the high-level picture above is helpful to grasp how these models work and how they can be applied to various tasks. These same building blocks appear across many tasks and domains, and you'll see them cropping up again and again –not only in the rest of this chapter but throughout the book.
Transformer Models Genealogy
Sequence-To-Sequence Tasks
The GPT-2 model we've been using is a decoder-only model, built from a single stack of transformer blocks that process an input sequence. This is a popular approach today, but the original transformer paper, Attention Is All You Need,3 used a more complicated architecture called the encoder-decoder architecture, which was designed for a machine translation task. The best results in machine translation at the time were achieved by recurrent neural
networks (RNNs), such as LSTM and GRU (don’t worry if you’re unfamiliar with them). The
paper demonstrated better results by focusing solely on the attention method and showed that
scalability and training were much easier. These factors –excellent performance, stable
training, and easy scalability– are why transformers took off and were adapted to multiple domains and tasks.
In encoder-decoder models, like the original transformer model described in the paper, one
stack of transformer blocks, called encoder, processes an input sequence into a set of rich
representations, which are then fed into another stack of transformer blocks, called decoder,
that decodes them into an output sequence. This approach to convert one sequence into a
different one is called sequence-to-sequence or seq2seq and is naturally well suited for tasks such as translation and summarization.
For example, you feed an English sentence through the encoder of a translation model, which
generates a rich embedding that captures the meaning of the input. Then, the decoder
generates the corresponding French sentence using this embedding. The generation happens
in the decoder one token at a time, as we saw when generating sequences earlier in the
chapter. However, the predictions for each successive token are informed not just by the
previous tokens in the sequence being generated but also by the output from the encoder.
The mechanism by which the output from the encoder side is incorporated into the decoder
stack is called cross-attention. It resembles self-attention, except that each token in the input
(the sequence being processed by the decoder) attends to the context from the encoder rather
than other tokens in its sequence. The cross-attention layers are interleaved with self-attention layers, allowing the decoder to use both the context within its own sequence and the information coming from the encoder.
After the transformer paper, existing sequence-to-sequence models, such as Marian NMT,
incorporated these techniques as a central part of their architecture. New models were
developed using these ideas. A notable one is BART (short for "Bidirectional and Auto-Regressive Transformers"), which during pre-training corrupts the input sequences and attempts to reconstruct them in the decoder output. Afterward, BART is
fine-tuned for other generation tasks, such as translation or summarization, leveraging the
rich sequence representations achieved during pre-training. Input corruption, by the way, is
one of the key ideas behind diffusion models, as we’ll see in Chapter 3.
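As a quick taste of an encoder-decoder model in action, here is a sketch using one of the Marian-based translation checkpoints available on the Hugging Face Hub (we picked Helsinki-NLP/opus-mt-en-fr as an example; many other language pairs exist):

from transformers import pipeline

# The pipeline wraps tokenization, the encoder and decoder forward passes, and decoding
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("It was a dark and stormy night."))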
Another popular encoder-decoder model is T5, which frames every NLP problem as a text-to-text task: no new layers or code are required for different tasks, training uses the same hyperparameters, and the same model can be fine-tuned for translation, summarization, classification, and more. You might wonder why one might need an encoder-decoder model for tasks like translation if decoder-only models like GPT-2 can show good results. Encoder-decoder models are designed to translate
an entire input sequence to an output sequence, making them well-suited for translation. In
contrast, decoder-only models focus on predicting the next token in a sequence. Initially,
decoder-only models like GPT-2 were less capable in zero-shot learning scenarios than more
recent models like GPT-3, but this was due to more than just the absence of an encoder. The
improvement in zero-shot capabilities in advanced models like GPT-3 is also due to larger
training data, better training techniques, and increased model sizes. While encoders in
seq2seq models play a crucial role in understanding the full context of input sequences,
advancements in decoder-only models have made them more effective and versatile, even for tasks traditionally handled by encoder-decoder models.
Encoder-only models
As we’ve seen, the original transformer model was based on an encoder-decoder architecture
that has been further explored in models such as BART or T5. In addition, the encoder or the
decoder can be trained and used independently, giving rise to distinct transformer families.
The first sections of this chapter explored decoder-only, or autoregressive models. These
models are specialized in text generation using the techniques we described and have shown impressive capabilities.
Encoder models, on the other hand, are specialized in obtaining rich representations from text
sequences and can be used for tasks such as classification or to prepare semantic embeddings
(usually a vector of a few hundred numbers) for a multitude of documents that can be used in
retrieval systems. The best-known transformer encoder model is probably BERT6, which
introduced the masked language model objective that was later picked up and further
explored by BART.
Causal language modeling predicts the next token given the previous ones - it’s what we did
with GPT-2. The model can only attend to the context on the left of a given token. A different
approach used in encoder models is called masked language modeling (MLM). Masked
language modeling, proposed in the famous BERT paper, pre-trains a model to learn to
"fill in the blanks“. Given an input text, we randomly mask some tokens, and the
model must predict the hidden tokens. Unlike causal language modeling, MLM uses both the
sequence at the masked token’s left and right, hence the B of "bidirectional" in BERT’s
name. This helps create strong representations of the given text. Under the hood, these
from transformers import pipeline

fill_masker = pipeline("fill-mask", model="bert-base-uncased")
fill_masker("...")  # A sentence of your choice containing the [MASK] token

[{'token': 9841,
  'token_str': 'dish',
  ...},
 {'score': 0.1290755718946457,
  'token': 8808,
  'token_str': 'cheese',
  ...},
 {'token': 6501,
  'token_str': 'milk',
  ...},
 {'score': 0.04112089052796364,
  'token': 4392,
  'token_str': 'drink',
  ...},
 {'token': 7852,
  'token_str': 'bread',
  ...}]
What happens under the hood? The encoder receives the input sequence and generates a
contextualized representation for each token. This representation is a vector of numbers that
captures the meaning of the token in the context of the entire sequence. The encoder is
usually followed by a task-specific layer that uses the representations to perform tasks such as classification, named entity recognition, or extractive question answering.

The last few years have seen an explosion in the number of new open and closed language models, such as GPT-4, Mistral, Falcon, Llama 2, Qwen, Yi, Claude, Bloom, PaLM, and hundreds more. Yann LeCun posted a delightful genealogy diagram on Twitter, taken from a survey paper,7 which shows transformers' rich and diverse family tree.
Having access to existing models is quite powerful. In the previous sections, we explored using GPT-2 and GPT-Neo to generate text and perform zero-shot classification. Transformer models have shown state-of-the-art performance across many other language tasks, such as text classification, machine translation, and answering questions based on an input context. Why have they been so successful? Let's go over a few of the key insights.
The first insight is the usage of the attention mechanism, as hinted in the chapter introduction.
Previous NLP methods, such as recurrent neural networks, struggled to handle long
sentences. Attention mechanisms allow the transformers model to attend to long sequences
and learn long-range relationships. In other words, transformers can estimate how relevant each token is to every other token in the sequence, no matter how far apart they are. The second insight is that the transformer architecture has an implementation optimized for parallelization, and research has shown that these models can scale to handle high-complexity and high-scale datasets. Although initially designed for text data, the transformer architecture is flexible enough to support other data types and modalities, as we'll see later in the chapter.
The third key insight is the ability to do pre-training and fine-tuning. Traditional approaches
to a task, such as movie review classification, were limited by the availability of labeled data.
A model would be trained from scratch on a large corpus of labeled examples, attempting to
predict the label from the input text directly. This approach is often referred to as supervised
learning. However, it has a significant drawback: it requires a large amount of labeled data to
train effectively. This is a problem because labeled data is expensive to obtain and
time-consuming to label. There might not even be any available data in many domains.
To address this, researchers began looking for a way to pre-train models on existing data that
could then be fine-tuned (or adjusted) for a specific task. This approach is known as transfer
learning and is the foundation of modern ML in many fields, such as Natural Language
Processing and Computer Vision. Initial works in NLP focused on finding domain-specific
corpora for the language model pre-training phase, but papers such as ULMFiT8 showed that
even pre-training on generic text such as Wikipedia could yield impressive results when the model was later fine-tuned on downstream tasks such as text classification or question answering. This set the stage for the rise of transformers, which turned out to be highly amenable to this approach: pre-training on large amounts of text produces models that can be adapted to a new target task, for which one would require much less labeled data. Before graduating
to NLP, transfer learning had already been very successful with the Convolutional Neural
Networks that form the backbone of modern Computer Vision. In this scenario, one first
trains a large model with a massive amount of labeled images in a classification task.
Through this process, the model learns common features that can be leveraged on a different
but related problem. For example, we can pre-train a model on thousands of classes and then fine-tune it on a smaller, specialized dataset with far fewer labeled images.
With transformers, things are taken further with self-supervised pre-training. We can pre-train
a model on large, unlabeled text data. How? Let’s think about causal models such as GPT.
The model predicts which is the next word. Well, we don’t need any labels to obtain training
data. Given a corpus of text, we can mask the tokens after a sequence and train the model to
learn to predict them. Like in the computer vision case, pre-training gives the model a
meaningful representation of the underlying text. We can then fine-tune the model to perform
another task, such as generating text in the style of our Tweets or a specific domain (e.g.,
your company chat). Given the model has already learned a representation of language,
fine-tuning will require much less data than if we trained from scratch.
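Here is a tiny sketch of why no manual labeling is needed for causal language modeling: the training targets are simply the same tokens shifted one position, using the tokenizer we loaded earlier.

text = "It was a dark and stormy night"
tokens = tokenizer(text).input_ids

# Each token's "label" is just the token that follows it in the text
for current, target in zip(tokens[:-1], tokens[1:]):
    print(f"{tokenizer.decode(current):>10} -> {tokenizer.decode(target)}")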
For many tasks, a rich representation of the input is more important than being able to predict
the next token. For example, if you want to fine-tune a model to predict the sentiment of a
movie review, masked language models would be more powerful. Models such as GPT-2 are
designed to optimize for text generation rather than for building powerful representations of
the text. On the other hand, models such as BERT are ideal for this task. As briefly mentioned
before, the last layer of an encoder model outputs a dense representation of the input
sequence, called embedding. This embedding can then be leveraged by adding a small,
simple network on top of the encoder and fine-tuning the model for the specific task. As a
concrete example, we can add a simple linear layer on top of the BERT encoder output to
predict the sentiment of a document. We can take this approach to tackle a wide range of
tasks:
● Token classification. Identifying whether each word in a sentence refers to an entity such as a person, location, or organization.
● Semantic search. The features generated by the encoder can be handy for building a search system. Given a database of documents, we can compute the semantic embeddings for each one. Then, at inference time, we can compare the input query's embedding with the documents' embeddings, hence identifying the most similar document in the database.9
● And many others, including text similarity, anomaly detection, and named entity linking.
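For example, an encoder that has already been fine-tuned for sentiment analysis (here we use the distilbert-base-uncased-finetuned-sst-2-english checkpoint mentioned in the challenges; any similar classifier works) can be used in a couple of lines:

from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
# Returns something like [{'label': 'NEGATIVE', 'score': ...}]
print(classifier("This movie was a complete waste of my time."))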
This classification model can analyze reviews and do the same as in the zero-shot
classification section. The challenge section of this chapter shows how to evaluate
classification models and compare a zero-shot setup with this fine-tuned model.
Transformers recap
● Encoder-based architectures, such as BERT, DistilBERT, and RoBERTa, are ideal for tasks that require understanding the entire input. These models output contextualized embeddings of the input, which can then be fed to a head for a specific task that relies on the semantic information (such as identifying entities in the text or classifying a document).
● Decoder-based architectures, such as GPT-2, Falcon, and Llama, are ideal for generating new text.
● Encoder-decoder architectures, or seq2seq, such as BART and T5, are great for
tasks that require generating new sentences based on a given input, such as
summarization or translation.
"Wait." - you might say - "I can do all of these tasks with ChatGPT or
Llama“. That’s true - given the vast (and growing) amount of training data, computing, and
training optimizations, the quality of generative models is significantly increasing, and the
zero-shot capabilities have improved considerably compared to a few years ago. Although
decoder-only models provide good results, the current consensus is that, provided the
resources, fine-tuning a model for your specific task and domain will work better than using
an out-of-the-box pre-trained model. For example, if you want to use a GPT model in
real-time in a game to generate character dialogs, it will usually perform better if you first
fine-tune it with similar data. If you want to use a model to extract different entities from
your dataset of chemistry papers, it might make sense first to fine-tune an encoder-based model. As a recap, encoder-decoder models use the encoder to map variable-length input sequences into an embedding, which summarizes the input information.
The decoder part of the model can then leverage the context for performing the generation.
Decoder-only models have gained interest in recent years thanks to their simplicity,
scalability, efficiency, and parallelization. The three types of models are widely used in the
industry depending on the task - no single golden model is used for everything.
With over half a million open models, you might wonder which one to use. Chapter 5 will
help you navigate this landscape, providing guidelines on how to choose the right model for
your task and requirements as well as how to fine-tune a model for your specific needs.
Limitations
At this point, you might wonder what the issues are with transformers. Let’s briefly go over
● Transformers are very large. Research has consistently shown that larger
models perform better. Although that’s quite exciting, it also brings concerns.
First, some of the most powerful models require dozens of millions of U.S. dollars
to train - just in computing power. That means that only a small set of institutions
can train very large base models, limiting the kind of research that institutions
without those resources can do. Second, using such amounts of computing power
can also have ecological implications - those millions of GPU hours are, of
course, powered by lots of electricity. Third, even if some of these models are
open-sourced, running them might require many GPUs. Chapter 5 will explore
some techniques to use these LLMs even if you don't have multiple GPUs at hand, but the hardware requirements of the largest models remain a frequent challenge.
● Sequential processing: If you recall the decoder section, we had to process all the
previous tokens for each new token. That means generating the 10,000th token in
a sequence will take considerably longer than generating the initial one. In
computer science terms, transformers have quadratic time complexity with respect to the input length (self-attention compares every token with every other token, so doubling the sequence length roughly quadruples the attention computation). This means that as the length of the input increases, the time required to process it grows quickly, making it hard to scale to very long documents or use these models in some real-time scenarios. While generation quality is excellent, this cost requires careful consideration and optimization when these models are used in production. That said, there has been a lot of research on making transformers more efficient for extremely long sequences.
● Fixed input size: Transformer models can handle a maximum number of tokens,
which depends on the base model. Some transformers can only handle 512 tokens,
while new techniques allow to scale to hundreds of thousands tokens. The number
of tokens the model can attend is called the context window. This is an essential
thing to look into when picking a pre-trained model. You cannot simply pass an entire book to a transformer and expect it to summarize it.
● Limited interpretability. It is often difficult to explain why a transformer produced a particular prediction, which is a challenge for interpretability.
All of the above are very active research areas - people have been exploring how to train and
run models with less computing power (e.g., QLoRA, which we’ll explore in Chapter 5),
make generation faster (e.g., flash attention and assisted generation), enable unconstrained
input sizes (e.g., RoPE and attention sinks), and interpret the attention mechanisms.
One big concern that requires diving into is the presence of biases in models. If the training
data used to pre-train transformers contains biases, the model can learn and perpetuate them.
This is a broader issue in machine learning but is also relevant to transformers. Let’s revisit
the fill-mask pipeline. Let's say we want to predict the most likely profession for a person. The results are very different if we use the word "man" vs. "woman" in the prompt.
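You can check this yourself with the fill_masker pipeline from before; the exact sentences below are our own, and you can try many variations:

print(fill_masker("This man works as a [MASK]."))
print(fill_masker("This woman works as a [MASK]."))

The two lists of suggested professions tend to differ noticeably, reflecting stereotypes present in the training data.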
Why does this happen? To enable pre-training, researchers usually require large amounts of
data, leading to scraping all the content they can find. This content might be of all kinds of
quality, including toxic content (which can be, to some extent, filtered out). The base model
might end up engraining and perpetuating these biases when being fine-tuned. Similar
concerns exist for conversational models, where the final model might generate toxic or biased content learned during pre-training.
Beyond Text
Transformers have been used for many tasks representing data as text. A clear example is
code generation – rather than training a language model with English data, we can use lots of
code, and, by the same principles we just learned, it will learn how to auto-complete code. Similarly, data that can be written out as text, such as the rows and formulas of a spreadsheet, can be modeled with the same techniques.
As transformer models have been so successful in the text domain, considerable interest has
sparked in other communities to adapt these techniques to other modalities. This has led to
Transformer models being used for tasks such as image recognition, segmentation, and object detection.
Convolutional Neural Networks have been widely used as the go-to state-of-the-art models
for most computer vision techniques. With the introduction of Vision Transformers (ViT)11,
there has been a switch in recent years to explore how to tackle vision tasks with attention
and transformer-based techniques. The original ViT splits an image into a grid of fixed-size patches, which are flattened and treated as tokens, much like the subwords of a sentence; hybrid variants don't discard CNNs entirely and instead use CNN feature maps, divided into tokens, as the input. Either way, the attention mechanism can then learn the relationships between patches in different places of the image.
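Using a pre-trained ViT is just as convenient as using a text model. Here is a sketch with the google/vit-base-patch16-224 checkpoint; the image path is a placeholder for any local image or URL:

from transformers import pipeline

vit_classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(vit_classifier("path/to/your_image.jpg"))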
Unfortunately, ViTs required more data (300 million images!) and compute than CNNs to get
good results. Further work has happened in recent years; for example, DeiT achieved competitive results with much less data thanks to augmentation and regularization techniques common in CNNs. DeiT also uses a distillation approach involving a "teacher" model (a CNN in this case). Other models such as DETR, SegFormer, and Swin Transformer have pushed the field further, supporting tasks such as object detection, semantic segmentation, and more.
As we’ll see in Chapter 9, transformer models can also be used for audio tasks, such as
transcribing audio or generating synthetic speech or music. Under the hood, the same
fundamental principles of pre-training and attention mechanisms persist, but each modality brings its own challenges and architectures. Applying transformers to some other domains is still exploratory, but there are some exciting early results:
● Graphs. Some examples of tasks that involve graph data are predicting the toxicity of molecules or the properties of physical systems.
● 3D data. Some data is inherently three-dimensional, such as LiDAR point clouds in autonomous driving or CT scans for organ segmentation.
● Multimodality. Transformers can also combine multiple types of data (such as text, images, and audio) together. This opens new possibilities, such as multimodal systems where you can speak, write, or provide pictures and have a single model process them. Another example is visual question answering, where a model can answer questions about provided images.
So far, we have relied on the library's built-in generation techniques. To better understand how generation works under the hood, it's time to implement it ourselves. We'll use the generate() method as a reference but implement the logic from scratch. We'll also explore using the generate() method to perform different decoding techniques.

Your goal is to fill in the code of the following function. Rather than use gpt2.generate(), the idea is to iteratively call gpt2(), passing the previous tokens as input. You have to pick the next token from the logits at each step and append it to the running sequence. A possible skeleton (the signature in the supplementary material may differ) is:

def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Generate a continuation of the prompt without calling model.generate()."""
    # Write your code here
Summary
Congratulations! You now have learned to load and use transformers for various tasks! You
also understood how transformers model sequence data such as text and how this property
lets them "learn" valuable representations that we can use to generate or classify new
sequences. As the scale of these models increases, so do their capabilities - to the point where
massive models with hundreds of billions of parameters can now perform many tasks without any task-specific training.
We can pick powerful existing pre-trained models and modify them for specific domains and
use cases thanks to fine-tuning. The trend towards larger and more capable models has
caused a shift in how people use them. Task-specific models are often out-competed by
general-purpose LLMs, and most people now interact with these models via APIs and hosted
solutions or directly via slick chat-based user interfaces. At the same time, thanks to the
release of large and powerful open-access models, such as Llama, there is a strong wave in
the researchers’ and practitioners’ ecosystems aiming to run high-quality models directly in
consumer computers, resulting in privacy-first solutions. This trend extends beyond
inference: novel training approaches that allow individuals to fine-tune these models without
many computational resources have emerged in recent years. Chapter 5 delves into both
Although we covered how transformers work and we’ll dive into their training, diving into
the internals of these models (for example, the math behind attention mechanisms) or how to
pre-train a model from scratch is outside the scope of this book. Luckily for us, there are excellent resources to go further:
● The transformers library documentation and its task guides are a great reference if you want to dive deeper into the internals of fine-tuning these models for specific use cases.
● Hugging Face has a free, open-source course that teaches how to solve different NLP tasks.
If you want to dive more into the GPT family of models, we suggest reviewing the following papers:
● "Improving Language Understanding by Generative Pre-Training", the original GPT paper, published in 2018 by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
● "Language Models are Unsupervised Multitask Learners", the GPT-2 paper, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. It introduced a model with 1.5 billion parameters pre-trained on a large corpus of web text called WebText. The paper also demonstrated that GPT-2 could perform well on various natural language tasks without task-specific training.
● "Language Models Are Few-Shot Learners", the GPT-3 paper, by Tom Brown and others. This paper shows that scaling up language models dramatically improves their ability to perform new language tasks from only a few examples or simple instructions.
Exercises
3. What happens if you use a tokenizer different from the one used with the model?
4. What is the effect of generating with a very low or a very high temperature?
5. How do Top-K and Top-p sampling change the diversity of the generation?
6. What are the differences between encoder-only, decoder-only, and encoder-decoder transformers?
Challenges
9. Use a pre-trained summarization model to generate summaries of a paragraph. How does it compare with the results of using a general-purpose model such as GPT-2 prompted to summarize?
10. In the zero-shot supplementary material, we calculate the confusion matrix for GPT-2 used as a zero-shot classifier. Repeat the evaluation using the distilbert-base-uncased-finetuned-sst-2-english encoder model and compare the results.
11. Let's build a FAQ system! Sentence transformers are powerful models that can determine how similar multiple texts are. While a transformer encoder usually outputs one contextualized embedding per token, sentence transformers output a single embedding for the whole input text, allowing us to determine whether two texts are similar based on the distance between their embeddings. Let's see an example using the sentence_transformers library.
from sentence_transformers import SentenceTransformer, util

sentences = ["...", "..."]  # Two texts to compare

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)
util.pytorch_cos_sim(embedding_1, embedding_2)
tensor([[0.6003]], device='cuda:0')
Write a list of five questions and answers about a topic. Your goal will be to build a system
that, given a new question, can give the user the most likely answer. How can we use
sentence transformers to solve this? The supplemental material contains the solution, but we encourage you to try it yourself first.
References
1. Brown, Tom B., et al. Language Models Are Few-Shot Learners. arXiv, 22 July 2020. arXiv.org, http://arxiv.org/abs/2005.14165
2. Devlin, Jacob, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv.org, http://arxiv.org/abs/1810.04805
3. Dosovitskiy, Alexey, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv.org, http://arxiv.org/abs/2010.11929
5. Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv.org, http://arxiv.org/abs/1904.09751
6. Howard, Jeremy, and Sebastian Ruder. Universal Language Model Fine-Tuning for Text Classification. arXiv.org, http://arxiv.org/abs/1801.06146
10. Raffel, Colin, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv.org, http://arxiv.org/abs/1910.10683
12. Vaswani, Ashish, et al. Attention Is All You Need. arXiv, 1 Aug. 2023. arXiv.org, http://arxiv.org/abs/1706.03762
13. Yang, Jingfeng, et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv.org, http://arxiv.org/abs/2304.13712
1 You can read more about contrastive search in this Hugging Face blog post (https://huggingface.co/blog/introducing-csearch).
2 The first example in the GPT-2 release blog post was famously a news story about unicorns
(https://openai.com/research/better-language-models).
3 Vaswani, Ashish, et al. "Attention Is All You Need"
4 Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension"
5 Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
6 Devlin, Jacob, et al. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding"
7 Yang, Jingfeng, et al. "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond"
8 Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-Tuning for Text Classification"
9 We will build a search system using semantic embeddings in the challenge section of this chapter.
10 DistilBERT is a smaller model that preserves 95% of the original BERT performance while having 40% fewer parameters. RoBERTa is a very powerful BERT-based model trained with more data and an improved training procedure.
11 Dosovitskiy, Alexey, et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"