
Chapter 2. Transformers

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the authors’ raw and

unedited content as they write—so you can take advantage of these technologies long before

the official release of these titles.

This will be the third chapter of the final book. Please note that the GitHub repo will be made

active later on.

If you have comments about how we might improve the content and/or examples in this

book, or if you notice missing material within this chapter, please reach out to the editor at

jleonard@oreilly.com.

Many trace the most recent wave of advances in generative AI to the introduction of a class

of models called transformers in 2017. Their best-known application is powerful Large Language Models (LLMs), such as Llama and GPT-4, which hundreds of millions of people use daily. Transformers have become a backbone for modern AI applications, powering

everything from chatbots and search systems to machine translation and content

summarization. They’ve even branched out beyond text, making waves in fields like

computer vision, music generation, and protein folding. In this chapter, we’ll explore the core
ideas behind transformers and how they work, with a focus on one of the most common

applications: language modeling.

Before we delve into the nitty-gritty of transformers, let’s take a step back and understand

what language modeling is. At its core, a Language Model (LM) is a probabilistic model that

learns to predict the next word (or token) in a sequence based on the preceding or

surrounding words. Doing so captures language’s underlying structure and patterns, allowing

it to generate realistic and coherent text. For example, given the sentence "I began my day eating", a language model might predict the next word as "breakfast" with a high probability.

So, how do transformers fit into this picture? Unlike traditional language models that use

fixed-sized sliding windows or recurrent neural networks (RNNs), transformers are designed

to handle long-range dependencies and complex relationships between words more efficiently

and expressively. For example, imagine that you want to use an LM to summarize a news

article, which might contain hundreds or even thousands of words. Traditional LMs struggle

with long contexts, so the summary might skip critical details from the beginning of the

article. Transformer-based LMs, however, show strong results in this task. Besides

high-quality generations, transformers have other properties, such as efficient parallelization

of training, scalability, and knowledge transfer, making them popular and well-suited for

multiple tasks. At the heart of this innovation lies the self-attention mechanism, which allows

the model to weigh the importance of each word in the context of the entire sequence.

To help us build intuition about how language models work, we’ll use code examples that

interact with existing models, and we’ll describe the relevant pieces as we find them. Let’s

get to it.
A Language Model in Action

In this section, we will load and interact with an existing (pre-trained) transformer model to

get a high-level understanding of how they work. We’ll use the GPT-2 model, which made

headlines in 2019 for its (then) impressive text-generation capabilities. Although small and

almost quaint by today’s standards, GPT-2 is nevertheless a good illustration of how these

language models work. The same principles apply to the larger (over 100 times larger!) and

more powerful models that have since been released.

Tokenizing Text

Let’s begin our journey to generate some text based on an initial input. For example, given

the phrase "it was a dark and stormy", we want the model to generate some words

to continue it. Models can’t receive text directly as input; their input must be data represented

as numbers. To feed text into a model, we must first find a way to turn sequences into

numbers. This process is called tokenization, a crucial step in any NLP pipeline.
An easy option would be to split the text into individual characters and assign each a unique

numerical ID. This scheme could be helpful for languages such as Chinese, where each

character carries much information. In languages like English, this creates a very small token

vocabulary, and there will be very few unknown tokens (characters not found during training)

when running inference. However, this method requires many tokens to represent a string,

which is bad for performance and erases some of the structure and meaning of the text – a

downside for accuracy. Each character carries very little information, making it hard for the

model to learn the underlying structure of the text.

Another approach could be to split the text into individual words. While this lets us capture

more meaning per token, it has downsides: we need to deal with more unknown words (e.g., typos, slang, etc.) and with different forms of the same word (e.g., "run", "runs", "running", etc.), and we might end up with a very large vocabulary, which could easily be over half a million words for languages such as English. Modern tokenization

strategies strike a balance between these two extremes, splitting the text into subwords that

capture both the structure and meaning of the text while still being able to handle unknown

words and different forms of the same word.

Characters that are usually found together (like most frequent words) can be assigned a single

token that represents the whole word or group. Long or complicated words, or words with

many inflections, may be split into multiple tokens, where each one usually represents a

meaningful section of the word. There is no single "best" tokenizer; each language model

comes with its own one. The differences between tokenizers reside in the number of tokens

supported and the tokenization strategy.


Let’s see how the GPT-2 tokenizer handles a sentence to see this in action. We’ll first load the

tokenizer corresponding to GPT-2. Then, we’ll run the input text (also called prompt) through

the tokenizer to encode the string into numbers representing the tokens. We’ll use the

decode() method to convert each ID back into its corresponding token for demonstration

purposes.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("It was a dark and stormy", return_tensors="pt").input_ids
input_ids

tensor([[1026, 373, 257, 3223, 290, 6388, 88]])

for t in input_ids[0]:
    print(t, "\t:", tokenizer.decode(t))

tensor(1026) : It
tensor(373)  : was
tensor(257)  : a
tensor(3223) : dark
tensor(290)  : and
tensor(6388) : storm
tensor(88)   : y
As you can see, the tokenizer splits the input string into a series of tokens and assigns a

unique ID to each. Most words are represented by a single token, but "stormy" is represented by two tokens: one for " storm" (including the space before the word) and one for the suffix "y". This allows the model to learn that "stormy" is related to "storm" and that the suffix "y" is often used to turn nouns into adjectives. With a vocabulary of around 50,000

tokens, the GPT-2 tokenizer can efficiently represent almost any input text and averages

about 1.3 tokens per word.
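If you want to check these numbers yourself, the tokenizer object exposes its vocabulary size, and you can inspect how it splits a longer word. This is a quick illustrative check, not code from the book's repository:

print(tokenizer.vocab_size)                        # 50257 for GPT-2
print(tokenizer.tokenize("internationalization"))  # one long word, split into several subword tokens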

NOTE

Even though we usually talk about training tokenizers, this has nothing to do with training a model.

Model training is stochastic (non-deterministic) by nature, whereas we train a tokenizer using a

statistical process that identifies which subwords are the best to pick for a given dataset. How to

choose the subwords is a design decision of the tokenization algorithm. Therefore, tokenization

training is deterministic. We won’t dive into different tokenization strategies, but some of the most

popular subword approaches are Byte-level BPE (used in GPT-2), WordPiece, and SentencePiece.

Predicting Probabilities
GPT-2 was trained as a causal language model (also known as auto-regressive), which means

it was trained to predict the next token in a sequence given the preceding tokens. The

transformers library has high-level tools that enable us to use such a model to generate text or

perform other tasks quickly. It is helpful to understand how the model makes its predictions

by directly inspecting them on this language-modeling task. We begin by loading the model.

from transformers import AutoModelForCausalLM

gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

NOTE

Note the use of AutoTokenizer and AutoModelForCausalLM. The transformers library

supports hundreds of models and their corresponding tokenizers. Rather than learning the name of

each tokenizer and model class, we will use AutoTokenizer and AutoModelFor*.
For the automatic model, we need to specify for which task we’re using the model, such as

classification (AutoModelForSequenceClassification) or object detection

(AutoModelForObjectDetection). In the case of GPT-2, we’ll use the class corresponding to

the causal language modeling task. When using the automatic classes, transformers will pick an

adequate default class based on the configuration of a model. For example, under the hood, they will

use GPT2Tokenizer and GPT2LMHeadModel.

If we feed the tokenized sentence from the previous section through the model, we get a

result back with 50,257 values for each token in the input string:

outputs = gpt2(input_ids)

outputs.logits.shape # An output for each input token

torch.Size([1, 7, 50257])

The first dimension of the output is the number of batches (1 because we just ran a single

sequence through the model). The second dimension is the sequence length, or the number of

tokens in the input sequence (7 in our case). The third dimension is the vocabulary size. We

get a list of ~50 thousand numbers for each token in the original sequence. These are the raw
model outputs, or logits, that correspond to the tokens in the vocabulary. For every input

token, the model predicts how likely each token in the vocabulary is to continue the sequence

up to that point. With our example sentence, the model will predict logits for "It", "It was", "It was a", and so on. Higher logit values mean the model considers the corresponding token a more likely continuation of the sequence. The following table shows the input sequences, the most likely next-token ID, and its corresponding token.

Logits are the raw output of the model (a list of numbers such as [0.1, 0.2, 0.01, …]). We can

use the logits to select the most likely token to continue the sequence. However, we can also

convert the logits into probabilities, as we’ll see soon.

Input Sequence               ID of most likely next token   Corresponding token
It                           318                            is
It was                       257                            a
It was a                     845                            very
It was a dark                1755                           night
It was a dark and            4692                           cold
It was a dark and storm      88                             y
It was a dark and stormy     1755                           (let's figure this one out)
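You can reproduce this table with a single forward pass: the logits at position i are the model's prediction for the token that follows the first i + 1 input tokens. The following loop is an illustrative sketch (not from the book's repository), reusing the gpt2 model, tokenizer, and input_ids defined above:

outputs = gpt2(input_ids)
for i in range(input_ids.shape[1]):
    prefix = tokenizer.decode(input_ids[0, : i + 1])     # the input sequence up to position i
    next_id = outputs.logits[0, i].argmax()              # most likely continuation at that point
    print(f"{prefix!r} -> {next_id.item()} ({tokenizer.decode(next_id)!r})")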

Let’s focus on the logits for the entire input sentence and see how to predict the next word of

the sentence. We can find the index of the token with the highest value using the argmax()

method:

final_logits = gpt2(input_ids).logits[0, -1]  # The last set of logits
final_logits.argmax()  # The position of the maximum

tensor(1755)

1755 corresponds to the ID of the token the model considers most likely to follow the input

string "It was a dark and stormy“. Decoding this token, we can see that this model

knows a few story tropes:


tokenizer.decode(final_logits.argmax())

' night'

So " night" is the most likely token. This makes sense considering the beginning of the

sentence we provided as input. The model learns how to pay attention to other tokens using

an algorithm called self-attention, which is the fundamental building block of transformers.

Intuitively, self-attention allows the model to identify how much each token contributes to the

meaning of the phrase.

NOTE

Transformer models contain many of these attention layers, each one specializing in some aspect of

the input. Contrary to heuristic systems, these aspects or features are learned during training, instead

of being specified beforehand.

Let’s now see which other tokens were potential candidates by selecting the top 10 values:
import torch

top10_logits = torch.topk(final_logits, 10)
for index in top10_logits.indices:
    print(tokenizer.decode(index))

night
day
evening
morning
afternoon
summer
time
winter
weekend
,

We’ll need to convert logits into probabilities to see how confident the model is about each

prediction. We’d do that by comparing each value with all the other predicted values and

normalizing so all the numbers sum up to 1. That’s precisely what the softmax() operation

does. The following code uses softmax() to print out the top 10 most likely tokens and

their associated probabilities according to the model:


top10 = torch.topk(final_logits.softmax(dim=0), 10)
for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<10} {value.item():.2%}")

night      46.18%
day        23.46%
evening    5.87%
morning    4.42%
afternoon  4.11%
summer     1.34%
time       1.33%
winter     1.22%
weekend    0.39%
,          0.38%

Before going further, we suggest experimenting with the code above. Here are some ideas for you to try:

● Change a few words: Try changing the adjectives (e.g., "dark" and "stormy") in the input string and see how the model's predictions change. Is the predicted word still "night"? How do the probabilities change?

● Change the input string: Try different input strings and see how the model’s

predictions change. Do you agree with the model’s predictions?

● Grammar: What happens if you provide a string that is not a grammatically

correct sentence? How does the model handle it? Look at the probabilities of the

top predictions.
Generating Text

Once we know how to get the model’s predictions for the next token in a sequence, it is easy

to generate text by repeatedly feeding the model’s predictions back into itself. We can call

gpt2(ids), generate a new token ID, add it to the list, and call the function again. To make

it more convenient to generate multiple words, transformers auto-regressive models have a

generate() method ideal for this case. Let’s explore an example.
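Here is what that manual loop could look like, written out by hand as an illustrative sketch (reusing the gpt2 model, tokenizer, and input_ids from earlier); the generate() method we use next packages this logic for us:

ids = input_ids
for _ in range(20):
    next_id = gpt2(ids).logits[0, -1].argmax()          # greedy: pick the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append it and feed the sequence back in
print(tokenizer.decode(ids[0]))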

output_ids = gpt2.generate(input_ids, max_new_tokens=20)
decoded_text = tokenizer.decode(output_ids[0])

print("Input IDs", input_ids[0])
print("Output IDs", output_ids)
print(f"Generated text: {decoded_text}")

Input IDs tensor([1026, 373, 257, 3223, 290, 6388, 88])
Output IDs tensor([[1026, 373, 257, 3223, 290, 6388, 88, 1755, 13, 383, 2344, 373, 19280, 11, 290, 262, 15114, 547, 7463, 13, 383, 2344, 373, 19280, 11, 290, 262]])
Generated text: It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the
When we ran the gpt2() forward method in the previous section, it returned a list of logits

for each token in the vocabulary (50257). Then, we had to calculate the probabilities and pick

the most likely token. generate() abstracts this logic away. It makes multiple forward

passes, predicts the next token repeatedly, and appends it to the input sequence.

generate() provides us with the token IDs of the final sequence, including both the input

and new tokens. Then, with the tokenizer decode() method, we can convert it back to text.

There are many possible strategies to perform generation. The one we just did, picking the

most likely token, is called greedy decoding. Although this approach is straightforward, it can

sometimes lead to suboptimal outcomes, especially in generating longer text sequences.

Greedy decoding can be problematic because it doesn’t consider the overall probability of a

sentence, focusing only on the immediate next word. For instance, given the starting word

Sky and the choices blue and rockets for the next word, greedy decoding might favor

Sky blue since blue initially seems more likely following Sky. However, this approach

might overlook a more coherent and probable overall sequence like Sky rockets soar.

Therefore, greedy decoding can sometimes miss out on the most likely overall sequence,

leading to less optimal text generation.

Rather than one token at a time, techniques such as beam search explore multiple possible

continuations of the sequence and return the most likely sequence of continuations. It keeps

the most likely num_beams of hypotheses during generation and chooses the most likely

one.
beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

It was a dark and stormy night.

"It was dark and stormy," he said.

"It was dark and stormy," he said.

As you noticed, the output includes many repetitions of the same sequence. There are

multiple parameters we can control to perform better generations. Let’s see two examples:

● repetition_penalty - how much to penalize already generated tokens,

avoiding repetition. A good default value is 1.2.

● bad_words_ids - a list of tokens that should not be generated (e.g., to avoid

generating offensive words).

Let’s see what we can achieve by penalizing repetition:


beam_output = gpt2.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=1.2,
    max_new_tokens=38,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

It was a dark and stormy night.

"There was a lot of rain," he said. "It was very cold."

He said he saw a man with a gun in his hand.

This is much better. Which generation strategy to use? As often in Machine Learning… it

depends. Beam search works well when the desired length of the text is somewhat

predictable. This is the case for tasks such as summarization or translation but not for

open-ended generation, where the output length can vary greatly, leading to repetition.

Although we can penalize the model to keep it from repeating itself, doing so can also degrade quality. Also note that beam search will be slower than greedy search as it needs to

run inference for multiple beams simultaneously, which can be an issue for large models.

When we generate with greedy search and beam search, we push the model to generate text

with a distribution of high-probability next words. Interestingly, high-quality human language


does not follow a similar distribution. Human text tends to be more unpredictable. An

excellent paper about this counter-intuitive observation is The Curious Case of Neural Text

Degeneration. The authors conjecture that human language disfavors predictable words -

people optimize against stating the obvious. The paper proposes a method called nucleus

sampling.

With sampling, we pick the next token by sampling from the probability distribution of the

next tokens. This means that sampling is not a deterministic generation process. If the next

possible tokens are night (60%), day (35%), and apple (5%), rather than choosing night (with

greedy search), we will sample from the distribution. In other words, there will be a 5%

chance of picking "apple" even if it’s a low-probability token and leads to a nonsensical

generation. Sampling avoids creating repetitive text, hence leading to more diverse

generations. Sampling is done in transformers using the do_sample parameter.

from transformers import set_seed

# Setting the seed ensures we get the same results every time we run this code
set_seed(70)

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=34,
    top_k=0,  # We'll come back to this parameter
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy day until it broke down the big canvas on my sleep station, making me money dilapidated, and, with a big soothing mug

We can manipulate the probability distribution before we sample from it, making it

"sharper" or "flatter" using a temperature parameter. A temperature higher than

one will increase the randomness of the distribution, which we can use to encourage

generation of less probable tokens. A temperature between 0 and 1 will reduce the

randomness, increasing the probability of the more likely tokens and avoiding predictions

that might be too unexpected. A temperature of 0 will move all the probability to the most

likely next token, which is equivalent to greedy decoding. Compare the effect of this

temperature parameter on the generated text in the following example.
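Before looking at generations, here is an illustrative sketch of how temperature reshapes the probability distribution itself (reusing the final_logits computed earlier; not code from the book's repository):

for temperature in [0.5, 1.0, 2.0]:
    probs = (final_logits / temperature).softmax(dim=0)   # divide the logits, then normalize
    top_prob, top_id = probs.max(dim=0)
    print(f"T={temperature}: {tokenizer.decode(top_id)!r} has probability {top_prob.item():.2%}")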


sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.4,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy night, and I was alone. I was in the middle of the night, and I was suddenly awakened bygoodness, and I was thinking of the old man

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=0.001,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy night. The wind was blowing, and the clouds were falling. The wind was blowing, and the clouds were falling. The wind was blowing, and the clouds were


sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    temperature=3.0,
    max_length=40,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy corporation street compliment ideallylake amended Churchill ty set crou 175 dualKing Bucc ceiling wrapped.......my tryhouse fragileREG Robinson lower display magn Simon spectral warmth HP274 Lur Welsh

Well, the first test is much more coherent than the second one. The second, which uses a very

low temperature, is repetitive (similar to greedy decoding). Finally, the third sample, with an

extremely high temperature, gives gibberish text.

One parameter you likely noticed is top_k. What is it? Top-K sampling is a simple sampling

approach in which only the K most likely next tokens are considered. For example, using
top_k=5, the generation method will first filter the most likely five tokens and redistribute

the probabilities so they add to one.
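As an illustrative sketch of that filtering-and-renormalization step (again reusing the final_logits computed earlier), here is what the renormalized top-5 distribution looks like:

top5 = torch.topk(final_logits, 5)
top5_probs = top5.values.softmax(dim=0)   # renormalize so the five probabilities sum to one
for prob, index in zip(top5_probs, top5.indices):
    print(f"{tokenizer.decode(index):<10} {prob.item():.2%}")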

sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_k=10,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy night and I was not expecting to be here at 9:30 AM. It felt cold and rainy. I didn't know why I was here. There was no

Hmm…this could be better. An issue with Top-K Sampling is that the number of relevant

candidates in practice could vary greatly. If we define top_k=5, some distributions will still

include tokens with very low probability, while others will consist of only high-probability

tokens.

The final generation strategy we’ll visit is Top-p sampling (also known as nucleus sampling).

Rather than sampling from the K words with the highest probability, we sample from the smallest set of most likely words whose cumulative probability exceeds a given value. If we use top_p=0.94, we first keep the most likely words until their cumulative probability reaches 0.94, then redistribute the probability among them and do regular sampling. Let's see it in action.
sampling_output = gpt2.generate(
    input_ids,
    do_sample=True,
    max_length=40,
    top_p=0.94,
    top_k=0,
)
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

It was a dark and stormy hour, a formation of what looked like beggar to an armoire-upper of the home that flickered down the cobbled main road, leaned slowly against
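To build intuition for what top_p does under the hood, the following illustrative sketch (not the library's internal implementation) counts how many tokens make up the 0.94 nucleus for our example logits:

probs = final_logits.softmax(dim=0)
sorted_probs, sorted_ids = probs.sort(descending=True)
cumulative = sorted_probs.cumsum(dim=0)
nucleus_size = int((cumulative < 0.94).sum()) + 1   # smallest set whose cumulative probability reaches 0.94
print(nucleus_size, tokenizer.decode(sorted_ids[:5]))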

Both Top-K and Top-p are commonly used in practice. They can even be combined to filter out low-probability words while retaining more control over generation. The issue with stochastic generation methods is that the generated text isn't necessarily coherent.

We’ve seen three different generation methods: greedy search, beam-search decoding, and

sampling (with temperature, Top-K, and Top-p providing further control). Those are lots of

approaches! If you want to further experiment with generation, here are some suggestions to

experiment with:
● Experiment with different parameter values. How does increasing the number of

beams impact the quality of your generation? What happens if you reduce or

increase your top_p value?

● One approach to reduce repetition in Beam Search is introducing penalties for n-grams (sequences of n words). This can be configured using no_repeat_ngram_size, which avoids repeating the same n-gram. For example, if you use no_repeat_ngram_size=4, the generation will never contain the same sequence of four consecutive tokens more than once.

● A newer method, contrastive search, can generate long, coherent output while

avoiding repetition. This is achieved by considering both the probabilities

predicted by the model and the similarity with the context. This can be controlled

via penalty_alpha and top_k.1

If all of this sounds too empirical, it’s because it is. Generation is an active area of research,

with new papers coming up with different proposals, such as more sophisticated filtering.

We’ll briefly discuss these in the final chapter. No single rule works for all models, so it’s

always important to experiment with different techniques.

Zero-Shot Generalization

Generating language is a fun and exciting application of transformers, but writing fake

articles about unicorns2 is not the reason why they are so popular. To predict the next token

well, these models must learn a fair amount about the world. We can take advantage of this to
perform various tasks. For example, instead of training a model dedicated to translation, we

can prompt a sufficiently powerful language model with an input like:

Translate the following sentence from English to French:

Input: The cat sat on the mat.

Translation:

I typed this example with GitHub Copilot active, and it helpfully suggested "Le chat

était assis sur le tapis" as a continuation of the above prompt - a perfect

illustration of how a language model can perform tasks not explicitly trained for. The more

powerful the model, the more tasks it can perform without additional training. This flexibility

makes transformers quite powerful and has made them so popular in recent years.

To see this in action for ourselves, let’s use GPT-2 as a classification model. Specifically,

we’ll classify movie reviews as positive or negative - a classic benchmark task in the NLP

field. We’ll use a zero-shot approach to make things interesting, which means we won’t

provide the model with any labeled data. Instead, we’ll prompt the model with the text of a

review and ask it to predict the sentiment. Let’s see how it does.
To do this, we’ll insert the review into a prompt template that provides context for the model

and helps it understand what we’re asking it to do. After feeding the prompt through the

model, we’ll look at its prediction for the next token and see which possible token is assigned

a higher probability: "positive" or "negative"? To do that, let's find the IDs

corresponding to those tokens.

# Check the token IDs for the words ' positive' and ' negative'
# (note the space before the words)
tokenizer.encode(" positive"), tokenizer.encode(" negative")

([3967], [4633])

Once we have the IDs, we can now run inference with the model and see which token has a

higher probability:

def score(review):
    """Predict whether a review is positive or negative.

    This function predicts whether a review is positive or negative
    using a bit of clever prompting. It looks at the logits for the
    tokens ' positive' and ' negative' (note the space before the
    words), and returns the label with the highest score.
    """
    prompt = f"""Question: Is the following review positive or negative about the movie?
Review: {review} Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    final_logits = gpt2(input_ids).logits[0, -1]
    if final_logits[3967] > final_logits[4633]:
        print("Positive")
    else:
        print("Negative")

Tokenize the prompt.

Get the logits for each token in the vocabulary. Note that we're using gpt2() rather than gpt2.generate(), as gpt2() returns the logits for each token in the vocabulary, while gpt2.generate() returns only the chosen token.

Check if the logit for the ' positive' token is higher than the logit for the ' negative' token.
We can try out this zero-shot classifier on a few fake reviews to see how it does:

score("This movie was terrible!")

Negative

score("That was a delight to watch, 10/10 would recommend :)")

Positive

score("A complex yet wonderful film about the depravity of


man") # A mistake

Negative

In the supplementary material, you’ll find a dataset of labeled reviews and code to assess the

accuracy of this zero-shot approach. Can you tweak the prompt template to improve the
model’s performance? Can you think of other tasks that could be performed using a similar

approach?

The zero-shot capabilities of recent models have been a game-changer. As the models

improve, they can perform more tasks out-of-the-box, making them more accessible and

easier to use and reducing the need for specialized models for each task.

Few-Shot Generalization

Despite the release of ChatGPT and the quest for the perfect prompts, zero-shot

generalization (or prompting) is not the only way to bend powerful language models to

perform arbitrary tasks.

Zero-shot is the extreme application of a technique called few-shot generalization, in which

we provide the language model a few examples about the task we want it to perform and then

ask it to provide similar answers for us. Instead of training the model, we show some

examples to influence generation by increasing the probability that the continuation text

follows the same structure and pattern as our prompt.


Let’s see an example. Apart from providing examples, providing a short description of what

the model should do, e.g., "Translate English to French", will help with

higher-quality generations. This time, we’ll use a more robust model: GPT-Neo 1.3B.

GPT-Neo is a family of transformer models from EleutherAI, a non-profit research lab. These

models outperform GPT-2 in many tasks and tend to do few-shot learning better. We’ll use

the variant with 1.3 billion parameters, small by today's standards, but still quite powerful and about ten times larger than GPT-2, which has just 124 million parameters.

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

prompt = """\
Translate English to Spanish:
English: I do not speak Spanish.
Spanish: No hablo español.
English: See you later!
Spanish: ¡Hasta luego!
English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?
English: What rooms do you have available?
Spanish: ¿Qué habitaciones tiene disponibles?
English: I like soccer
Spanish:"""

inputs = tokenizer(prompt, return_tensors="pt").input_ids

output = model.generate(
    inputs,
    do_sample=False,
    max_new_tokens=10,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Translate English to Spanish:
English: I do not speak Spanish.
Spanish: No hablo español.
English: See you later!
Spanish: ¡Hasta luego!
English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?
English: What rooms do you have available?
Spanish: ¿Qué habitaciones tiene disponibles?
English: I like soccer
Spanish: Me gusta el fútbol

We state the task we want to achieve and provide four examples to set the context for the

model. Hence, this is a 4-shot generalization task. Then, we ask the model to generate more

text to follow the pattern and provide the requested translation. Some ideas to explore:

● Would this work with fewer examples?

● Would it work without the task description?

● How about other tasks?

● How does GPT-2 score in this setting?

NOTE
GPT-2, given its size and training process, is not very good at few-shot tasks, and it’s even worse at

zero-shot generalization. How is it possible that we managed to use it for sentiment classification in

our previous example? We cheated a bit: we didn’t look at the text generated by the model, just

checked whether the probability for " positive" was larger than the probability for " negative".

Understanding how models work under the hood can unlock powerful applications even with small

models. Remember to think about your problem; don’t be afraid to explore.

GPT-2 is an example of a base model. Some base models in the style of GPT-2 have zero-shot

and few-shot capabilities that we can use at inference time. Another approach is to fine-tune

a model: we take the base model and keep training it a bit longer on domain or task-specific

data. We rarely need the extreme generalization capabilities showcased by the most powerful

models in the world; if you only want to solve a particular task, it will usually be cheaper and

better to fine-tune and deploy a smaller model specialized on a single task. It’s also important

to note that base models are not conversational; although you can write a very nice prompt

that will help make a chatbot with a base model, it’s often more convenient to fine-tune the

base model with conversational data, hence improving the conversational capabilities of the

model. That’s precisely what we’ll do in Chapter 5.

A Transformer Block
After our brief experiments using language models, we are ready to introduce an architecture

diagram for transformer-based language generation models. The high-level pieces involved

include:

● Tokenization. The input text is broken down into individual tokens (which can be

words and subwords). Each token has a corresponding ID used to index the token

embeddings.

● Input Token Embedding. The tokens are represented as vectors called

embeddings. These embeddings serve as numerical representations that capture

the semantic meaning of each token. You can think of vectors as lists of numbers,

where each number corresponds to a particular aspect of the token’s meaning.

During training, a model learns how to map each token to its corresponding

embedding. The embedding will always be the same for each token, regardless of

its position in the input sequence.

● Positional Encoding. The transformer model has no notion of order, so we need

to enrich the token embeddings with positional information. This is done by

adding a positional encoding to the token embeddings. The positional encoding is

a set of vectors that encode the position of each token in the input sequence. This

allows the model to differentiate between tokens based on their position in the

sequence, which can be useful as the same token appearing in different places can

have different meanings.

● Transformer blocks: The core of the transformer model is the transformer block.

The power of transformers comes from stacking multiple blocks, allowing the

model to learn increasingly complex and abstract relationships between the input

tokens. It consists of two main components:

● Self-Attention Mechanism. This mechanism allows the model to

weigh the importance of each token in the context of the entire

sequence. It helps the model understand the relationships between


different tokens in the input. The self-attention mechanism is the key to

the transformer’s ability to handle long-range dependencies and

complex relationships between words, and it helps generate coherent

and contextually appropriate text.

● Feed-Forward Neural Network. The self-attention output is passed

through a feed-forward neural network, which further refines the

representation of the input sequence.

● Contextual Embeddings. The output of the transformer block is a set of

contextual embeddings that capture the relationships between tokens in the input

sequence. Unlike the input embeddings, which are fixed for each token, the

contextual embeddings are updated at each layer of the transformer model based

on the relationships between tokens.

● Prediction. An additional layer processes the final representation into a

task-dependent final output. In the case of text generation, this involves having a

linear layer that maps the contextual embeddings to the vocabulary space,

followed by a softmax operation to predict the next token in the sequence.


Figure 2-1. Architecture of a transformer-based language model
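To make these pieces concrete, here is a minimal, illustrative sketch of a single transformer block built from standard PyTorch modules. It is a simplification for intuition (pre-norm layout, default sizes chosen to echo GPT-2's 768-dimensional embeddings), not the actual implementation of any particular model:

import torch.nn as nn

class TransformerBlock(nn.Module):
    """A single decoder-style block: self-attention followed by a feed-forward network."""

    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x, attn_mask=None):
        # Self-attention lets every position weigh the other positions
        # (a causal mask restricts it to the left context during language modeling).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out                  # residual connection
        # The feed-forward network refines each position's representation.
        x = x + self.ffn(self.ln2(x))     # residual connection
        return x

Stacking many such blocks, plus the embedding layers below and the prediction head on top, gives the overall architecture in Figure 2-1.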
Of course, this is a simplification of the transformer architecture. Diving into the internals of

how self-attention works or the internals of the transformer block is beyond the scope of this

book. However, understanding the high-level architecture of a transformer model can be

helpful to grasp how these models work and how they can be applied to various tasks. This

architecture has enabled transformers to achieve unprecedented performance in various tasks

and domains, and you’ll see them cropping up again and again –not only in the rest of this

book, but also in the discipline as a whole.

Transformer Models Genealogy

Sequence-To-Sequence Tasks

At the beginning of the chapter, we experimented with GPT-2 to auto-regressively generate

text. GPT-2, an example of a decoder-based transformer, has a single stack of transformer

blocks that process an input sequence. This is a popular approach today, but the original

transformer paper, "Attention Is All You Need",3 used a more complicated architecture called the

encoder-decoder architecture, which is still in common use today.


The transformer paper focused on machine translation as the example sequence-to-sequence

task. The best results in machine translation at the time were achieved by recurrent neural

networks (RNNs), such as LSTM and GRU (don’t worry if you’re unfamiliar with them). The

paper demonstrated better results by focusing solely on the attention method and showed that

scalability and training were much easier. These factors –excellent performance, stable

training, and easy scalability– are why transformers took off and were adapted to multiple

tasks, as the next section explores in more depth.

In encoder-decoder models, like the original transformer model described in the paper, one

stack of transformer blocks, called encoder, processes an input sequence into a set of rich

representations, which are then fed into another stack of transformer blocks, called decoder,

that decodes them into an output sequence. This approach to convert one sequence into a

different one is called sequence-to-sequence or seq2seq and is naturally well suited for tasks

such as translation, summarization, or question-answering.

For example, you feed an English sentence through the encoder of a translation model, which

generates a rich embedding that captures the meaning of the input. Then, the decoder

generates the corresponding French sentence using this embedding. The generation happens

in the decoder one token at a time, as we saw when generating sequences earlier in the

chapter. However, the predictions for each successive token are informed not just by the

previous tokens in the sequence being generated but also by the output from the encoder.

The mechanism by which the output from the encoder side is incorporated into the decoder

stack is called cross-attention. It resembles self-attention, except that each token in the input
(the sequence being processed by the decoder) attends to the context from the encoder rather

than other tokens in its sequence. The cross-attention layers are interleaved with

self-attention, allowing the decoder to use both contexts within its sequence and the

information from the encoder.

After the transformer paper, existing sequence-to-sequence models, such as Marian NMT,

incorporated these techniques as a central part of their architecture. New models were

developed using these ideas. A notable one is BART (short for "Bidirectional and

Auto-Regressive Transformers"4). During pre-training, BART corrupts input

sequences and attempts to reconstruct them in the decoder output. Afterward, BART is

fine-tuned for other generation tasks, such as translation or summarization, leveraging the

rich sequence representations achieved during pre-training. Input corruption, by the way, is

one of the key ideas behind diffusion models, as we’ll see in Chapter 3.

Another notable sequence-to-sequence model is T5.5 T5 approaches the multitude of NLP

tasks in a general way by formulating 60 of them as text-to-text transformations. No custom

layers or code are required for different tasks, training uses the same hyperparameters, and

the model learns from a very diverse dataset.
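As a quick taste of an encoder-decoder model in practice, the pipeline API can run a small T5 checkpoint for translation. This is an illustrative sketch, and the checkpoint choice is just an example:

from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
translator("The cat sat on the mat.")  # returns a list with one dict containing 'translation_text'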

We just discussed encoder-decoder and decoder-only architectures. A common question is

why one might need an encoder-decoder model for tasks like translation if decoder-only

models like GPT-2 can show good results. Encoder-decoder models are designed to translate

an entire input sequence to an output sequence, making them well-suited for translation. In
contrast, decoder-only models focus on predicting the next token in a sequence. Initially,

decoder-only models like GPT-2 were less capable in zero-shot learning scenarios than more

recent models like GPT-3, but this was due to more than just the absence of an encoder. The

improvement in zero-shot capabilities in advanced models like GPT-3 is also due to larger

training data, better training techniques, and increased model sizes. While encoders in

seq2seq models play a crucial role in understanding the full context of input sequences,

advancements in decoder-only models have made them more effective and versatile, even for

tasks traditionally relying on seq2seq models.

Encoder-only models

As we’ve seen, the original transformer model was based on an encoder-decoder architecture

that has been further explored in models such as BART or T5. In addition, the encoder or the

decoder can be trained and used independently, giving rise to distinct transformer families.

The first sections of this chapter explored decoder-only, or autoregressive models. These

models are specialized in text generation using the techniques we described and have shown

impressive performance, as demonstrated by ChatGPT, Claude, Llama, or Falcon.

Encoder models, on the other hand, are specialized in obtaining rich representations from text

sequences and can be used for tasks such as classification or to prepare semantic embeddings

(usually a vector of a few hundred numbers) for a multitude of documents that can be used in

retrieval systems. The best-known transformer encoder model is probably BERT6, which
introduced the masked language model objective that was later picked up and further

explored by BART.

Causal language modeling predicts the next token given the previous ones - it’s what we did

with GPT-2. The model can only attend to the context on the left of a given token. A different

approach used in encoder models is called masked language modeling (MLM). Masked

language modeling, proposed in the famous BERT paper, pre-trains a model to learn to

"fill in the blanks“. Given an input text, we randomly mask some tokens, and the

model must predict the hidden tokens. Unlike causal language modeling, MLM uses both the

sequence at the masked token’s left and right, hence the B of "bidirectional" in BERT’s

name. This helps create strong representations of the given text. Under the hood, these

models use the encoder part of the transformer’s architecture.

from transformers import pipeline

fill_masker = pipeline(model="bert-base-uncased")
fill_masker("The [MASK] is made of milk.")

[{'score': 0.19546695053577423,
  'token': 9841,
  'token_str': 'dish',
  'sequence': 'the dish is made of milk.'},
 {'score': 0.1290755718946457,
  'token': 8808,
  'token_str': 'cheese',
  'sequence': 'the cheese is made of milk.'},
 {'score': 0.10590697824954987,
  'token': 6501,
  'token_str': 'milk',
  'sequence': 'the milk is made of milk.'},
 {'score': 0.04112089052796364,
  'token': 4392,
  'token_str': 'drink',
  'sequence': 'the drink is made of milk.'},
 {'score': 0.03712352365255356,
  'token': 7852,
  'token_str': 'bread',
  'sequence': 'the bread is made of milk.'}]

What happens under the hood? The encoder receives the input sequence and generates a

contextualized representation for each token. This representation is a vector of numbers that

captures the meaning of the token in the context of the entire sequence. The encoder is

usually followed by a task-specific layer that uses the representations to perform tasks such as

classification, question answering, or masked language modeling. The encoder is trained to

generate representations that are useful for understanding-heavy tasks.
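To peek at these representations directly, we can load the bare encoder (without any task-specific head) and inspect the shape of its output. This is an illustrative sketch rather than code from the book's repository:

import torch
from transformers import AutoModel, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer("Transformers are great for NLP.", return_tensors="pt")
with torch.no_grad():
    outputs = bert_model(**inputs)
outputs.last_hidden_state.shape  # one contextual embedding (a vector of 768 numbers) per token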

Between encoder-only, decoder-only, and encoder-decoder models, we’ve seen a large

number of new open and closed language models, such as GPT-4, Mistral, Falcon, Llama 2,

Qwen, Yi, Claude, Bloom, PaLM, and hundreds more. Yann LeCun posted a delightful genealogy diagram on Twitter, taken from a survey paper,7 that shows transformers' rich and fruitful impact on the NLP landscape as of 2024.


The Power of Pre-training

The Key Insights of Transformers

Having access to existing models is quite powerful. In the previous sections, we explored

using GPT-2 and GPT-Neo to generate text and perform zero-shot classification.

Transformer models have shown state-of-the-art performance across many other language

tasks, such as text classification, machine translation, and answering questions based on an

input text. Why do transformers work so well?

The first insight is the usage of the attention mechanism, as hinted in the chapter introduction.

Previous NLP methods, such as recurrent neural networks, struggled to handle long

sentences. Attention mechanisms allow transformer models to attend to long sequences

and learn long-range relationships. In other words, transformers can estimate how relevant

some tokens are to other tokens.


The second key aspect is their ability to scale. The transformer architecture has an

implementation optimized for parallelization, and research has shown that these models can

scale to handle high-complexity and high-scale datasets. Although initially designed for text

data, the transformer architecture can be flexible enough to support different data types and

handle irregular inputs.

The third key insight is the ability to do pre-training and fine-tuning. Traditional approaches

to a task, such as movie review classification, were limited by the availability of labeled data.

A model would be trained from scratch on a large corpus of labeled examples, attempting to

predict the label from the input text directly. This approach is often referred to as supervised

learning. However, it has a significant drawback: it requires a large amount of labeled data to

train effectively. This is a problem because labeled data is expensive to obtain and

time-consuming to label. There might not even be any available data in many domains.

To address this, researchers began looking for a way to pre-train models on existing data that

could then be fine-tuned (or adjusted) for a specific task. This approach is known as transfer

learning and is the foundation of modern ML in many fields, such as Natural Language

Processing and Computer Vision. Initial works in NLP focused on finding domain-specific

corpora for the language model pre-training phase, but papers such as ULMFiT8 showed that

even pre-training on generic text such as Wikipedia could yield impressive results when the

models were fine-tuned on downstream tasks, such as sentiment analysis or question

answering. This set the stage for the rise of transformers, which turned out to be highly

well-suited to learning rich representations of language.


The idea of pre-training is to train a model on a large unlabeled dataset and then fine-tune it

to a new target task, for which one would require much less labeled data. Before graduating

to NLP, transfer learning had already been very successful with the Convolutional Neural

Networks that form the backbone of modern Computer Vision. In this scenario, one first

trains a large model with a massive amount of labeled images in a classification task.

Through this process, the model learns common features that can be leveraged on a different

but related problem. For example, we can pre-train a model on thousands of classes and then

fine-tune it to classify whether a picture is of a hot dog.

With transformers, things are taken further with self-supervised pre-training. We can pre-train

a model on large, unlabeled text data. How? Let’s think about causal models such as GPT.

The model predicts which is the next word. Well, we don’t need any labels to obtain training

data. Given a corpus of text, we can mask the tokens after a sequence and train the model to

learn to predict them. Like in the computer vision case, pre-training gives the model a

meaningful representation of the underlying text. We can then fine-tune the model to perform

another task, such as generating text in the style of our Tweets or a specific domain (e.g.,

your company chat). Given the model has already learned a representation of language,

fine-tuning will require much less data than if we trained from scratch.

For many tasks, a rich representation of the input is more important than being able to predict

the next token. For example, if you want to fine-tune a model to predict the sentiment of a

movie review, masked language models would be more powerful. Models such as GPT-2 are

designed to optimize for text generation rather than for building powerful representations of

the text. On the other hand, models such as BERT are ideal for this task. As briefly mentioned

before, the last layer of an encoder model outputs a dense representation of the input
sequence, called an embedding. This embedding can then be leveraged by adding a small,

simple network on top of the encoder and fine-tuning the model for the specific task. As a

concrete example, we can add a simple linear layer on top of the BERT encoder output to

predict the sentiment of a document. We can take this approach to tackle a wide range of

tasks:

● Token classification. Identify each entity in a sentence, such as a person, location,

or organization.

● Extractive question answering. Given a paragraph, answer a specific question

and extract the answer from the input.

● Semantic search. The features generated by the encoder can be handy to build a search system. Given a database of a hundred documents, we can compute the embeddings for each. Then, at inference time, we can compare the input embedding with the documents' embeddings, identifying the most similar document in the database.9 See the sketch after this list.

● And many others, including text similarity, anomaly detection, named entity

linking, recommendation systems, and document classification.
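To make the semantic search idea concrete, here is a toy, illustrative sketch that mean-pools the encoder's outputs into one vector per text and ranks documents by cosine similarity. The checkpoint and example texts are arbitrary, and a dedicated sentence-embedding model would work much better in practice:

import torch
from transformers import AutoModel, AutoTokenizer

encoder_name = "bert-base-uncased"  # example checkpoint; any encoder model would do
enc_tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)

def embed(texts):
    # Mean-pool the encoder's last hidden state into one vector per text.
    inputs = enc_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = ["The cat sat on the mat.", "Transformers power modern NLP.", "I love cooking pasta."]
query = ["What technology is behind language models?"]
scores = torch.nn.functional.cosine_similarity(embed(query), embed(docs))
print(docs[scores.argmax().item()])  # the document most similar to the query

For classification, a fine-tuned encoder can be used through the same pipeline API: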

from transformers import pipeline

classifier = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("This movie is disgustingly good !")

[{'label': 'POSITIVE', 'score': 0.9998536109924316}]

This classification model can analyze reviews and do the same as in the zero-shot

classification section. The challenge section of this chapter shows how to evaluate

classification models and compare between a zero-shot setup and this fine-tuned model.

Transformers recap

We’ve discussed three types of architectures.

● Encoder-based architectures, such as BERT, DistilBERT, and RoBERTa10, are

ideal for tasks that require understanding the entire input. These models output

contextualized embeddings that capture the meaning of the input sequence. We


can then add a small network on top of these embeddings and train it for a new

specific task that relies on the semantic information (such as identifying entities in

the text or classifying the sequence).

● Decoder-based architectures, such as GPT-2, Falcon, and Llama, are ideal for

new text generation.

● Encoder-decoder architectures, or seq2seq, such as BART and T5, are great for

tasks that require generating new sentences based on a given input, such as

summarization or translation.

"Wait." - you might say - "I can do all of these tasks with ChatGPT or

Llama“. That’s true - given the vast (and growing) amount of training data, computing, and

training optimizations, the quality of generative models is significantly increasing, and the

zero-shot capabilities have improved considerably compared to a few years ago. Although

decoder-only models provide good results, the current consensus is that, provided the

resources, fine-tuning a model for your specific task and domain will work better than using

an out-of-the-box pre-trained model. For example, if you want to use a GPT model in

real-time in a game to generate character dialogs, it will usually perform better if you first

fine-tune it with similar data. If you want to use a model to extract different entities from

your dataset of chemistry papers, it might make sense first to fine-tune an encoder-based

model with chemistry papers to achieve this.

The success of seq2seq models is a consequence of their capability to encode

variable-length input sequences into an embedding, which summarizes the input information.

The decoder part of the model can then leverage the context for performing the generation.

Decoder-only models have gained interest in recent years thanks to their simplicity,
scalability, efficiency, and parallelization. The three types of models are widely used in the

industry depending on the task - no single golden model is used for everything.

With over half a million open models, you might wonder which one to use. Chapter 5 will

help you navigate this landscape, providing guidelines on how to choose the right model for

your task and requirements as well as how to fine-tune a model for your specific needs.

Limitations

At this point, you might wonder what the issues are with transformers. Let’s briefly go over

some of the limitations:

● Transformers are very large. Research has consistently shown that larger

models perform better. Although that’s quite exciting, it also brings concerns.

First, some of the most powerful models require tens of millions of U.S. dollars

to train - just in computing power. That means that only a small set of institutions

can train very large base models, limiting the kind of research that institutions

without those resources can do. Second, using such amounts of computing power

can also have ecological implications - those millions of GPU hours are, of

course, powered by lots of electricity. Third, even if some of these models are

open-sourced, running them might require many GPUs. Chapter 5 will explore
some techniques to use these LLMs even if you don’t have multiple GPUs at

home. Even then, deploying them in resource-constrained environments is a

frequent challenge.

● Sequential processing: If you recall the decoder section, we had to process all the

previous tokens for each new token. That means generating the 10,000th token in

a sequence will take considerably longer than generating the initial one. In

computer science terms, transformers have quadratic time complexity with respect

to the input length. This means that as the length of the input increases, the time

taken for processing grows quadratically, making it challenging to scale them to

very long documents or use these models in some real-time scenarios. While

transformers excel in many tasks, their computational demands require careful

consideration and optimization when being used in production. That said, there

has been a lot of research on making transformers more efficient for extremely

long sequences.

● Fixed input size: Transformer models can handle a maximum number of tokens,

which depends on the base model. Some transformers can only handle 512 tokens,

while new techniques allow scaling to hundreds of thousands of tokens. The number of tokens the model can attend to is called the context window. This is an essential

thing to look into when picking a pre-trained model. You cannot simply pass

entire books to transformers, expecting they will be able to summarize them.

● Limited interpretability: Transformers are often criticized for their lack of

interpretability.

All of the above are very active research areas - people have been exploring how to train and

run models with less computing power (e.g., QLoRA, which we’ll explore in Chapter 5),

make generation faster (e.g., flash attention and assisted generation), enable unconstrained

input sizes (e.g., RoPE and attention sinks), and interpret the attention mechanisms.
One big concern that deserves closer attention is the presence of biases in models. If the training data used to pre-train transformers contains biases, the model can learn and perpetuate them. This is a broader issue in machine learning but is also relevant to transformers. Let's revisit the fill-mask pipeline and ask it to predict the most likely profession. As you can see below, the results are very different if we use the word "man" vs. "woman".

unmasker = pipeline("fill-mask", model="bert-base-uncased")

result = unmasker("This man works as a [MASK] during summer.")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK] during summer.")
print([r["token_str"] for r in result])

['farmer', 'carpenter', 'gardener', 'fisherman', 'miner']
['maid', 'nurse', 'servant', 'waitress', 'cook']

Why does this happen? To enable pre-training, researchers usually require large amounts of

data, leading to scraping all the content they can find. This content might be of all kinds of

quality, including toxic content (which can be, to some extent, filtered out). The base model can end up internalizing these biases and perpetuating them even after being fine-tuned. Similar

concerns exist for conversational models, where the final model might generate toxic content

learned from the pre-training dataset.

Beyond Text

Transformers have been used for many tasks where the data can be represented as text. A clear example is

code generation – rather than training a language model with English data, we can use lots of

code, and, by the same principles we just learned, it will learn how to auto-complete code.

Another example is using transformers to answer questions from a table, such as a

spreadsheet.
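For instance, the transformers library exposes a table-question-answering pipeline. Here is a rough sketch; the TAPAS checkpoint is just one example from the Hub (and may need extra dependencies), and the toy table is made up for illustration.

from transformers import pipeline

# A hypothetical toy table; TAPAS expects every cell value as a string.
table = {
    "Month": ["January", "February", "March"],
    "Sales": ["120", "200", "340"],
}

table_qa = pipeline(
    "table-question-answering", model="google/tapas-base-finetuned-wtq"
)
print(table_qa(table=table, query="Which month had the highest sales?"))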

As transformer models have been so successful in the text domain, considerable interest has been sparked in other communities to adapt these techniques to other modalities. This has led to transformer models being used for tasks such as image recognition, segmentation, object

detection, video understanding, and more.

Convolutional Neural Networks (CNNs) have long been the go-to state-of-the-art models for most computer vision tasks. With the introduction of Vision Transformers (ViT)11, there has been a shift in recent years toward tackling vision tasks with attention and transformer-based techniques. The core idea is to split an image into fixed-size, non-overlapping patches that can be treated similarly to a sequence of tokens, so the attention mechanism can learn the relationships between patches in different parts of the image. ViTs don't necessarily discard CNNs entirely: in hybrid variants, a CNN first extracts feature maps that capture edges, textures, and other patterns, and those feature maps - rather than the raw image - are divided into patches.
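In practice, you can try a ViT with the same pipeline API we used for text. Here is a rough sketch; the checkpoint name and image path are only illustrative.

from transformers import pipeline

# Image classification backed by a ViT checkpoint from the Hub
image_classifier = pipeline(
    "image-classification", model="google/vit-base-patch16-224"
)
# predictions = image_classifier("path/to/your_image.jpg")
# Each prediction is a dict containing a label and a score.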

Unfortunately, ViTs required more data (300 million images!) and compute than CNNs to get

good results. Further work has happened in recent years; for example, DeiT was able to use

transformer-based models with mid-sized datasets (1.2M images) thanks to using

augmentation and regularization techniques common in CNNs. DeiT also uses a distillation

approach involving a "teacher" model (a CNN in this case). Other models such as DETR,

SegFormer, and Swin Transformer have pushed the field further, supporting many tasks such

as image classification, object detection, image segmentation, video classification, document

understanding, image restoration, super-resolution, and others.

As we’ll see in Chapter 9, transformer models can also be used for audio tasks, such as

transcribing audio or generating synthetic speech or music. Under the hood, the same

fundamental principles of pre-training and attention mechanisms persist, but each modality

has different data types, requiring different approaches and modifications.
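As a small teaser of what Chapter 9 covers, here is a minimal sketch of transcribing an audio file with a speech-recognition pipeline; the Whisper checkpoint and file name are only illustrative.

from transformers import pipeline

# Automatic speech recognition with a small Whisper checkpoint from the Hub
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
# transcription = asr("path/to/recording.wav")
# print(transcription["text"])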


Other modalities where transformers are being explored are:

● Graphs: An excellent introductory read is Introduction to Graph Machine Learning by Fourrier, 2023.12 Using transformers for graphs is still very exploratory, but there are some exciting early results. Some examples of tasks that involve graph data include predicting the toxicity of molecules, predicting the evolution of systems, and generating new plausible molecules.

● 3D data: For example, performing segmentation of data that can be represented in 3D, such as LiDAR point clouds in autonomous driving or CT scans for organ segmentation. Another example is estimating an object’s 6 degrees of freedom, which can be helpful in robotics applications.

● Time series: Analyzing stock prices or performing weather forecasting.

● Multimodal: Some transformer models are designed to process or output multiple types of data (such as text, images, and audio) together. This opens new possibilities, such as multimodal systems where you can speak, write, or provide pictures and have a single model process them. Another example is visual question answering, where a model can answer questions about a provided image (see the short sketch after this list).
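As a rough sketch of the visual question answering case, assuming the ViLT checkpoint below and an illustrative image path and question:

from transformers import pipeline

# Visual question answering with a ViLT checkpoint from the Hub
vqa = pipeline(
    "visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa"
)
# answers = vqa(image="path/to/photo.jpg", question="What color is the car?")
# Each entry in the result contains an answer string and a score.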

Project Time: Using LMs to generate text
We used the generate() method in the generation section to perform different decoding techniques. To better understand how it works under the hood, it's time to implement it ourselves from scratch, using generate() only as a reference for the expected behavior.

Your goal is to fill in the code in the following function. Rather than using gpt2.generate(),

the idea is to iteratively call gpt2(), passing the previous tokens as input. You have to

implement greedy search when do_sample=False, sampling when do_sample=True,

and Top-K sampling when do_sample=True and top_k is not None.

def generate(
    model, tokenizer, input_ids, max_length=50,
    do_sample=False, top_k=None
):
    """Generate a sequence without using model.generate()

    Args:
        model: The model to use for generation.
        tokenizer: The tokenizer to use for generation.
        input_ids: The input IDs.
        max_length: The maximum length of the sequence.
            Defaults to 50.
        do_sample: Whether to use sampling. Defaults to False.
        top_k: The number of tokens to sample from. Defaults
            to None.
    """
    # Write your code here
    # Begin with the simplest approach, greedy decoding.
    # Then add sampling and finally top-k sampling.
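If you get stuck, here is a minimal sketch of a single greedy step; it is only a hint, not the full solution, and it assumes a causal language model such as the GPT-2 model used throughout this chapter.

import torch

# One greedy step: run the model on the tokens generated so far and
# append the token with the highest probability.
with torch.no_grad():
    outputs = model(input_ids)
next_token_logits = outputs.logits[:, -1, :]  # logits for the last position
next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
input_ids = torch.cat([input_ids, next_token], dim=-1)

For sampling, you would instead draw the next token with torch.multinomial over the softmax of the logits, and for Top-K sampling you would first keep only the top_k largest logits (torch.topk can help) before sampling.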

Summary

Congratulations! You have now learned to load and use transformers for various tasks! You also saw how transformers model sequence data such as text and how this property

lets them "learn" valuable representations that we can use to generate or classify new

sequences. As the scale of these models increases, so do their capabilities - to the point where

massive models with hundreds of billions of parameters can now perform many tasks

previously thought impossible for computers.

We can pick powerful existing pre-trained models and modify them for specific domains and

use cases thanks to fine-tuning. The trend towards larger and more capable models has

caused a shift in how people use them. Task-specific models are often out-competed by

general-purpose LLMs, and most people now interact with these models via APIs and hosted

solutions or directly via slick chat-based user interfaces. At the same time, thanks to the

release of large and powerful open-access models, such as Llama, there is a strong push among researchers and practitioners to run high-quality models directly on consumer hardware, resulting in privacy-first solutions. This trend extends beyond

inference: novel training approaches that allow individuals to fine-tune these models without

many computational resources have emerged in recent years. Chapter 5 delves into both

traditional and novel fine-tuning techniques.

Although we covered how transformers work and will dive into their training, exploring the internals of these models (for example, the math behind attention mechanisms) or how to pre-train a model from scratch is outside the scope of this book. Luckily for us, there are

excellent resources to learn about this:

● The Illustrated Transformer by Jay Alammar is a beautiful visual guide that

explains transformers in a detailed and intuitive way.

● We recommend reading the Natural Language Processing with Transformers book

if you want to dive deeper into the internals of fine-tuning these models for

multiple specific tasks.

● Hugging Face has a free, open-source course which teaches how to solve different

NLP tasks.

If you want to dive deeper into the GPT family of models, we suggest reviewing the following

papers:

● Improving Language Understanding by Generative Pre-Training. This is the

original GPT paper, published in 2018 by Alec Radford, Karthik Narasimhan, Tim

Salimans, and Ilya Sutskever. It introduced the idea of using a Transformer-based

model pre-trained on a large corpus of text to learn general language

representations and then fine-tuning it on specific downstream tasks. The paper


also showed that the GPT model achieved state-of-the-art results on several

natural language understanding benchmarks at the time.

● Language Models are Unsupervised Multitask Learners, published in 2019 by

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya

Sutskever. It presented GPT-2, a Transformer-based model with 1.5 billion

parameters pre-trained on a large corpus of web text called WebText. The paper

also demonstrated that GPT-2 could perform well on various natural language

tasks without fine-tuning, such as text generation, summarization, translation,

reading comprehension, and commonsense reasoning. Finally, it discussed

large-scale language models’ potential ethical and social implications.

● Language Models are Few-Shot Learners, published in 2020 by Tom B. Brown

and others. This paper shows that scaling up language models dramatically

improves their ability to perform new language tasks from only a few examples or

simple instructions without fine-tuning or gradient updates. The paper also

presents GPT-3, an autoregressive language model with 175 billion parameters,

which achieves strong performance on many NLP datasets and tasks.

Exercises

1. What’s the role of the attention mechanism in text generation?

2. In which cases would a character-based tokenizer be preferred?

3. What happens if you use a tokenizer different from the one used with the model?

4. What’s the risk of using no_repeat_ngram_size when doing generation?

(hint: think of city names)

5. What would happen if you combine Beam-search and sampling?


6. Imagine you’re using an LLM that generates code in a code editor by doing

sampling. What would be more convenient? A low temperature or a high

temperature?

7. What’s the importance of fine-tuning, and how is it different from zero-shot

generation?

8. Explain the difference and application of encoder, decoder, and encoder-decoder

transformers.

Challenges

9. Use a summarization model (you can do pipeline("summarization")) to

generate summaries of a paragraph. How does it compare with the results of using

zero-shot? Can it be beaten by providing few-shot examples?

10. In the zero-shot supplementary material, we calculate the confusion matrix using

zero-shot classification. Explore using the

distilbert-base-uncased-finetuned-sst-2-english encoder

model that can do sentiment analysis. What results do you get?

11. Let’s build a FAQ system! Sentence transformers are powerful models that can

determine how similar multiple texts are. While the transformer encoder usually

outputs an embedding for each token, sentence transformers output an embedding

for the whole input text, allowing us to determine if the two texts are similar based

on their similarity score. Let’s look at a simple example using the

sentence_transformers library.

from sentence_transformers import SentenceTransformer, util

sentences = ["I'm happy", "I'm full of happiness"]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Compute an embedding for each sentence
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

# Cosine similarity between the two embeddings
util.pytorch_cos_sim(embedding_1, embedding_2)

tensor([[0.6003]], device='cuda:0')

Write a list of five questions and answers about a topic. Your goal will be to build a system

that, given a new question, can give the user the most likely answer. How can we use

sentence transformers to solve this? The supplemental material contains the solution, but, although this is challenging, we suggest trying it yourself before looking there!

References

1. Brown, Tom B., et al. Language Models Are Few-Shot Learners. arXiv, 22 July

2020. arXiv.org, http://arxiv.org/abs/2005.14165

2. Devlin, Jacob, et al. BERT: Pre-Training of Deep Bidirectional Transformers for

Language Understanding. arXiv, 24 May 2019. arXiv.org,

http://arxiv.org/abs/1810.04805

3. Dosovitskiy, Alexey, et al. An Image Is Worth 16x16 Words: Transformers for

Image Recognition at Scale. arXiv, 3 June 2021. arXiv.org,

http://arxiv.org/abs/2010.11929

4. Fourrier, Clémentine. "Introduction to Graph Machine Learning." Hugging Face Blog, https://huggingface.co/blog/intro-graphml

5. Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv, 14

Feb. 2020. arXiv.org, http://arxiv.org/abs/1904.09751


6. Howard, Jeremy, and Sebastian Ruder. Universal Language Model Fine-Tuning

for Text Classification. arXiv, 23 May 2018. arXiv.org,

http://arxiv.org/abs/1801.06146

7. Lewis, Mike, et al. BART: Denoising Sequence-to-Sequence Pre-Training for

Natural Language Generation, Translation, and Comprehension. arXiv, 29 Oct.

2019. arXiv.org, http://arxiv.org/abs/1910.13461

8. Radford, Alec, et al. "Language models are unsupervised

multitask learners." OpenAI blog 1, no. 8 (2019): 9.

9. Radford, Alec, et al. "Improving language understanding by

generative pre-training." (2018).

10. Raffel, Colin, et al. Exploring the Limits of Transfer Learning with a Unified

Text-to-Text Transformer. arXiv, 28 July 2020. arXiv.org,

http://arxiv.org/abs/1910.10683

11. T. Lan, “Generating human-level text with contrastive search in Transformers,”

Hugging Face Blog, https://huggingface.co/blog/introducing-csearch

12. Vaswani, Ashish, et al. Attention Is All You Need. arXiv, 1 Aug. 2023. arXiv.org,

http://arxiv.org/abs/1706.03762

13. Yang, Jingfeng, et al. Harnessing the Power of LLMs in Practice: A Survey on

ChatGPT and Beyond. arXiv, 27 Apr. 2023. arXiv.org,

http://arxiv.org/abs/2304.13712

1 An excellent deep dive into contrastive search is the "Generating Human-level

Text with Contrastive Search" blog post

(https://huggingface.co/blog/introducing-csearch).
2 The first example in the GPT-2 release blog post was famously a news story about unicorns

(https://openai.com/research/better-language-models).

3 Vaswani et al. "Attention is all you need." Advances in neural information

processing systems 30 (2017).

4 Lewis et al. "BART: Denoising Sequence-to-Sequence Pre-training

for Natural Language Generation, Translation, and

Comprehension"

5 Raffel, Colin, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text

Transformer. arXiv, 28 July 2020. arXiv.org, http://arxiv.org/abs/1910.10683

6 Devlin, Jacob, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language

Understanding. arXiv, 24 May 2019. arXiv.org, http://arxiv.org/abs/1810.04805

7 Yang, Jingfeng, et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT

and Beyond. arXiv, 27 Apr. 2023. arXiv.org, http://arxiv.org/abs/2304.13712

8 Howard, Jeremy, and Sebastian Ruder. Universal Language Model Fine-Tuning for Text

Classification. arXiv, 23 May 2018. arXiv.org, http://arxiv.org/abs/1801.06146


9 This oversimplifies how semantic search works, but we’ll get a chance to build a simple

search system using semantic embeddings in the challenge section of this chapter.

10 DistilBERT is a smaller model that preserves 95% of the original BERT performance while

having 40% fewer parameters. RoBERTa is a very powerful BERT-based model trained with different hyperparameters and for longer.

11 Dosovitskiy, Alexey, et al. An Image Is Worth 16x16 Words: Transformers for Image

Recognition at Scale. arXiv, 3 June 2021. arXiv.org, http://arxiv.org/abs/2010.11929

12 The "Introduction to Graph Machine Learning" blog post

(https://huggingface.co/blog/intro-graphml) is a great resource to jump into the topic.
