Transformer Architecture Explained
medium.com/@amanatulla1606/transformer-architecture-explained-2c49e2257b4c
Amanatullah
Transformers are a recent development in machine learning that has been attracting a lot of
attention lately. They are incredibly good at keeping track of context, which is why the text
that they write makes sense. In this chapter, we will go over their architecture and how
they work.
Transformer models are one of the most exciting new developments in machine learning.
They were introduced in the paper Attention is All You Need. Transformers can be used to
write stories, essays, poems, answer questions, translate between languages, chat with
humans, and they can even pass exams that are hard for humans! But what are they?
You’ll be happy to know that the architecture of transformer models is not that complex; it
is simply a concatenation of several very useful components, each of which has its own
function. In this chapter, you will learn about all of these components.
In a nutshell, what does a transformer do? Imagine that you’re writing a text message on
your phone. After each word, you may get three words suggested to you. For example, if
you type “Hello, how are”, the phone may suggest words such as “you”, or “your” as the
next word. Of course, if you continue selecting the suggested word in your phone, you’ll
quickly find that the message formed by these words makes no sense. If you look at each
set of 3 or 4 consecutive words, it may make sense, but these words don’t concatenate to
anything with a meaning. This is because the model used in the phone doesn’t carry the
overall context of the message, it simply predicts which word is more likely to come up
after the last few. Transformers, on the other hand, keep track of the context of what is
being written, and this is why the text that they write makes sense.
The phone can suggest the next word to use in a text message, but does not have the
power to generate coherent text.
I have to be honest with you, the first time I found out that transformers build text one
word at a time, I couldn’t believe it. First of all, this is not how humans form sentences
and thoughts. We first form a basic thought, and then start refining it and adding words to
it. This is also not how ML models do other things. For example, images are not built this
way. Most neural network based graphical models form a rough version of the image, and
slowly refine it or add detail until it is perfect. So why would a transformer model build text
word by word? One answer is, because that works really well. A more satisfying one is
that because transformers are so incredibly good at keeping track of the context, that the
next word they pick is exactly what it needs to keep going with an idea.
And how are transformers trained? With a lot of data: all the text on the internet, in fact.
So when you input the sentence “Hello, how are” into the transformer, it simply knows
that, based on all the text on the internet, the best next word is “you”. If you were to give it
a more complicated command, say, “Write a story.”, it may figure out that a good next
word to use is “Once”. Then it adds this word to the command, and figures out that a
good next word is “upon”, and so on. And word by word, it will continue until it writes a
story.
Next command: Write a story. Once upon
Response: a
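The predict-append-repeat loop described above can be sketched in a few lines of Python. The lookup table below is a made-up stand-in for a trained transformer; only the mechanism (predict the next word, append it to the text, repeat) is the point.

```python
# Toy "model": maps a text so far to its most likely next word.
# A real transformer computes this from all the text it was trained on.
toy_model = {
    "Write a story.": "Once",
    "Write a story. Once": "upon",
    "Write a story. Once upon": "a",
    "Write a story. Once upon a": "time",
}

def generate(prompt, steps):
    text = prompt
    for _ in range(steps):
        next_word = toy_model.get(text)
        if next_word is None:  # no prediction available, stop
            break
        text = text + " " + next_word  # append and feed back in
    return text

print(generate("Write a story.", 4))
# Write a story. Once upon a time
```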
Now that we know what transformers do, let’s get to their architecture. If you’ve seen the
architecture of a transformer model, you may have jumped in awe like I did the first time I
saw it, it looks quite complicated! However, when you break it down into its most
important parts, it’s not so bad. The transformer has five main parts:
Tokenization
Embedding
Positional encoding
Transformer block (several of these)
Softmax
The fourth one, the transformer block, is the most complex of all. Many of these can be
concatenated, and each one contains two main parts: The attention and the feedforward
components.
Tokenization
Tokenization is the most basic step. It relies on a large vocabulary of tokens, including all
the words, punctuation signs, etc. The tokenization step takes every word, prefix, suffix,
and punctuation sign in the input, and maps it to a known token from this vocabulary.
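As a minimal sketch, here is a toy whitespace-and-punctuation tokenizer. Real transformers use learned subword schemes such as BPE, and the tiny vocabulary below is made up; the point is only the mapping from pieces of text to known token ids.

```python
import re

def tokenize(text, vocab):
    # Split the text into words and punctuation signs.
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    # Map each piece to its id in the vocabulary; unknown pieces
    # fall back to a special <unk> token.
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]

vocab = {"<unk>": 0, "write": 1, "a": 2, "story": 3, ".": 4}
print(tokenize("Write a story.", vocab))  # [1, 2, 3, 4]
```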
Embedding
Once the input has been tokenized, it’s time to turn words into numbers. For this, we use
an embedding. In a previous chapter you learned about how text embeddings send every
piece of text to a vector (a list) of numbers. If two pieces of text are similar, then the
numbers in their corresponding vectors are similar to each other (componentwise,
meaning each pair of numbers in the same position are similar). Otherwise, if two pieces
of text are different, then the numbers in their corresponding vectors are different.
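The “similar text, similar vectors” property can be checked with cosine similarity. The three-dimensional vectors below are made up purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Toy embedding table (made-up numbers): similar words get
# vectors that point in nearly the same direction.
embedding = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "cat" and "dog" are close to each other...
print(cosine_similarity(embedding["cat"], embedding["dog"]))
# ...while "cat" and "car" are not.
print(cosine_similarity(embedding["cat"], embedding["car"]))
```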
Positional encoding
Once we have the vectors corresponding to each of the tokens in the sentence, the next
step is to turn all these into one vector to process. The most common way to turn a bunch
of vectors into one vector is to add them, componentwise. That means, we add each
coordinate separately. For example, if the vectors (of length 2) are [1,2], and [3,4], their
corresponding sum is [1+3, 2+4], which equals [4, 6]. This can work, but there’s a small
caveat. Addition is commutative, meaning that if you add the same numbers in a different
order, you get the same result. In that case, the sentence “I’m not sad, I’m happy” and the
sentence “I’m not happy, I’m sad”, will result in the same vector, given that they have the
same words, except in different order. This is not good. Therefore, we must come up with
some method that will give us a different vector for the two sentences. Several methods
work, and we’ll go with one of them: positional encoding. Positional encoding consists of
adding a sequence of predefined vectors to the embedding vectors of the words. This
ensures we get a unique vector for every sentence, and sentences with the same words
in different order will be assigned different vectors. In the example below, the vectors
corresponding to the words “Write”, “a”, “story”, and “.” become the modified vectors that
carry information about their position, labeled “Write (1)”, “a (2)”, “story (3)”, and “. (4)”.
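The “sequence of predefined vectors” can be sketched concretely. The sinusoidal scheme below is the one from Attention is All You Need, and the word embeddings are made-up numbers; the takeaway is that the same words in a different order produce different vectors.

```python
import math

# Predefined positional vectors (sinusoidal scheme, dimension 4).
def positional_encoding(position, dim):
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def encode(sentence, embedding):
    # Add each word's embedding to the predefined vector
    # for the position the word occupies.
    return [
        [e + p for e, p in zip(embedding[word], positional_encoding(pos, 4))]
        for pos, word in enumerate(sentence)
    ]

embedding = {"happy": [1.0, 0.0, 0.5, 0.2], "sad": [0.0, 1.0, 0.3, 0.7]}
a = encode(["happy", "sad"], embedding)
b = encode(["sad", "happy"], embedding)
print(a != b)  # True: same words, different order -> different vectors
```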
Transformer block
Let’s recap what we have so far. The words come in and get turned into tokens
(tokenization), tokenized words are turned into numbers (embeddings), then order gets
taken into account (positional encoding). This gives us a vector for every token that we
input to the model. Now, the next step is to predict the next word in this sentence. This is
done with a really really large neural network, which is trained precisely with that goal, to
predict the next word in a sentence.
We can train such a large network, but we can vastly improve it by adding a key step: the
attention component. Introduced in the seminal paper Attention is All You Need, it is one
of the key ingredients in transformer models, and one of the reasons they work so well.
Attention is explained in the previous section, but for now, imagine it as a way to add
context to each word in the text.
The attention component is added at every block of the feedforward network. Therefore, if
you imagine a large feedforward neural network whose goal is to predict the next word,
formed by several blocks of smaller neural networks, an attention component is added to
each one of these blocks. Each component of the transformer, called a transformer block,
is then formed by two main parts: an attention component, which adds context, and a
feedforward component, which predicts the next word.
Attention
The next step is attention. As you learned, the attention mechanism deals with a very
important problem: the problem of context. Sometimes, as you know, the same word can
be used with different meanings. This tends to confuse language models, since an
embedding simply sends words to vectors, without knowing which definition of the word
they’re using.
Attention is a very useful technique that helps language models understand the context.
In order to understand how attention works, consider the following two sentences:
Sentence 1: The bank of the river.
Sentence 2: Money in the bank.
As you can see, the word ‘bank’ appears in both, but with different definitions. In sentence
1, we are referring to the land at the side of the river, and in the second one to the
institution that holds money. The computer has no idea of this, so we need to somehow
inject that knowledge into it. What can help us? Well, it seems that the other words in the
sentence can come to our rescue. For the first sentence, the words ‘the’, and ‘of’ do us no
good. But the word ‘river’ is the one that is letting us know that we’re talking about the
land at the side of the river. Similarly, in sentence 2, the word ‘money’ is the one that is
helping us understand that the word ‘bank’ is now referring to the institution that holds
money.
In short, what attention does is move the words in a sentence (or piece of text) closer
together in the word embedding. In that way, the word “bank” in the sentence “Money in the bank”
will be moved closer to the word “money”. Equivalently, in the sentence “The bank of the
river”, the word “bank” will be moved closer to the word “river”. That way, the modified
word “bank” in each of the two sentences will carry some of the information of the
neighboring words, adding context to it.
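The “moving words closer together” step can be sketched as scaled dot-product self-attention over toy two-dimensional embeddings (made-up numbers). For clarity, the queries, keys, and values here are the raw embeddings themselves; real models apply learned projections first.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    out = []
    for q in vectors:
        # Score each word against every word (dot product),
        # scaled by the square root of the embedding dimension.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in vectors]
        weights = softmax(scores)
        # Each output vector is a weighted average of all the vectors,
        # so "bank" absorbs information from "money" or "river".
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(len(q))])
    return out

# Two toy word vectors; after attention, each one has moved
# toward the other, carrying some of its neighbor's information.
print(self_attention([[0.9, 0.1], [0.5, 0.5]]))
```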
The attention step used in transformer models is actually much more powerful, and it’s
called multi-head attention. In multi-head attention, several different embeddings are used
to modify the vectors and add context to them. Multi-head attention has helped language
models reach much higher levels of efficacy when processing and generating text.
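A minimal sketch of the multi-head idea: each head projects the vectors with its own matrix before attending, and the heads' outputs are concatenated. The projection matrices below are made up (real ones are learned), and the attention itself is the same scaled dot-product mixing as before.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(vectors, proj):
    # Project each vector with this head's matrix, then mix the
    # projected vectors by their dot-product similarities.
    projected = [[sum(p * v[j] for j, p in enumerate(row)) for row in proj]
                 for v in vectors]
    out = []
    for q in projected:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in projected]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, projected))
                    for i in range(len(q))])
    return out

def multi_head_attention(vectors, projections):
    # Run several heads, each with its own projection, and
    # concatenate their outputs at every position.
    heads = [attention_head(vectors, p) for p in projections]
    return [sum((h[i] for h in heads), []) for i in range(len(vectors))]

vecs = [[0.9, 0.1], [0.5, 0.5]]
projs = [[[1.0, 0.0], [0.0, 1.0]],   # head 1: identity (made-up)
         [[0.0, 1.0], [1.0, 0.0]]]   # head 2: swaps coordinates
out = multi_head_attention(vecs, projs)
print(len(out), len(out[0]))  # 2 positions, 2 heads x 2 dims = 4
```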
Softmax
Once all the transformer blocks have been applied, the model outputs a score for each of
the words, where the highest scores are given to the words that are most likely to be next
in the sentence.
The last step of a transformer is a softmax layer, which turns these scores into
probabilities (that add to 1), where the highest scores correspond to the highest
probabilities. Then, we can sample out of these probabilities for the next word. In the
example below, the transformer gives the highest probability of 0.5 to “Once”, and
probabilities of 0.3 and 0.2 to “Somewhere” and “There”. Once we sample, the word
“Once” is selected, and that’s the output of the transformer.
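The softmax-and-sample step looks like this in code. The raw scores are hypothetical, chosen so that the probabilities come out roughly 0.5, 0.3, and 0.2 as in the example above.

```python
import math
import random

# Softmax turns raw scores into probabilities that add to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the three candidate next words.
words = ["Once", "Somewhere", "There"]
probs = softmax([1.6, 1.09, 0.68])
print([round(p, 2) for p in probs])  # [0.5, 0.3, 0.2]

# Sample the next word according to these probabilities.
next_word = random.choices(words, weights=probs)[0]
print(next_word)
```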
Now what? Well, we repeat the step. We now input the text “Write a story. Once” into the
model, and most likely, the output will be “upon”. Repeating this step again and again, the
transformer will end up writing a story, such as “Once upon a time, there was a …”.
Post-training
Now that you know how transformers work, we still have a bit of work to do. Imagine the
following: You ask the transformer “What is the capital of Algeria?”. We would love for it to
answer “Algiers”, and move on. However, the transformer is trained on the entire internet.
The internet is a big place, and it’s not necessarily the best question/answer repository.
Many pages, for example, would have long lists of questions without answers. In this
case, the next sentence after “What is the capital of Algeria?” could be another question,
such as “What is the population of Algeria?”, or “What is the capital of Burkina Faso?”.
The transformer is not a human who thinks about their responses, it simply mimics what it
sees on the internet (or any dataset that has been provided). So how do we get the
transformer to answer questions?
The answer is post-training. In the same way that you would teach a person to do certain
tasks, you can get a transformer to perform tasks. Once a transformer is trained on the
entire internet, then it is trained again on a large dataset which corresponds to lots of
questions and their respective answers. Transformers (like humans) have a bias towards
the last things they’ve learned, so post-training has proven a very useful step in helping
transformers succeed at the tasks they are asked to perform.
Post-training also helps with many other tasks. For example, one can post-train a
transformer with large datasets of conversations, in order to help it perform well as a
chatbot, or to help us write stories, poems, or even code.