AI by Hand ✍ Vol. 1
I. One Node
II. Four Nodes
III. One Hidden Layer
IV. Three Inputs
V. Seven Layers
Advanced
1. Mixture of Experts (MoE)
2. Recurrent Neural Network (RNN)
3. Mamba
4. Matrix Multiplication
5. LLM Sampling
6. MLP in PyTorch
7. Backpropagation
8. Transformer
9. Batch Normalization
10. Generative Adversarial Network (GAN)
11. Self Attention
12. Dropout
13. Autoencoder
14. Vector Database
15. CLIP
16. Residual Network (ResNet)
17. Graph Convolutional Network (GCN)
18. Sora's Diffusion Transformer (DiT)
19. Gemini 1.5's Switch Transformer
20. Reinforcement Learning from Human Feedback (RLHF)
1. Mixture of Experts (MoE)
[Figure: worked example. Inputs X1, X2 go through a gate network; taking the max over the gate scores selects Expert 1 or Expert 2, whose weights (with ReLU) produce the outputs Y1, Y2.]
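A minimal sketch of the routing idea in Python with NumPy; the sizes and weights below are illustrative, not the worksheet's values.

    import numpy as np

    def moe_forward(x, W_gate, experts):
        # Top-1 routing: the max gate score picks which expert runs.
        scores = W_gate @ x                 # one score per expert
        k = int(np.argmax(scores))          # the "Max" step in the figure
        h = experts[k] @ x                  # chosen expert's linear layer
        return np.maximum(h, 0.0)           # ReLU

    x = np.array([2.0, 3.0])                            # inputs X1, X2
    W_gate = np.array([[1.0, 0.0], [0.0, 1.0]])         # gate network, 2 experts
    experts = [np.array([[1.0, -1.0], [0.0, 1.0]]),     # Expert 1
               np.array([[-1.0, 1.0], [1.0, 1.0]])]     # Expert 2
    print(moe_forward(x, W_gate, experts))

Only the selected expert's weights are multiplied at all, which is the point of a mixture of experts: capacity grows with the number of experts while per-input compute stays constant.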
2. Recurrent Neural Network (RNN)
[Figure: worked example. Parameters A, B, C; the hidden state starts at H0 = 0 and is updated step by step from the previous state and the current input, producing the output sequence Y.]
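The recurrence fits in a few lines; a NumPy sketch assuming the worksheet's linear update with parameters A, B, C (common RNN variants wrap the state update in tanh). Values are illustrative.

    import numpy as np

    def rnn(xs, A, B, C, h0):
        # h_t = A h_{t-1} + B x_t ; y_t = C h_t
        h, ys = h0, []
        for x in xs:                 # one step per input
            h = A @ h + B @ x        # update the hidden state
            ys.append(C @ h)         # read out the output
        return ys

    A = np.array([[1.0]]); B = np.array([[1.0]]); C = np.array([[2.0]])
    h0 = np.zeros(1)                 # hidden state starts at H0 = 0
    xs = [np.array([1.0]), np.array([-1.0]), np.array([1.0])]
    print(rnn(xs, A, B, C, h0))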
3. Mamba
[Figure: worked example. Per-step (selective) parameters feed a scan that carries the hidden state across the input sequence to produce the output sequence.]
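The scan is the same recurrence as the RNN above, except the parameters change per step; a scalar sketch with illustrative values, where the per-step A_t and B_t stand in for Mamba's input-dependent ("selective") parameters.

    def selective_scan(xs, As, Bs, C):
        # h_t = A_t * h_{t-1} + B_t * x_t ; y_t = C * h_t
        h, ys = 0.0, []
        for a, b, x in zip(As, Bs, xs):
            h = a * h + b * x
            ys.append(C * h)
        return ys

    print(selective_scan(xs=[1.0, 2.0, -1.0],
                         As=[0.5, 1.0, 0.5],   # computed from the input in Mamba
                         Bs=[1.0, 1.0, 1.0],
                         C=2.0))

In the real model this scan is parallelized; the sequential loop here is only for clarity.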
4. Matrix Multiplication
[Figure: worked examples. A set of hand-computed matrix products over small 0/±1 matrices, ending with a product whose unknown matrix X must be solved for ("X = ?").]
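Each output entry is the dot product of a row of the left matrix with a column of the right matrix, exactly the rule the exercises have you apply by hand; a plain triple loop makes it explicit (illustrative matrices).

    def matmul(A, B):
        n, k, m = len(A), len(B), len(B[0])
        C = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                for t in range(k):          # row i of A times column j of B
                    C[i][j] += A[i][t] * B[t][j]
        return C

    print(matmul([[1, 1], [-1, 1]], [[1, 5], [2, 4]]))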
5. LLM Sampling
[Figure: worked example. At each position the LLM outputs a probability distribution over the vocabulary (I, you, they, are, am, how, why, where, who, what); a random number in [0, 1) (e.g. .03, .50, .40, .34, .52, .92, .65) selects the next token through the cumulative distribution.]
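Sampling works by walking the cumulative distribution until it passes the random number; a sketch, with probabilities loosely based on the worksheet's first column.

    def sample(probs, u):
        # Return the token whose cumulative-probability interval contains u.
        c = 0.0
        for token, p in probs:
            c += p
            if u < c:
                return token
        return probs[-1][0]          # guard against rounding error

    vocab = [("how", 0.50), ("why", 0.10), ("where", 0.10), ("who", 0.15),
             ("what", 0.10), ("I", 0.01), ("you", 0.01), ("they", 0.01),
             ("are", 0.01), ("am", 0.01)]
    print(sample(vocab, 0.03))       # 0.03 falls inside "how"'s interval

Because the intervals have the tokens' probabilities as widths, a uniform random number picks each token with exactly its assigned probability.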
6. MLP in PyTorch
[Figure: worked example. A fill-in-the-blanks nn.Sequential listing (exercise lines 3-8): each blank is a module name (nn.______()), and the linear layers also need their sizes and bias flags (nn.______( __, __, bias = __ )). The drawn network interleaves linear layers with ReLU and ends in a sigmoid (σ); hints are provided.]
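One plausible way to fill in the blanks, assuming the drawn network is Linear → ReLU → Linear → ReLU → Linear → Sigmoid; the layer sizes here are read off the worksheet loosely (a 4-value input X, two hidden layers) and should be checked against the drawing.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(4, 3, bias=True),   # sizes and bias flags fill the blanks
        nn.ReLU(),
        nn.Linear(3, 2, bias=True),
        nn.ReLU(),
        nn.Linear(2, 2, bias=True),
        nn.Sigmoid(),                 # squashes outputs into (0, 1)
    )

    x = torch.tensor([2.0, 1.0, 3.0, 1.0])   # the worksheet's input X
    print(model(x))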
7. Backpropagation
[Figure: worked example. The input X (2, 1, 3, 1) flows through Layer 1 (ReLU), Layer 2 (ReLU), and Layer 3 (Softmax) to a prediction YPred, which is compared with the one-hot target YTarget under the cross-entropy loss L; the exercise backpropagates the loss.]
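The fact the exercise builds toward: with a softmax output and cross-entropy loss, the gradient at the logits collapses to YPred − YTarget. A NumPy check with illustrative numbers:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())      # subtract the max for numerical stability
        return e / e.sum()

    z = np.array([2.0, 0.0, -1.0])             # Layer 3 outputs (logits)
    y_pred = softmax(z)
    y_target = np.array([1.0, 0.0, 0.0])       # one-hot target
    loss = -np.sum(y_target * np.log(y_pred))  # cross-entropy loss L
    dL_dz = y_pred - y_target                  # gradient at the logits
    print(loss, dL_dz)

From there, each layer's weight gradient is its input times the incoming gradient, and ReLU simply zeroes the gradient wherever its input was negative.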
8. Transformer
[Figure: worked example. Q and K give an attention weight matrix A; A mixes the token features X1..X5 into attention-weighted features Z1..Z5; a position-wise feed-forward network (linear, ReLU, linear) then transforms each position independently before the next block.]
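"Position-wise" means the same small network runs on every token column separately; only the attention step mixes information across positions. A NumPy sketch with illustrative weights:

    import numpy as np

    def position_wise_ffn(Z, W1, b1, W2, b2):
        # Linear -> ReLU -> Linear, applied to each token (column) of Z.
        return W2 @ np.maximum(W1 @ Z + b1, 0.0) + b2

    Z = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0]])                 # 2 features x 3 tokens
    W1 = np.array([[1.0, -1.0], [0.0, 1.0]]); b1 = np.zeros((2, 1))
    W2 = np.array([[1.0, 1.0]]);              b2 = np.zeros((1, 1))
    print(position_wise_ffn(Z, W1, b1, W2, b2))

The attention step itself is worked in detail under item 11 below.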
9. Batch Normalization
[Figure: worked example. A mini-batch X1..X4 passes through a linear layer and ReLU; batch statistics are computed per feature (sum Σ, mean µ, variance σ², standard deviation σ), and the activations are normalized (subtract µ, divide by σ) before the next layer.]
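The Σ → µ → σ² → σ → normalize pipeline in a few NumPy lines (the learnable scale and shift that follow in a full BatchNorm layer are omitted here):

    import numpy as np

    def batch_norm(H, eps=1e-5):
        # Statistics per feature (row), across the mini-batch (columns).
        mu = H.mean(axis=1, keepdims=True)       # mean µ
        var = H.var(axis=1, keepdims=True)       # variance σ²
        return (H - mu) / np.sqrt(var + eps)     # subtract µ, divide by σ

    H = np.array([[1.0, 0.0, 3.0, 0.0],
                  [0.0, 3.0, 1.0, 1.0]])         # 2 features x 4 examples
    print(batch_norm(H))

The eps term guards against division by zero when a feature has no variance in the batch.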
10. Generative Adversarial Network (GAN)
[Figure: worked example. The generator (ReLU layers) turns noise into fakes F1..F4, shown next to reals X1..X4; the discriminator (ReLU layers, then a sigmoid) scores both, giving predictions Y. Training the discriminator uses targets YD = 0 (fakes) and 1 (reals) with loss gradients ∂LD/∂Z; training the generator uses targets YG = 1 on the fakes, with loss gradients ∂LG/∂Z flowing back through the discriminator into the generator.]
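The two training passes differ only in which samples and targets are used; a minimal PyTorch sketch with illustrative sizes (a real loop would also zero gradients and step two optimizers):

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 4))  # generator
    D = nn.Sequential(nn.Linear(4, 4), nn.ReLU(),
                      nn.Linear(4, 1), nn.Sigmoid())                # discriminator
    bce = nn.BCELoss()

    fake = G(torch.randn(4, 2))      # F1..F4 from random noise
    real = torch.randn(4, 4)         # stand-in for the real samples X1..X4

    # Discriminator pass: fakes -> 0, reals -> 1. Detaching the fakes keeps
    # this step from updating the generator.
    d_loss = bce(D(fake.detach()), torch.zeros(4, 1)) \
           + bce(D(real), torch.ones(4, 1))
    d_loss.backward()

    # Generator pass: the generator wants D to output 1 on its fakes.
    g_loss = bce(D(fake), torch.ones(4, 1))
    g_loss.backward()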
11. Self Attention
[Figure: worked example. WK maps the inputs to keys k1..k4; the query-key scores are scaled, exponentiated (e^x), and divided by their row sums (Softmax) to form the attention weight matrix A; a MatMul of A with the values v1..v4 from WV yields the attention-weighted features z1..z4, which feed the FFN.]
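The full scale → exponentiate → normalize → matmul pipeline from the figure, as a NumPy sketch (illustrative weights; single head, no masking):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = Wq @ X, Wk @ X, Wv @ X               # queries, keys, values
        S = Q.T @ K / np.sqrt(K.shape[0])              # Scale
        E = np.exp(S - S.max(axis=1, keepdims=True))   # e^x (stabilized)
        A = E / E.sum(axis=1, keepdims=True)           # divide by Σ -> Softmax
        return V @ A.T                                 # MatMul: weighted features

    X = np.array([[1.0, 0.0], [0.0, 2.0]])             # 2 features x 2 tokens
    I = np.eye(2)
    print(self_attention(X, I, I, I))

Each row of A sums to 1, so every output z is a convex combination of the value vectors.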
12. Dropout
[Figure: worked example. A random sequence (.61, .39, .75, .40, ...) decides which activations to drop. During training, Dropout(p=0.5) and Dropout(p=0.33) zero activations after each ReLU; at inference the dropout layers are disabled and the same network runs on unseen data. Outputs Y are compared with targets Y' under MSE loss, and the gradients ∂L/∂Y are backpropagated.]
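Training vs. inference in one function; a sketch of inverted dropout (the PyTorch convention), where the worksheet's random sequence plays the role of the mask draws:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p, training):
        # Training: zero each activation with probability p, rescale the rest.
        # Inference: identity, as in the worksheet's unseen-data column.
        if not training:
            return h
        mask = (rng.random(h.shape) >= p).astype(h.dtype)
        return h * mask / (1.0 - p)

    h = np.array([1.0, 3.0, 5.0, 3.0])
    print(dropout(h, 0.5, training=True))
    print(dropout(h, 0.5, training=False))

The 1/(1−p) rescaling keeps the expected activation the same in both modes, which is why inference can simply skip the layer.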
13. Autoencoder
[Figure: worked example. An encoder (linear + ReLU) compresses the input X down to a bottleneck; a decoder (linear + ReLU) expands it back to outputs Y; the reconstruction is compared with targets Y' under MSE loss, and the gradients ∂L/∂Y are backpropagated.]
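A minimal PyTorch sketch; the 5-3-2-3-5 layer sizes are illustrative, not the worksheet's:

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(5, 3), nn.ReLU(),
                            nn.Linear(3, 2), nn.ReLU())   # down to the bottleneck
    decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(),
                            nn.Linear(3, 5))              # back up to input size

    x = torch.randn(4, 5)                       # a batch of 4 inputs
    y = decoder(encoder(x))                     # reconstruction Y
    loss = nn.functional.mse_loss(y, x)         # reconstruction (MSE) loss
    loss.backward()                             # gradients flow decoder -> encoder

Since the targets are the inputs themselves, no labels are needed; the bottleneck forces the network to learn a compressed representation.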
14. Vector Database
[Figure: worked example. Word embeddings for a small vocabulary (a, an, the, how, why, who, what, are, is, am, be, was, you, we, I, they, she, he, me, him, her) feed an encoder (linear & ReLU); mean pooling and a projection produce one vector per text, which is indexed into vector storage; queries such as "big table" and "mini chair" are matched against stored texts like "top hat", "table top", and "big chair".]
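The store-then-search loop in miniature; the embeddings below are illustrative, and the projection layer is folded into the pooling for brevity:

    import numpy as np

    word_vecs = {"big": np.array([2.0, 0.0]), "mini": np.array([-2.0, 0.0]),
                 "table": np.array([0.0, 2.0]), "chair": np.array([0.0, -2.0]),
                 "top": np.array([1.0, 1.0]), "hat": np.array([-1.0, 1.0])}

    def embed(text):
        # Mean-pool the word embeddings of a text into one vector.
        return np.mean([word_vecs[w] for w in text.split()], axis=0)

    # Indexing: one stored vector per text.
    storage = {t: embed(t) for t in ["table top", "big chair", "top hat"]}

    def search(query):
        # Return the stored text whose vector best matches the query vector.
        q = embed(query)
        return max(storage, key=lambda t: storage[t] @ q)

    print(search("big table"))

Real vector databases replace the linear max with an approximate nearest-neighbor index so search stays fast at millions of vectors.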
15. CLIP
[Figure: worked example. A text encoder produces text features and an image encoder produces image features; separate projection layers map both into a shared embedding space, where matching text-image pairs should score highest.]
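The shared-space comparison at CLIP's core; a sketch where the projections are assumed already applied and the feature values are illustrative:

    import numpy as np

    def clip_scores(text_feats, image_feats):
        # L2-normalize both sides, then take all pairwise dot products;
        # the diagonal holds the matching text-image pairs.
        T = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        I = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
        return T @ I.T

    texts  = np.array([[1.0, 0.0], [0.0, 1.0]])    # 2 projected text vectors
    images = np.array([[0.9, 0.1], [0.2, 0.8]])    # 2 projected image vectors
    print(clip_scores(texts, images))              # diagonal should dominate

Training pushes the diagonal scores up and the off-diagonal scores down (a contrastive loss), which is what aligns the two encoders.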
16. Residual Network (ResNet)
[Figure: worked example. The figure traces residual connections through a block: attention over Q and K, then Add & Norm (the block's input is added back before normalizing), then a feed-forward layer and a second Add & Norm before the next block.]
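The Add & Norm step in isolation; a NumPy sketch with illustrative vectors:

    import numpy as np

    def add_and_norm(x, sublayer_out, eps=1e-5):
        # Residual: the block's input skips around the sublayer and is added
        # back, so the sublayer only has to learn a correction to x.
        h = x + sublayer_out                     # Add
        return (h - h.mean()) / (h.std() + eps)  # Norm

    x = np.array([2.0, 0.0, 1.0])                # block input
    f = np.array([1.0, 3.0, -2.0])               # sublayer (e.g. FFN) output
    print(add_and_norm(x, f))

Because gradients flow through the identity path untouched, residual connections are what let very deep networks train at all.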
17. Graph Convolutional Network (GCN)
[Figure: worked example. A five-node graph (A, B, C, D, E) is encoded as an adjacency matrix; each graph-convolution layer multiplies the adjacency matrix with the node features and a weight matrix, passing messages between neighbors through ReLU; a fully connected network then produces the per-node outputs.]
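One layer of message passing in NumPy; the 3-node graph and weights are illustrative, not the worksheet's:

    import numpy as np

    def gcn_layer(A, X, W):
        # A @ X sums each node's neighbors' features (the "messages");
        # W transforms them; ReLU adds the nonlinearity.
        return np.maximum(A @ X @ W, 0.0)

    A = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]], dtype=float)   # adjacency with self-loops
    X = np.array([[1.0], [0.0], [2.0]])      # one feature per node
    W = np.array([[1.0]])
    print(gcn_layer(A, X, W))

Stacking layers lets information travel further: after k layers each node has seen its k-hop neighborhood, after which a fully connected head makes the per-node predictions.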
18. Sora's Diffusion Transformer (DiT)
[Figure: worked example. A pointwise FFN predicts the noise in the latent patches; the predicted noise is compared with the actual noise under MSE loss and the gradients flow back; subtracting the predicted noise recovers the generated video.]
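The training and generation arithmetic in one function; a one-step sketch with illustrative numbers (real diffusion removes noise gradually over many steps):

    import numpy as np

    def dit_step(noisy, predicted_noise, true_noise):
        loss = np.mean((predicted_noise - true_noise) ** 2)          # MSE loss
        grad = 2 * (predicted_noise - true_noise) / true_noise.size  # dL/dpred
        denoised = noisy - predicted_noise       # toward the generated video
        return loss, grad, denoised

    noisy = np.array([1.0, -2.0, 0.5])   # noisy latent patches
    pred  = np.array([0.8, -1.5, 0.0])   # the DiT's predicted noise
    true  = np.array([1.0, -1.0, 0.0])   # the noise actually added
    print(dit_step(noisy, pred, true))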
19. Gemini 1.5's Switch Transformer
[Figure: worked example. An attention block (Q, K, attention weight matrix A) mixes the features X1..X5 into Z1..Z5; a switch computes gate values for Experts A, B, C, and each token is routed to the single expert with the highest gate value, so only one of the three position-wise FFNs runs per token.]
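Top-1 routing per token, which is what distinguishes the switch layer from the dense FFN in item 8; a NumPy sketch with illustrative weights:

    import numpy as np

    def switch_ffn(Z, W_gate, experts):
        # Each token (column) runs through exactly one expert FFN, chosen
        # by its highest gate value.
        out = np.zeros_like(Z)
        gates = W_gate @ Z                    # gate value per expert, per token
        for t in range(Z.shape[1]):
            k = int(np.argmax(gates[:, t]))   # pick Expert A/B/C
            out[:, t] = np.maximum(experts[k] @ Z[:, t], 0.0)
        return out

    Z = np.array([[1.0, 0.0], [0.0, 2.0]])                   # 2 features x 2 tokens
    W_gate = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 experts
    experts = [np.eye(2), -np.eye(2), 2 * np.eye(2)]
    print(switch_ffn(Z, W_gate, experts))

As with the MoE in item 1, compute per token stays constant no matter how many experts exist.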
20. Reinforcement Learning from Human Feedback (RLHF)
[Figure: worked example. Word embeddings for a small vocabulary (him, her, them, is/are, doc, CEO, and the stop token [S]) feed a network of ReLU layers that scores a completion with a single scalar reward.]
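A tiny reward model in NumPy; the mean pooling and all values below are illustrative assumptions, not the worksheet's:

    import numpy as np

    def reward_model(token_embs, W1, w2):
        # Pool the completion's token embeddings, apply a ReLU layer,
        # then read out one scalar reward.
        h = np.maximum(W1 @ token_embs.mean(axis=0), 0.0)
        return float(w2 @ h)

    embs = np.array([[1.0, 1.0, 0.0],      # e.g. "her"
                     [0.0, 1.0, 0.0],      # "is"
                     [2.0, 0.0, -1.0]])    # "CEO"
    W1 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
    w2 = np.array([1.0, -1.0])
    print(reward_model(embs, W1, w2))

In full RLHF the reward model is trained on human preference comparisons, and its scalar output then drives a policy-gradient update (e.g. PPO) of the language model.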