AI by Hand ✍ Vol. 1
I. One Node
II. Four Nodes
III. One Hidden Layer
IV. Three Inputs
V. Seven Layers
Advanced
1. Mixture of Experts (MoE)
2. Recurrent Neural Network (RNN)
3. Mamba
4. Matrix Multiplication
5. LLM Sampling
6. MLP in PyTorch
7. Backpropagation
8. Transformer
9. Batch Normalization
10. Generative Adversarial Network (GAN)
11. Self Attention
12. Dropout
13. Autoencoder
14. Vector Database
15. CLIP
16. Residual Network (ResNet)
17. Graph Convolutional Network (GCN)
18. Sora's Diffusion Transformer (DiT)
19. Gemini 1.5's Switch Transformer
20. Reinforcement Learning from Human Feedback (RLHF)
1. Mixture of Experts (MoE)
[Figure: worked example. Inputs X1, X2 go through a gate network; taking the max over the gate scores selects Expert 1 or Expert 2, whose weights (with ReLU) produce the outputs Y1, Y2.]
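A minimal sketch of the routing idea in Python with NumPy; the sizes and weights below are illustrative, not the worksheet's values.

    import numpy as np

    def moe_forward(x, W_gate, experts):
        # Top-1 routing: the max gate score picks which expert runs.
        scores = W_gate @ x                 # one score per expert
        k = int(np.argmax(scores))          # the "Max" step in the figure
        h = experts[k] @ x                  # chosen expert's linear layer
        return np.maximum(h, 0.0)           # ReLU

    x = np.array([2.0, 3.0])                            # inputs X1, X2
    W_gate = np.array([[1.0, 0.0], [0.0, 1.0]])         # gate network, 2 experts
    experts = [np.array([[1.0, -1.0], [0.0, 1.0]]),     # Expert 1
               np.array([[-1.0, 1.0], [1.0, 1.0]])]     # Expert 2
    print(moe_forward(x, W_gate, experts))

Only the selected expert's weights are multiplied at all, which is the point of a mixture of experts: capacity grows with the number of experts while per-input compute stays constant.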
2. Recurrent Neural Network (RNN)
[Figure: worked example. Parameters A, B, C; the hidden state starts at H0 = 0 and is updated step by step from the previous state and the current input, producing the output sequence Y.]
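The recurrence fits in a few lines; a NumPy sketch assuming the worksheet's linear update with parameters A, B, C (common RNN variants wrap the state update in tanh). Values are illustrative.

    import numpy as np

    def rnn(xs, A, B, C, h0):
        # h_t = A h_{t-1} + B x_t ; y_t = C h_t
        h, ys = h0, []
        for x in xs:                 # one step per input
            h = A @ h + B @ x        # update the hidden state
            ys.append(C @ h)         # read out the output
        return ys

    A = np.array([[1.0]]); B = np.array([[1.0]]); C = np.array([[2.0]])
    h0 = np.zeros(1)                 # hidden state starts at H0 = 0
    xs = [np.array([1.0]), np.array([-1.0]), np.array([1.0])]
    print(rnn(xs, A, B, C, h0))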
3. Mamba
[Figure: worked example. Per-step (selective) parameters feed a scan that carries the hidden state across the input sequence to produce the output sequence.]
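The scan is the same recurrence as the RNN above, except the parameters change per step; a scalar sketch with illustrative values, where the per-step A_t and B_t stand in for Mamba's input-dependent ("selective") parameters.

    def selective_scan(xs, As, Bs, C):
        # h_t = A_t * h_{t-1} + B_t * x_t ; y_t = C * h_t
        h, ys = 0.0, []
        for a, b, x in zip(As, Bs, xs):
            h = a * h + b * x
            ys.append(C * h)
        return ys

    print(selective_scan(xs=[1.0, 2.0, -1.0],
                         As=[0.5, 1.0, 0.5],   # computed from the input in Mamba
                         Bs=[1.0, 1.0, 1.0],
                         C=2.0))

In the real model this scan is parallelized; the sequential loop here is only for clarity.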
4. Matrix Multiplication
[Figure: worked examples. A set of hand-computed matrix products over small 0/±1 matrices, ending with a product whose unknown matrix X must be solved for ("X = ?").]
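Each output entry is the dot product of a row of the left matrix with a column of the right matrix, exactly the rule the exercises have you apply by hand; a plain triple loop makes it explicit (illustrative matrices).

    def matmul(A, B):
        n, k, m = len(A), len(B), len(B[0])
        C = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                for t in range(k):          # row i of A times column j of B
                    C[i][j] += A[i][t] * B[t][j]
        return C

    print(matmul([[1, 1], [-1, 1]], [[1, 5], [2, 4]]))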
5. LLM Sampling
[Figure: worked example. At each position the LLM outputs a probability distribution over the vocabulary (I, you, they, are, am, how, why, where, who, what); a random number in [0, 1) (e.g. .03, .50, .40, .34, .52, .92, .65) selects the next token through the cumulative distribution.]
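Sampling works by walking the cumulative distribution until it passes the random number; a sketch, with probabilities loosely based on the worksheet's first column.

    def sample(probs, u):
        # Return the token whose cumulative-probability interval contains u.
        c = 0.0
        for token, p in probs:
            c += p
            if u < c:
                return token
        return probs[-1][0]          # guard against rounding error

    vocab = [("how", 0.50), ("why", 0.10), ("where", 0.10), ("who", 0.15),
             ("what", 0.10), ("I", 0.01), ("you", 0.01), ("they", 0.01),
             ("are", 0.01), ("am", 0.01)]
    print(sample(vocab, 0.03))       # 0.03 falls inside "how"'s interval

Because the intervals have the tokens' probabilities as widths, a uniform random number picks each token with exactly its assigned probability.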
6. MLP in PyTorch
[Figure: worked example. A fill-in-the-blanks nn.Sequential listing (exercise lines 3-8): each blank is a module name (nn.______()), and the linear layers also need their sizes and bias flags (nn.______( __, __, bias = __ )). The drawn network interleaves linear layers with ReLU and ends in a sigmoid (σ); hints are provided.]
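One plausible way to fill in the blanks, assuming the drawn network is Linear → ReLU → Linear → ReLU → Linear → Sigmoid; the layer sizes here are read off the worksheet loosely (a 4-value input X, two hidden layers) and should be checked against the drawing.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(4, 3, bias=True),   # sizes and bias flags fill the blanks
        nn.ReLU(),
        nn.Linear(3, 2, bias=True),
        nn.ReLU(),
        nn.Linear(2, 2, bias=True),
        nn.Sigmoid(),                 # squashes outputs into (0, 1)
    )

    x = torch.tensor([2.0, 1.0, 3.0, 1.0])   # the worksheet's input X
    print(model(x))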
7. Backpropagation
[Figure: worked example. The input X (2, 1, 3, 1) flows through Layer 1 (ReLU), Layer 2 (ReLU), and Layer 3 (Softmax) to a prediction YPred, which is compared with the one-hot target YTarget under the cross-entropy loss L; the exercise backpropagates the loss.]
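The fact the exercise builds toward: with a softmax output and cross-entropy loss, the gradient at the logits collapses to YPred − YTarget. A NumPy check with illustrative numbers:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())      # subtract the max for numerical stability
        return e / e.sum()

    z = np.array([2.0, 0.0, -1.0])             # Layer 3 outputs (logits)
    y_pred = softmax(z)
    y_target = np.array([1.0, 0.0, 0.0])       # one-hot target
    loss = -np.sum(y_target * np.log(y_pred))  # cross-entropy loss L
    dL_dz = y_pred - y_target                  # gradient at the logits
    print(loss, dL_dz)

From there, each layer's weight gradient is its input times the incoming gradient, and ReLU simply zeroes the gradient wherever its input was negative.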
8. Transformer
[Figure: worked example. Q and K give an attention weight matrix A; A mixes the token features X1..X5 into attention-weighted features Z1..Z5; a position-wise feed-forward network (linear, ReLU, linear) then transforms each position independently before the next block.]
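"Position-wise" means the same small network runs on every token column separately; only the attention step mixes information across positions. A NumPy sketch with illustrative weights:

    import numpy as np

    def position_wise_ffn(Z, W1, b1, W2, b2):
        # Linear -> ReLU -> Linear, applied to each token (column) of Z.
        return W2 @ np.maximum(W1 @ Z + b1, 0.0) + b2

    Z = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0]])                 # 2 features x 3 tokens
    W1 = np.array([[1.0, -1.0], [0.0, 1.0]]); b1 = np.zeros((2, 1))
    W2 = np.array([[1.0, 1.0]]);              b2 = np.zeros((1, 1))
    print(position_wise_ffn(Z, W1, b1, W2, b2))

The attention step itself is worked in detail under item 11 below.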
9. Batch Normalization
[Figure: worked example. A mini-batch X1..X4 passes through a linear layer and ReLU; batch statistics are computed per feature (sum Σ, mean µ, variance σ², standard deviation σ), and the activations are normalized (subtract µ, divide by σ) before the next layer.]
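The Σ → µ → σ² → σ → normalize pipeline in a few NumPy lines (the learnable scale and shift that follow in a full BatchNorm layer are omitted here):

    import numpy as np

    def batch_norm(H, eps=1e-5):
        # Statistics per feature (row), across the mini-batch (columns).
        mu = H.mean(axis=1, keepdims=True)       # mean µ
        var = H.var(axis=1, keepdims=True)       # variance σ²
        return (H - mu) / np.sqrt(var + eps)     # subtract µ, divide by σ

    H = np.array([[1.0, 0.0, 3.0, 0.0],
                  [0.0, 3.0, 1.0, 1.0]])         # 2 features x 4 examples
    print(batch_norm(H))

The eps term guards against division by zero when a feature has no variance in the batch.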
10. Generative Adversarial Network (GAN)
[Figure: worked example. The generator (ReLU layers) turns noise into fakes F1..F4, shown next to reals X1..X4; the discriminator (ReLU layers, then a sigmoid) scores both, giving predictions Y. Training the discriminator uses targets YD = 0 (fakes) and 1 (reals) with loss gradients ∂LD/∂Z; training the generator uses targets YG = 1 on the fakes, with loss gradients ∂LG/∂Z flowing back through the discriminator into the generator.]
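The two training passes differ only in which samples and targets are used; a minimal PyTorch sketch with illustrative sizes (a real loop would also zero gradients and step two optimizers):

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 4))  # generator
    D = nn.Sequential(nn.Linear(4, 4), nn.ReLU(),
                      nn.Linear(4, 1), nn.Sigmoid())                # discriminator
    bce = nn.BCELoss()

    fake = G(torch.randn(4, 2))      # F1..F4 from random noise
    real = torch.randn(4, 4)         # stand-in for the real samples X1..X4

    # Discriminator pass: fakes -> 0, reals -> 1. Detaching the fakes keeps
    # this step from updating the generator.
    d_loss = bce(D(fake.detach()), torch.zeros(4, 1)) \
           + bce(D(real), torch.ones(4, 1))
    d_loss.backward()

    # Generator pass: the generator wants D to output 1 on its fakes.
    g_loss = bce(D(fake), torch.ones(4, 1))
    g_loss.backward()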
11. Self Attention
[Figure: worked example. WK maps the inputs to keys k1..k4; the query-key scores are scaled, exponentiated (e^x), and divided by their row sums (Softmax) to form the attention weight matrix A; a MatMul of A with the values v1..v4 from WV yields the attention-weighted features z1..z4, which feed the FFN.]
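The full scale → exponentiate → normalize → matmul pipeline from the figure, as a NumPy sketch (illustrative weights; single head, no masking):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = Wq @ X, Wk @ X, Wv @ X               # queries, keys, values
        S = Q.T @ K / np.sqrt(K.shape[0])              # Scale
        E = np.exp(S - S.max(axis=1, keepdims=True))   # e^x (stabilized)
        A = E / E.sum(axis=1, keepdims=True)           # divide by Σ -> Softmax
        return V @ A.T                                 # MatMul: weighted features

    X = np.array([[1.0, 0.0], [0.0, 2.0]])             # 2 features x 2 tokens
    I = np.eye(2)
    print(self_attention(X, I, I, I))

Each row of A sums to 1, so every output z is a convex combination of the value vectors.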
12. Dropout
[Figure: worked example. A random sequence (.61, .39, .75, .40, ...) decides which activations to drop. During training, Dropout(p=0.5) and Dropout(p=0.33) zero activations after each ReLU; at inference the dropout layers are disabled and the same network runs on unseen data. Outputs Y are compared with targets Y' under MSE loss, and the gradients ∂L/∂Y are backpropagated.]
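Training vs. inference in one function; a sketch of inverted dropout (the PyTorch convention), where the worksheet's random sequence plays the role of the mask draws:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p, training):
        # Training: zero each activation with probability p, rescale the rest.
        # Inference: identity, as in the worksheet's unseen-data column.
        if not training:
            return h
        mask = (rng.random(h.shape) >= p).astype(h.dtype)
        return h * mask / (1.0 - p)

    h = np.array([1.0, 3.0, 5.0, 3.0])
    print(dropout(h, 0.5, training=True))
    print(dropout(h, 0.5, training=False))

The 1/(1−p) rescaling keeps the expected activation the same in both modes, which is why inference can simply skip the layer.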
13. Autoencoder
[Figure: worked example. An encoder (linear + ReLU) compresses the input X down to a bottleneck; a decoder (linear + ReLU) expands it back to outputs Y; the reconstruction is compared with targets Y' under MSE loss, and the gradients ∂L/∂Y are backpropagated.]
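A minimal PyTorch sketch; the 5-3-2-3-5 layer sizes are illustrative, not the worksheet's:

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(5, 3), nn.ReLU(),
                            nn.Linear(3, 2), nn.ReLU())   # down to the bottleneck
    decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(),
                            nn.Linear(3, 5))              # back up to input size

    x = torch.randn(4, 5)                       # a batch of 4 inputs
    y = decoder(encoder(x))                     # reconstruction Y
    loss = nn.functional.mse_loss(y, x)         # reconstruction (MSE) loss
    loss.backward()                             # gradients flow decoder -> encoder

Since the targets are the inputs themselves, no labels are needed; the bottleneck forces the network to learn a compressed representation.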
14. Vector Database
[Figure: worked example. Word embeddings for a small vocabulary (a, an, the, how, why, who, what, are, is, am, be, was, you, we, I, they, she, he, me, him, her) feed an encoder (linear & ReLU); mean pooling and a projection produce one vector per text, which is indexed into vector storage; queries such as "big table" and "mini chair" are matched against stored texts like "top hat", "table top", and "big chair".]
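The store-then-search loop in miniature; the embeddings below are illustrative, and the projection layer is folded into the pooling for brevity:

    import numpy as np

    word_vecs = {"big": np.array([2.0, 0.0]), "mini": np.array([-2.0, 0.0]),
                 "table": np.array([0.0, 2.0]), "chair": np.array([0.0, -2.0]),
                 "top": np.array([1.0, 1.0]), "hat": np.array([-1.0, 1.0])}

    def embed(text):
        # Mean-pool the word embeddings of a text into one vector.
        return np.mean([word_vecs[w] for w in text.split()], axis=0)

    # Indexing: one stored vector per text.
    storage = {t: embed(t) for t in ["table top", "big chair", "top hat"]}

    def search(query):
        # Return the stored text whose vector best matches the query vector.
        q = embed(query)
        return max(storage, key=lambda t: storage[t] @ q)

    print(search("big table"))

Real vector databases replace the linear max with an approximate nearest-neighbor index so search stays fast at millions of vectors.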
15. CLIP
[Figure: worked example. A text encoder produces text features and an image encoder produces image features; separate projection layers map both into a shared embedding space, where matching text-image pairs should score highest.]
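The shared-space comparison at CLIP's core; a sketch where the projections are assumed already applied and the feature values are illustrative:

    import numpy as np

    def clip_scores(text_feats, image_feats):
        # L2-normalize both sides, then take all pairwise dot products;
        # the diagonal holds the matching text-image pairs.
        T = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        I = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
        return T @ I.T

    texts  = np.array([[1.0, 0.0], [0.0, 1.0]])    # 2 projected text vectors
    images = np.array([[0.9, 0.1], [0.2, 0.8]])    # 2 projected image vectors
    print(clip_scores(texts, images))              # diagonal should dominate

Training pushes the diagonal scores up and the off-diagonal scores down (a contrastive loss), which is what aligns the two encoders.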
16. Residual Network (ResNet)
[Figure: worked example. The figure traces residual connections through a block: attention over Q and K, then Add & Norm (the block's input is added back before normalizing), then a feed-forward layer and a second Add & Norm before the next block.]
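The Add & Norm step in isolation; a NumPy sketch with illustrative vectors:

    import numpy as np

    def add_and_norm(x, sublayer_out, eps=1e-5):
        # Residual: the block's input skips around the sublayer and is added
        # back, so the sublayer only has to learn a correction to x.
        h = x + sublayer_out                     # Add
        return (h - h.mean()) / (h.std() + eps)  # Norm

    x = np.array([2.0, 0.0, 1.0])                # block input
    f = np.array([1.0, 3.0, -2.0])               # sublayer (e.g. FFN) output
    print(add_and_norm(x, f))

Because gradients flow through the identity path untouched, residual connections are what let very deep networks train at all.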
17. Graph Convolutional Network (GCN)
[Figure: worked example. A five-node graph (A, B, C, D, E) is encoded as an adjacency matrix; each graph-convolution layer multiplies the adjacency matrix with the node features and a weight matrix, passing messages between neighbors through ReLU; a fully connected network then produces the per-node outputs.]
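One layer of message passing in NumPy; the 3-node graph and weights are illustrative, not the worksheet's:

    import numpy as np

    def gcn_layer(A, X, W):
        # A @ X sums each node's neighbors' features (the "messages");
        # W transforms them; ReLU adds the nonlinearity.
        return np.maximum(A @ X @ W, 0.0)

    A = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]], dtype=float)   # adjacency with self-loops
    X = np.array([[1.0], [0.0], [2.0]])      # one feature per node
    W = np.array([[1.0]])
    print(gcn_layer(A, X, W))

Stacking layers lets information travel further: after k layers each node has seen its k-hop neighborhood, after which a fully connected head makes the per-node predictions.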
18. Sora's Diffusion Transformer (DiT)
[Figure: worked example. A pointwise FFN predicts the noise in the latent patches; the predicted noise is compared with the actual noise under MSE loss and the gradients flow back; subtracting the predicted noise recovers the generated video.]
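The training and generation arithmetic in one function; a one-step sketch with illustrative numbers (real diffusion removes noise gradually over many steps):

    import numpy as np

    def dit_step(noisy, predicted_noise, true_noise):
        loss = np.mean((predicted_noise - true_noise) ** 2)          # MSE loss
        grad = 2 * (predicted_noise - true_noise) / true_noise.size  # dL/dpred
        denoised = noisy - predicted_noise       # toward the generated video
        return loss, grad, denoised

    noisy = np.array([1.0, -2.0, 0.5])   # noisy latent patches
    pred  = np.array([0.8, -1.5, 0.0])   # the DiT's predicted noise
    true  = np.array([1.0, -1.0, 0.0])   # the noise actually added
    print(dit_step(noisy, pred, true))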
19. Gemini 1.5's Switch Transformer
[Figure: worked example. An attention block (Q, K, attention weight matrix A) mixes the features X1..X5 into Z1..Z5; a switch computes gate values for Experts A, B, C, and each token is routed to the single expert with the highest gate value, so only one of the three position-wise FFNs runs per token.]
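Top-1 routing per token, which is what distinguishes the switch layer from the dense FFN in item 8; a NumPy sketch with illustrative weights:

    import numpy as np

    def switch_ffn(Z, W_gate, experts):
        # Each token (column) runs through exactly one expert FFN, chosen
        # by its highest gate value.
        out = np.zeros_like(Z)
        gates = W_gate @ Z                    # gate value per expert, per token
        for t in range(Z.shape[1]):
            k = int(np.argmax(gates[:, t]))   # pick Expert A/B/C
            out[:, t] = np.maximum(experts[k] @ Z[:, t], 0.0)
        return out

    Z = np.array([[1.0, 0.0], [0.0, 2.0]])                   # 2 features x 2 tokens
    W_gate = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 experts
    experts = [np.eye(2), -np.eye(2), 2 * np.eye(2)]
    print(switch_ffn(Z, W_gate, experts))

As with the MoE in item 1, compute per token stays constant no matter how many experts exist.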
20. Reinforcement Learning from Human Feedback (RLHF)
[Figure: worked example. Word embeddings for a small vocabulary (him, her, them, is/are, doc, CEO, and the stop token [S]) feed a network of ReLU layers that scores a completion with a single scalar reward.]
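A tiny reward model in NumPy; the mean pooling and all values below are illustrative assumptions, not the worksheet's:

    import numpy as np

    def reward_model(token_embs, W1, w2):
        # Pool the completion's token embeddings, apply a ReLU layer,
        # then read out one scalar reward.
        h = np.maximum(W1 @ token_embs.mean(axis=0), 0.0)
        return float(w2 @ h)

    embs = np.array([[1.0, 1.0, 0.0],      # e.g. "her"
                     [0.0, 1.0, 0.0],      # "is"
                     [2.0, 0.0, -1.0]])    # "CEO"
    W1 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
    w2 = np.array([1.0, -1.0])
    print(reward_model(embs, W1, w2))

In full RLHF the reward model is trained on human preference comparisons, and its scalar output then drives a policy-gradient update (e.g. PPO) of the language model.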