“Temporal Event Neural Networks: A More Efficient Alternative to the Transformer,” a Presentation from BrainChip

Temporal Event Neural Networks: A More Efficient Alternative to the Transformer
Chris Jones
Director Product Management
BrainChip Inc.

BrainChip AI – At a Glance
• First to commercialize neuromorphic IP platform and reference chip
• 15+ years of fundamental research
• 65+ data science, hardware & software engineers
• Publicly traded on the Australian Stock Exchange (BRD:ASX)
• 10 Customers – Early Access, Proof of Concept, IP License
Products: IP, Reference SoC, Software Tools, Edge Box*
*Fulfillment through VVDN Technologies
Trusted by / Partners: [logos]
©2024 BrainChip Inc.

Key Focal Areas
• Provide path to run complex models on the Edge
• Reduce cost of training
• Reduce cost of inference

Temporal Event Neural Networks (TENNs)

Change the Game
Unleash Unprecedented Edge Devices
One-dimensional streaming data:
• Up to 5000X more energy efficient
• Up to 50X fewer parameters
• Same or better accuracy
• 10-30X lower training cost vs. GPT-2

TENNs Application Areas
1. Multi-dimensional streaming requiring spatiotemporal integration (3D)
• Video object detection – frames are correlated in time
• Action recognition – classifying an action across many frames
• Video frame prediction – path prediction & planning
2. Sequence classification and generation in time
• Raw audio classification: keyword spotting without MFCC preprocessing
• Audio denoising: generate contextual denoising
• ASR and GenAI: compressing LLMs
3. Any other sequence classification or prediction algorithms
• Healthcare: vital signs estimation
• Anything that can be transformed into a time-series/sequence prediction problem
Example datasets:
• Spatiotemporal integration: Kinetics400, KITTI
• Sequence classification & generation: BIDMC Vital Signs, SC10 Raw Audio, Microsoft DNS Challenge

Improve Video Object Detection

Frame-Based Camera Comparison
(vs SimCLR + ResNet50 using KITTI 2D Dataset**)
Resolution: 1382 x 512

Network                    mAP (%)   Parameters (millions)   MACs / sec (billions)
Akida TENN* + CenterNet    57.6      0.57                    18

• Equivalent precision
• 50x fewer parameters
• 5x fewer operations
• < 20 mW for 30 FPS in 7 nm***

Event-Based Camera Comparison
(vs Gray RetinaNet + Prophesee Road Object Dataset*)
Resolution: 1280 x 720

Network                    mAP (%)   Parameters (millions)   MACs / sec (billions)
Akida TENN* + CenterNet    56        0.57                    94

• 30% better precision
• 50x fewer parameters
• 30x fewer operations

* Gray RetinaNet is the latest state of the art in event-camera object detection
** SimCLR with a ResNet50 backbone is the benchmark in object detection. Source: SimCLR Review
*** Estimates for Akida neural processing scaled from 28 nm

TENN Can Be Extended to Spatio-Temporal Data
DVS Hand Gesture Recognition: IBM DVS128 Dataset
State of the Art

Network          Accuracy (%)   Parameters   MACs (billion) / sec   Latency* (ms)
TrueNorth-CNN    96.5           18 M         -                      155
Loihi-Slayer     93.6           -            -                      1450
ANN-Rollouts     97.0           500 k        10.4                   1500
TA-SNN           98.6           -            -                      1500
Akida-CNN        95.2           138 k        0.12                   200
TENN-Fast        97.6           192 k        0.429                  105
TENN             100.0          192 k        0.499                  510

Enhance Raw Audio and Speech Processing

Task: Audio Denoising
Comparison of TENN Versus SoTA

Model                       Deep Filter Net V1   TENN   Deep Filter Net V2   Deep Filter Net V3
PESQ                        2.49                 2.61   2.67                 2.68
Params (relative to TENN)   2.98                 1      3.86                 3.56
MACs (relative to TENN)     11.7                 1      12.1                 11.5
Traditional Denoising Model Approach: STFT → Conv1D/LSTM/GRU → iSTFT
(STFT and iSTFT potentially consume 50%+ of total power)
TENNs Model Approach: TENNs only
(STFT/iSTFT overhead and BOM not needed with TENNs)
• Audio denoising isolates a voice signal obscured by background noise
• Traditional approach employs a computationally intensive time-domain to frequency-domain transform and the inverse transform
• TENNs approach avoids expensive data transformations

TENN vs GPT2
Single-thread CPU performance, 11th Gen Intel i7, 3.00 GHz
Both models were prompted with the first 1024 words of the first Harry Potter novel
• TENN: > 2100 tokens/minute
• GPT2: < 10 tokens/minute

Task: Sentence Generation
(Params, Energy, and the training-time multiples are relative to TENN)

Model         Train size   Score   Params   Energy   Training time
GPT2 Small    13 GB        9.7     1.35     1700     ~768 GPU hours (21x)
GPT2 Medium   13 GB        10.2    4.8      5700     ~2264 GPU hours (62.8x)
TENN          0.1 GB       10.3    1        1        35 GPU hours
Mamba 130M    836 GB       10.4    2.06     2.06     -
GPT2 Large    13 GB        10.4    10.4     13000    -
GPT2 Full     13 GB        10.8    21.7     27000    -
Mamba 370M    836 GB       10.9    5.9      5.9      -
1. TENN trained on WikiText-103 (100M tokens)
2. GPT models trained on open_web_text; Mamba trained on the Pile
3. TENN training time: ~1.5 days on (1) A100 (35 GPU hours)
4. GPT-2 Small training time: 4 days on (8) A100 (768 hours)
5. GPT-2 Medium training time is estimated
6. Scores reported as negative entropy: $-\log_2(1/\mathrm{VocabSize}) - \log_2(\mathrm{perplexity})$ (higher is better)
7. Input (context) was 1024 tokens
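
The footnotes on training time and scoring reduce to simple arithmetic, sketched below. The function names are illustrative and not from the presentation, and the vocabulary size and perplexity values are hypothetical, chosen only to exercise the formula.

```python
import math


def gpu_hours(days: float, num_gpus: int) -> float:
    """Wall-clock days on num_gpus GPUs, expressed as GPU hours."""
    return days * 24 * num_gpus


def negative_entropy_score(vocab_size: int, perplexity: float) -> float:
    """Score from footnote 6: -log2(1/VocabSize) - log2(perplexity)."""
    return -math.log2(1.0 / vocab_size) - math.log2(perplexity)


# Footnote 4: 4 days on 8 A100s.
gpt2_small_hours = gpu_hours(4, 8)
# Footnote 3: 1.5 days on 1 A100 (the slide quotes this as roughly 35 GPU hours).
tenn_hours = gpu_hours(1.5, 1)
ratio = gpt2_small_hours / tenn_hours

# Hypothetical values: a model that guesses uniformly over the vocabulary has
# perplexity equal to the vocabulary size, so its score is 0.
uniform_score = negative_entropy_score(1024, 1024.0)
# A lower perplexity on the same vocabulary gives a higher score.
better_score = negative_entropy_score(1024, 32.0)

print(gpt2_small_hours, tenn_hours, round(ratio, 1), uniform_score, better_score)
```

Computed this way, 1.5 days on one GPU is 36 GPU hours, and 768 / 36 is about 21.3, which lines up with the 21x entry in the table.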

Technical Details

Learning Continuous Convolution Kernels
• Colored plane represents the continuous kernel we’re trying to learn
• Red arrows represent the individual weights in a 7x7 filter
• A large number of weights requires a large amount of computation
• Results in slow training and large memory bottlenecks

Representing Convolution Kernels with Orthogonal Polynomials
• TENNs learn the continuous kernel directly through polynomial expansion.
• The polynomial coefficients are learned through backpropagation.
• Training is much faster because the polynomials are orthogonal to each other, so their coefficients (weights) converge independently and do not affect each other.
Chebyshev polynomial basis can lead to exponential convergence for a wide range of functions, including those with singularities or discontinuities.*
*Lloyd N. Trefethen. 2019. Approximation Theory and Approximation Practice, Extended Edition. SIAM (Society for Industrial and Applied Mathematics), Philadelphia, PA, USA.
[Figure: Chebyshev polynomial]
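
As a minimal sketch of the idea above, the snippet below builds a sampled kernel from a handful of Chebyshev coefficients and checks the discrete orthogonality of the basis at Chebyshev nodes. The coefficient values, the tap count, and the function name `chebyshev_kernel` are hypothetical; the slides do not say how TENNs sample or scale the kernel, and this check does not by itself demonstrate faster training.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def chebyshev_kernel(coeffs, num_taps):
    """Sample h(x) = sum_l a_l T_l(x) at num_taps evenly spaced points on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, num_taps)
    return C.chebval(x, coeffs)


# Hypothetical "learned" coefficients a_0..a_4: 5 numbers describe 64 kernel taps.
coeffs = np.array([0.5, -0.2, 0.1, 0.05, -0.03])
kernel = chebyshev_kernel(coeffs, num_taps=64)

# Discrete orthogonality: at the N Chebyshev nodes, the Gram matrix of
# T_0..T_4 is diagonal (N for T_0, N/2 for the others).
N = 16
nodes = np.cos(np.pi * (np.arange(N) + 0.5) / N)
basis = C.chebvander(nodes, 4)   # shape (N, 5); column l holds T_l(nodes)
gram = basis.T @ basis

print(kernel.shape)
print(np.round(gram, 6))
```

The diagonal Gram matrix is what "orthogonal to each other" means in practice: each coefficient can be read off by projecting onto its own polynomial, without interference from the others.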

Visualizing the Computation
Input buffer $I(t)$ over time $t$, shown at timesteps 22, 23, 24, 25
Polynomials $C_1, \dots, C_{12}$ multiplied by coefficients $a_l$
Kernel: $h(t - \tau) = \sum_{l=0}^{L} a_l C_l(t - \tau)$
Convolution: $\chi(t) = (h * I)(t) = \int_{t-D}^{t} h(t - \tau)\, I(\tau)\, d\tau \approx \sum_{k=22}^{25} h(t - k)\, I(k)$
$h$: [0.011, 0.871, 0.235, 0.678, 0.547, 0.298, 0.045, 0.945, 0.478, 0.284, 0.765, 0.199]
$h(\cdot) = a_1 C_1(\cdot) + a_2 C_2(\cdot) + a_3 C_3(\cdot) + a_4 C_4(\cdot) + a_5 C_5(\cdot) + a_6 C_6(\cdot) + \dots + a_n C_n(\cdot)$
$\chi(t = 25) = \sum_{k=22}^{25} h(25 - k)\, I(k) = h(3)\, I(22) + h(2)\, I(23) + h(1)\, I(24) + h(0)\, I(25)$
Nonlinear output: $o(t) = f(\chi(t))$, where $f(\cdot)$ is a nonlinear activation function
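
The worked sum for $\chi(t = 25)$ can be reproduced numerically. This is a sketch under stated assumptions: the input samples and the four kernel taps are hypothetical, the helper name `causal_conv_at` is illustrative, and ReLU stands in for the unspecified activation $f$.

```python
import numpy as np


def causal_conv_at(h, I, t):
    """chi(t) = sum_{k=t-D}^{t} h(t - k) * I(k), with D = len(h) - 1 (requires t >= D)."""
    D = len(h) - 1
    return sum(h[t - k] * I[k] for k in range(t - D, t + 1))


rng = np.random.default_rng(0)
I = rng.standard_normal(26)          # hypothetical input samples I(0)..I(25)
h = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical kernel taps h(0)..h(3)

chi_25 = causal_conv_at(h, I, 25)

# The same value written out term by term, as on the slide.
expanded = h[3] * I[22] + h[2] * I[23] + h[1] * I[24] + h[0] * I[25]

# ReLU used here only as an example of a nonlinear activation f.
output = max(chi_25, 0.0)

print(np.isclose(chi_25, expanded), np.isclose(np.convolve(I, h)[25], chi_25))
```

The last line also confirms that the windowed sum matches entry 25 of NumPy's full discrete convolution, so the slide's expression is an ordinary causal convolution evaluated at one timestep.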

Buffer Mode vs Recurrent Mode
Recurrence: Chebyshev polynomials satisfy a recurrence relation.
Duality: This recurrence gives TENNs a dual formulation, so the same network can run in buffer mode or in recurrent mode.

Buffer (Convolutional) Mode
Overview: Buffers inputs over time
Benefits:
• Speeds up training by reading the memory buffer in parallel
• Training stability is improved by orthogonality
Drawback: Higher memory usage

Recurrent Mode
Overview: Updates the previous state over time
Benefits:
• Saves memory by generating polynomials recurrently, timestep by timestep
• Lower memory usage benefits inference
Drawback: Training has to be done sequentially
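
To illustrate the memory trade-off, the sketch below generates Chebyshev polynomial values with the standard three-term recurrence, keeping only the two most recent terms, and compares them with a fully buffered table. The generator name is hypothetical, and the recurrence shown here runs over polynomial degree; how TENNs map it onto per-timestep state updates is not described in the slides.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def chebyshev_by_recurrence(x, max_degree):
    """Yield T_0(x), ..., T_max_degree(x) while holding only the two previous terms."""
    prev, curr = np.ones_like(x), x.copy()
    yield prev
    if max_degree >= 1:
        yield curr
    for _ in range(2, max_degree + 1):
        # Standard recurrence: T_{n+1}(x) = 2 x T_n(x) - T_{n-1}(x)
        prev, curr = curr, 2.0 * x * curr - prev
        yield curr


x = np.linspace(-1.0, 1.0, 9)

# Generated one degree at a time (stacked here only so the two can be compared).
recurrent = np.stack(list(chebyshev_by_recurrence(x, 5)), axis=1)

# Whole table held in memory at once, as a reference.
buffered = C.chebvander(x, 5)    # shape (9, 6)

max_abs_diff = float(np.max(np.abs(recurrent - buffered)))
print(recurrent.shape, max_abs_diff)
```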

Getting It to Market

Hardware IP to Run TENNs on the Edge
Key Hardware Features
• Digital, event-based, at-memory compute
• Highly scalable
• Each node connected by mesh network
• Inside each node is an event-based TENN processing unit

BrainChip’s Differentiation: Akida Technology Foundations
Fundamentally different. Extremely efficient.

BrainChip Resources
TENNs Paper “Building Temporal Kernels with Orthogonal Polynomials”
https://bit.ly/brainchip_tenns
TENNs White Paper
https://brainchip.com/temporal-event-based-neural-networks-a-new-approach-to-temporal-processing/
Akida 2nd Generation
https://brainchip.com/wp-content/uploads/2023/03/BrainChip_second_generation_Platform_Brief.pdf
BrainChip Enablement Platforms
https://brainchip.com/akida-enablement-platforms/
Visit Us @ Booth #618

Backup Slides

Temporal Event Based Neural Nets (TENNs)
Improve Efficiency Without Compromising Accuracy
• Simplifies solution to complex problems
• Reduces model size and footprint without loss in accuracy
• Easy to train (CNN-like pipeline)
• Supports longer range dependencies than RNNs

TENN Has Two Modes: Buffer and Recurrent Modes
Principles:
1. Recurrence: Chebyshev and Legendre polynomials satisfy recurrence relations.
2. Duality: The recurrence gives rise to a duality between buffer mode and recurrent mode.
3. Stable training: Train in buffer mode.
4. Fast running: Run in recurrent mode, with a small footprint.
5. Insight: TENNs and state space models (SSMs) are stacks of generalized Fourier filters running in recurrent mode, with nonlinearities between layers.
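
For reference, the standard three-term recurrences for the two polynomial families named in principle 1 are given below. These are textbook identities rather than material from the slides.

```latex
% Chebyshev polynomials of the first kind
T_0(x) = 1, \qquad T_1(x) = x, \qquad
T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)

% Legendre polynomials
P_0(x) = 1, \qquad P_1(x) = x, \qquad
(n+1)\,P_{n+1}(x) = (2n+1)\,x\,P_n(x) - n\,P_{n-1}(x)
```

In both cases each new polynomial depends only on the two before it, which is why the basis can be generated step by step instead of being stored.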

TENN Has Two Modes: Buffer and Recurrent Modes
Kernel (both modes): $h(t) = \sum_{l=0}^{L} a_l C_l(t)$

Buffer mode for fast parallel training:
• Buffer for $h(t)$ and buffer for $I(t)$
• Convolution is a dot product over the two buffers: $\chi = h * I(t) = \tilde{\boldsymbol{h}} \cdot \boldsymbol{I} = \sum_k \tilde{h}_k I_k$
• Entire kernel is stored in a memory buffer accessible at once
• Convolution is computed in the conventional way

Recurrent mode saves memory:
• $L$ convolutions over polynomials: $\chi_l = C_l * I(t)$
• Kernel convolution: $\chi = \sum_{l=0}^{L} a_l \chi_l$
• Polynomials are generated recurrently, timestep by timestep, and not stored in memory
• Convolution of the input over $L$ polynomials is computed timestep by timestep and accumulated over time ($L$ separate convolutions)
• Kernel convolution is the $L$ polynomial convolutions weighted by the polynomial coefficients and summed
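
The two formulations agree because convolution is linear: convolving the input with the summed kernel gives the same result as convolving with each polynomial separately and then taking the weighted sum. The sketch below checks only this algebraic identity. The coefficients, tap count, and input are hypothetical, and for simplicity the polynomial samples are stored in an array here, whereas a true recurrent mode would generate them on the fly.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

num_taps, L = 16, 4
x = np.linspace(-1.0, 1.0, num_taps)
polys = C.chebvander(x, L).T                   # shape (L + 1, num_taps); row l is C_l sampled
a = np.array([0.5, -0.2, 0.1, 0.05, -0.03])    # hypothetical coefficients a_0..a_L

rng = np.random.default_rng(0)
I = rng.standard_normal(200)                   # hypothetical input stream

# Buffer-mode view: build the whole kernel h = sum_l a_l C_l, then convolve once.
h = a @ polys
chi_buffer = np.convolve(I, h)[: len(I)]

# Polynomial-wise view: one convolution per polynomial, chi_l = C_l * I,
# then chi = sum_l a_l chi_l.
chi_l = np.stack([np.convolve(I, polys[l])[: len(I)] for l in range(L + 1)])
chi_sum = a @ chi_l

print(np.allclose(chi_buffer, chi_sum))
```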
