
HC2023 Qualcomm Hexagon NPU


Qualcomm® Hexagon™ NPU

Eric Mahurin
Senior Director, Technology
Qualcomm Technologies, Inc.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Hexagon NPU
High-Performance, Power-Efficient ML Inference Processor for Qualcomm® SoCs

[Diagram: Hexagon processor + Vector eXtensions → Hexagon NPU]

2
Hardware
Hexagon NPU

• Processor executing 3 instruction sets:
  • Scalar: for control flow and general-purpose compute
  • Vector: general-purpose data-parallel compute
  • Tensor: matrix multiply and convolutional layers
• Over multiple threads using shared memories (core local & cached DDR)
• DSP features:
  • VLIW, hardware looping
  • Targets DSP and compute-heavy workloads
• CPU-like features:
  • Virtual → Physical translation, security, caching
  • Branching (call/return/indirect), exceptions, interrupts
  • Conventional software tools (including LLVM)
• Maximized efficient single-core performance
• Make the most of resources

[Block diagram: multi-threaded core with I$, D$, L2, Tightly Coupled Memory (TCM), and DMA/bus; Scalar (Int: 8/16/32/64, Float: 32/64), Vector (Int: 8/16/32, Float: 16/32), Tensor (Int: 4/8/16, Float: 16)]

4
Vector

• Vector SIMD - like other SIMD extensions, but wider
  • 1 Kb = 128B = 64H = 32W wide
  • 8/16/32-bit fixed-point, 16/32-bit floating-point
• Compute and registers
  • 4 compute, 1 load, and 1 store VLIW resources
  • 32 vector registers and 4 vector predicate registers
• Memory access
  • Load/store with L2/DDR or TCM
  • Fully parallel scatter & gather with TCM to address arbitrary data-parallel workloads (see the sketch below)
• Target applications
  • Originally for image processing
  • Adapted to additional workloads including DNNs

[Diagram: vector unit resources (Load, Store, Shift, Multiply, Multiply, X-lane, Scatter, Gather) with 32 Vreg, 4 Vpred, connected to memory]
5
Tensor

• Tensor SIMD
  • Tensor instead of vector as the data-parallel quantum
  • 2D matrices, 3D (X, Y, depth), and 4D (multiple 3D)
• Bit widths (activations * weights):
  • Integer: (8/16) * (4/8/16)
  • Float: FP16 * FP16
• ISA accelerates:
  • Matrix multiply
  • Convolutional layer
  • Depth-wise and other small-group-size convolutions
  • Fused activation functions
  • Per output-channel scaling (see the sketch below)

[Diagram: weights matrix and activations matrix streamed from memory into an accumulation matrix]
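The integer path on this slide can be mimicked in a few lines of numpy: low-precision activations and weights, accumulation in a wider type, then per-output-channel rescaling with a fused activation. The sizes, scale values, and ReLU choice below are illustrative assumptions, not Hexagon specifics.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(-128, 128, size=(32, 64), dtype=np.int8)      # activations (A8)
    W = rng.integers(-8, 8, size=(64, 16), dtype=np.int8)          # weights (W4 value range)

    # Accumulate in a wider type than the inputs (the "accumulation matrix").
    acc = A.astype(np.int32) @ W.astype(np.int32)

    # Per-output-channel scale back to 8 bits, with a fused ReLU.
    scale = rng.uniform(1e-3, 1e-2, size=16)                       # one scale per output channel
    out = np.clip(np.round(acc * scale), 0, 127).astype(np.int8)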

6
Programming Model
Architecture – Threads

• Each thread has its own program
  • VLIW for predictable Instruction Level Parallelism
  • Scalar ISA for control flow and serial compute
• SIMD extensions are acquired/released
  • While acquired, a program has the extension's capabilities
• Scalar, vector, and tensor each have dedicated registers
  • Instructions operate on thread-local registers and [potentially shared] memory

[Diagram: per-thread VLIW front end and I-Cache feeding scalar, vector, and tensor units, each with its own RegFile]

8
Architecture – Memory model

• Coherent memory between threads
  • Normal synchronization between threads
• Only scalar instructions use the L1 D-Cache
  • Prevents pollution by vectors & tensors
• Larger L2 acts as an L1 for vectors
  • In addition to backing the I-Cache & D-Cache
  • Software prefetch hides DDR latency
• TCM for vectors & tensors:
  • Acts as a software-managed cache
  • More scalable than a hardware cache
  • Much higher bandwidth than a typical cache
  • Enables very high-bandwidth scatter/gather
  • Predictable performance – no misses
  • Virtually addressed DMA for hiding DDR latency (see the sketch below)

[Diagram: DDR/SoC/DMA behind an L2 cache and TCM; I-Cache and D-Cache feed the VLIW scalar, vector, and tensor units and their register files]
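A common way to use a software-managed TCM like this is double buffering: DMA the next tile into one buffer while computing on the other, so DDR latency overlaps with compute. The sketch below is a generic illustration in Python; dma_start, dma_wait, and compute_tile are hypothetical placeholders, not the actual Hexagon DMA API.

    # Hypothetical double-buffered tiling over a software-managed TCM.
    # dma_start / dma_wait / compute_tile are placeholders, not a real Hexagon API.
    def process(tiles, dma_start, dma_wait, compute_tile):
        buffers = [bytearray(64 * 1024), bytearray(64 * 1024)]   # two TCM tile buffers
        pending = dma_start(tiles[0], buffers[0])                # prefetch the first tile
        for i in range(len(tiles)):
            dma_wait(pending)                                    # tile i is now resident in TCM
            if i + 1 < len(tiles):
                # Kick off the next transfer so it overlaps with compute on tile i.
                pending = dma_start(tiles[i + 1], buffers[(i + 1) % 2])
            compute_tile(buffers[i % 2])                         # vector/tensor work on TCM data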
9
Efficiency
Tensor Data Locality – Temporal and Spatial

• Data locality is key to tensor compute efficiency
  • N:1 compute:memory ratio for matrix multiply (worked example below)
  • 2N² data read and transferred
  • For 2N³ compute
  • Single biggest reason for a dedicated tensor or matrix engine
• Output stationary:
  • Accumulators are wider bit-width than input activations & weights
  • Accumulate across all input channels and filter taps
• Convolution activations reuse:
  • Input activations with halo read once for each hardware output tensor
  • Convolution as a sequence of unaligned matrix multiplies
• Hardware SIMD tensor contiguous in memory
  • Maximized memory bandwidth, just like a SIMD vector is contiguous
  • Responsibility of software to organize data in these chunks

[Diagram: two N² data operands feeding N³ MACs]
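A back-of-the-envelope check of the compute:memory ratio for a square N × N matrix multiply; the N = 1024 tile size is just an example.

    # Arithmetic intensity of an N x N matrix multiply.
    N = 1024                      # example tile size
    data_elems = 2 * N * N        # two input matrices read (2N^2 elements)
    ops = 2 * N * N * N           # 2N^3 operations (N^3 multiply-accumulates)
    print(ops / data_elems)       # -> N = 1024 ops per element transferred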
11
Floating-Point vs. Fixed-Point

• With tensor hardware, raw computational area & energy dominate, making compute efficiency paramount
• With Normal (normally distributed) data, precision bits yield accuracy, not exponent bits
  • Going beyond a 2-bit exponent only harms accuracy with optimal SNR scaling
  • Simple int/fixed-point yields the lowest cost for any accuracy
• Floating-point vs. fixed-point (FXP) accumulation increases overhead
  • Fixed-point or Kulisch accumulation is best for small exponent widths: exact and lower cost
• DNNs typically have Normal data, but not always
  • Floating point vs. fixed point studied in (van Baalen et al., 2023)
  • CNNs are relatively Normal - INT8 outperforms FP8 with both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)
  • Transformers have outliers due to softmax (Bondarenko et al., 2023), but QAT naturally pulls in the outliers
• Large fixed-point capacity + floating-point
  • Fixed-point: A16W16, A16W8, A8W8, A8W4
  • Floating-point: FP16*FP16

Accuracy: FP32 baseline and change under each scheme (van Baalen et al., 2023):

Model         FP32      INT8 PTQ   FP8E4 PTQ   A8W4 QAT   INT8 QAT   FP8E4 QAT
ResNet18      69.72%    -0.08%     -1.15%       0.29%      0.71%     -0.37%
MobileNetV2   71.70%    -0.76%     -5.65%      -0.53%      0.12%     -0.81%
HRNet         81.05%    -0.12%     -0.28%       0.22%      0.01%        –
DeeplabV3     72.91%    -1.67%    -34.98%       0.10%      1.08%      0.31%
SalsaNext     55.80%    -1.58%     -0.68%      -0.80%     -0.60%        –
BERT (GLUE)   83.06%   -12.03%     -0.26%      -0.42%      0.20%      0.85%

12
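As a minimal illustration of the INT8 PTQ baseline compared above, here is symmetric per-tensor int8 quantize/dequantize in numpy. The max-abs scale is just the simplest calibration choice, an assumption for this sketch rather than the method used in the study.

    import numpy as np

    def quantize_int8(x):
        """Symmetric per-tensor int8 quantization: scale = max|x| / 127."""
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    x = np.random.randn(1 << 14).astype(np.float32)       # "Normal" data, as on the slide
    q, s = quantize_int8(x)
    err = x - q.astype(np.float32) * s
    snr_db = 10 * np.log10((x ** 2).mean() / (err ** 2).mean())
    print(f"int8 SNR ~ {snr_db:.1f} dB")                  # roughly 40 dB for Gaussian data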
Performance & Energy vs. Bits

• Lower bit widths affect accuracy, but improve many other dimensions:
  • Memory footprint/bandwidth/energy – TCM & DDR, activations & weights
  • Compute bandwidth/energy
• Can scale quadratically vs. bit width:
  • For matrix/convolution-dominated workloads
  • Linear scaling with the bit width of each operand (see the sketch below)
  • Smaller widths fit better into memory (more locality)
• Multiple PTQ & QAT techniques are used to maximize accuracy with reduced bits
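The quadratic scaling follows from the product of the operand widths: to first order, multiplier cost tracks bits(activation) × bits(weight). A quick relative comparison of the fixed-point modes listed earlier (a first-order model, not measured hardware data):

    # Relative throughput at fixed area/energy, assuming cost ~ activation bits x weight bits.
    configs = {"A16W16": (16, 16), "A16W8": (16, 8), "A8W8": (8, 8), "A8W4": (8, 4)}
    base = 16 * 16
    for name, (a, w) in configs.items():
        print(f"{name}: {base / (a * w):.0f}x the MACs of A16W16")
    # -> A16W16: 1x, A16W8: 2x, A8W8: 4x, A8W4: 8x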

13
Pruning vs. Quantization

• At the cost of accuracy, pruning weights:
  • Enables a smaller [compressed] model
  • Reduces compute energy with more zeros
  • Allows for skipping compute for each zero
  • But is costly (area/energy) for the dense case in a tensor architecture
• Quantization also costs accuracy, with similar [potential] benefits
  • Zeroes specific lower bits rather than random elements (toy comparison below)
• Comparison from (Kuzmin et al., 2023) shown:
  • Quantization consistently better for a given compressed/packed model size in bits
• Focus on deep quantization rather than deep pruning
  • But pruned/compressed models are supported

[Chart (Kuzmin et al., 2023): accuracy vs. compressed model bits, P: Pruning vs. Q: Quantization]
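A toy numpy comparison of the two styles described above: pruning zeroes whole (here randomly chosen) weights, while quantization keeps every weight but drops its low-order bits. The 50% sparsity and 4-bit settings are arbitrary examples, not matched-compression points; (Kuzmin et al., 2023) does the matched-size comparison properly.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(10_000).astype(np.float32)          # toy weight tensor

    # Pruning: zero out a random 50% of the weights (unstructured sparsity).
    pruned = w.copy()
    pruned[rng.random(w.size) < 0.5] = 0.0

    # Quantization: keep every weight at 4-bit precision (symmetric, per-tensor).
    scale = np.abs(w).max() / 7.0                               # signed 4-bit range [-7, 7]
    quant = np.clip(np.round(w / scale), -7, 7) * scale

    print("pruning MSE:     ", float(((w - pruned) ** 2).mean()))
    print("quantization MSE:", float(((w - quant) ** 2).mean()))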

14
Application
Target Industries

• Single architecture across a wide range of platforms
  • Specific implementations with different configurations are used across industries
  • Maximized energy-efficient single-core performance, and multi-core when needed
  • Typically, multiple concurrent uses on any given platform, including with single-core
  • Programmability is key to adapting to new demands

IoT · Mobile · AR · PC · Auto · Cloud

16
Performance

[Chart: performance across generations for Scalar+Vector and Scalar+Vector+Tensor configurations]

Leading-edge performance with 50% to >100% year-over-year gains


17
References

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, et al. (2023). "FP8 versus INT8 for efficient deep learning inference", https://arxiv.org/abs/2303.17951

Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort. (2023). "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", https://arxiv.org/abs/2306.12929

Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort. (2023). "Pruning vs Quantization: Which is Better?", https://arxiv.org/abs/2307.02973

18
Thank you
Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018-2023 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm is a trademark or registered trademark of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business. Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.

For more information, visit us at: qualcomm.com & qualcomm.com/blog
