Contents
Author Biographies xi
Preface xiii
Acknowledgments xv
Table of Figures xvii
1 Introduction 1
1.1 Development History 2
References 12
2 Deep Learning 13
2.1 Neural Network Layer 13
3 Parallel Architecture 25
3.1 Intel Central Processing Unit (CPU) 25
3.1.1 Skylake Mesh Architecture 27
3.1.2 Intel Ultra Path Interconnect (UPI) 28
3.1.3 Sub Non-unified Memory Access Clustering (SNC) 29
3.1.4 Cache Hierarchy Changes 31
3.1.5 Single/Multiple Socket Parallel Processing 32
3.1.6 Advanced Vector Software Extension 33
3.1.7 Math Kernel Library for Deep Neural Network (MKL-DNN) 34
3.2 NVIDIA Graphics Processing Unit (GPU) 39
3.2.1 Tensor Core Architecture 41
3.2.2 Winograd Transform 44
3.2.3 Simultaneous Multithreading (SMT) 45
3.2.4 High Bandwidth Memory (HBM2) 46
3.2.5 NVLink2 Configuration 47
3.3 NVIDIA Deep Learning Accelerator (NVDLA) 49
3.3.1 Convolution Operation 50
3.3.2 Single Data Point Operation 50
3.3.3 Planar Data Operation 50
3.3.4 Multiplane Operation 50
3.3.5 Data Memory and Reshape Operations 51
3.3.6 System Configuration 51
3.3.7 External Interface 52
3.3.8 Software Design 52
3.4 Google Tensor Processing Unit (TPU) 53
3.4.1 System Architecture 53
3.4.2 Multiply–Accumulate (MAC) Systolic Array 55
3.4.3 New Brain Floating-Point Format 55
3.4.4 Performance Comparison 57
3.4.5 Cloud TPU Configuration 58
3.4.6 Cloud Software Architecture 60
3.5 Microsoft Catapult Fabric Accelerator 61
3.5.1 System Configuration 64
3.5.2 Catapult Fabric Architecture 65
3.5.3 Matrix-Vector Multiplier 65
3.5.4 Hierarchical Decode and Dispatch (HDD) 67
3.5.5 Sparse Matrix-Vector Multiplication 68
Exercise 70
References 71
5 Convolution Optimization 85
5.1 Deep Convolutional Neural Network Accelerator 85
5.1.1 System Architecture 86
5.1.2 Filter Decomposition 87
5.1.3 Streaming Architecture 90
5.1.3.1 Filter Weights Reuse 90
5.1.3.2 Input Channel Reuse 92
5.1.4 Pooling 92
5.1.4.1 Average Pooling 92
5.1.4.2 Max Pooling 93
5.1.5 Convolution Unit (CU) Engine 94
5.1.6 Accumulation (ACCU) Buffer 94
5.1.7 Model Compression 95
5.1.8 System Performance 95
5.2 Eyeriss Accelerator 97
5.2.1 Eyeriss System Architecture 97
5.2.2 2D Convolution to 1D Multiplication 98
5.2.3 Stationary Dataflow 99
5.2.3.1 Output Stationary 99
5.2.3.2 Weight Stationary 101
5.2.3.3 Input Stationary 101
5.2.4 Row Stationary (RS) Dataflow 104
5.2.4.1 Filter Reuse 104
5.2.4.2 Input Feature Maps Reuse 106
5.2.4.3 Partial Sums Reuse 106