Contents
Author Biographies xi
Preface xiii
Acknowledgments xv
Table of Figures xvii
1 Introduction 1
1.1 Development History 2
References 12
2 Deep Learning 13
2.1 Neural Network Layer 13
3 Parallel Architecture 25
3.1 Intel Central Processing Unit (CPU) 25
3.1.1 Skylake Mesh Architecture 27
3.1.2 Intel Ultra Path Interconnect (UPI) 28
3.1.3 Sub Non-unified Memory Access Clustering (SNC) 29
3.1.4 Cache Hierarchy Changes 31
3.1.5 Single/Multiple Socket Parallel Processing 32
3.1.6 Advanced Vector Software Extension 33
3.1.7 Math Kernel Library for Deep Neural Network (MKL-DNN) 34
3.2 NVIDIA Graphics Processing Unit (GPU) 39
3.2.1 Tensor Core Architecture 41
3.2.2 Winograd Transform 44
3.2.3 Simultaneous Multithreading (SMT) 45
3.2.4 High Bandwidth Memory (HBM2) 46
3.2.5 NVLink2 Configuration 47
3.3 NVIDIA Deep Learning Accelerator (NVDLA) 49
3.3.1 Convolution Operation 50
3.3.2 Single Data Point Operation 50
3.3.3 Planar Data Operation 50
3.3.4 Multiplane Operation 50
3.3.5 Data Memory and Reshape Operations 51
3.3.6 System Configuration 51
3.3.7 External Interface 52
3.3.8 Software Design 52
3.4 Google Tensor Processing Unit (TPU) 53
3.4.1 System Architecture 53
3.4.2 Multiply–Accumulate (MAC) Systolic Array 55
3.4.3 New Brain Floating-Point Format 55
3.4.4 Performance Comparison 57
3.4.5 Cloud TPU Configuration 58
3.4.6 Cloud Software Architecture 60
3.5 Microsoft Catapult Fabric Accelerator 61
3.5.1 System Configuration 64
3.5.2 Catapult Fabric Architecture 65
3.5.3 Matrix-Vector Multiplier 65
3.5.4 Hierarchical Decode and Dispatch (HDD) 67
3.5.5 Sparse Matrix-Vector Multiplication 68
Exercise 70
References 71
5 Convolution Optimization 85
5.1 Deep Convolutional Neural Network Accelerator 85
5.1.1 System Architecture 86
5.1.2 Filter Decomposition 87
5.1.3 Streaming Architecture 90
5.1.3.1 Filter Weights Reuse 90
5.1.3.2 Input Channel Reuse 92
5.1.4 Pooling 92
5.1.4.1 Average Pooling 92
5.1.4.2 Max Pooling 93
5.1.5 Convolution Unit (CU) Engine 94
5.1.6 Accumulation (ACCU) Buffer 94
5.1.7 Model Compression 95
5.1.8 System Performance 95
5.2 Eyeriss Accelerator 97
5.2.1 Eyeriss System Architecture 97
5.2.2 2D Convolution to 1D Multiplication 98
5.2.3 Stationary Dataflow 99
5.2.3.1 Output Stationary 99
5.2.3.2 Weight Stationary 101
5.2.3.3 Input Stationary 101
5.2.4 Row Stationary (RS) Dataflow 104
5.2.4.1 Filter Reuse 104
5.2.4.2 Input Feature Maps Reuse 106
5.2.4.3 Partial Sums Reuse 106