An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators
Abstract
1 Introduction
2 Related Work
3 Key Definitions and Notation
4 Problem Formulation
4.1 Prediction Framework
4.2 Design Space Exploration
5 Overview of Our Approach
5.1 Demonstration Platforms and Simulators
5.2 Sampling Method
5.3 ML Models
5.4 Two-stage Model
5.5 Design Space Exploration
6 Graph Generation
7 Experimental Setup
7.1 Data Generation
Platform | Feature | Candidate Values | Description |
---|---|---|---|
TABLA | PU | 4, 8 | # processing units |
TABLA | PE | 8, 16 | # processing engines in each PU |
TABLA | bitwidth | 8, 16 | bit width of internal bus |
TABLA | input bitwidth | 16, 32 | bit width of IO bus |
TABLA | benchmark\(^{3}\) | recommender systems, backpropagation | ML algorithms |
GeneSys | weight data width | 4 – 8 (integer) | bit width of weight data (bit) |
GeneSys | activation data width | 4 – 8 (integer) | bit width of input activation data (bit) |
GeneSys | accumulation width | 32 (integer) | bit width of output accumulation (bit) |
GeneSys | WBUF capacity | 16 – 256 (integer) | size of weight buffer (KB) |
GeneSys | IBUF capacity | 16 – 128 (integer) | size of input buffer (KB) |
GeneSys | OBUF capacity | 128 – 1024 (integer) | size of output buffer (KB) |
GeneSys | SIMD VMEM capacity | 128 – 1024 (integer) | size of vector memory in VMEM (KB) |
GeneSys | WBUF AXI data width | 64 – 256 (integer) | AXI bandwidth for the WBUF (bits/cycle) |
GeneSys | IBUF AXI data width | 128 – 256 (integer) | AXI bandwidth for the IBUF (bits/cycle) |
GeneSys | OBUF AXI data width | 128 – 256 (integer) | AXI bandwidth for the OBUF (bits/cycle) |
GeneSys | SIMD AXI data width | 128 – 256 (integer) | AXI bandwidth for the VMEM (bits/cycle) |
VTA | weight data width | 8 (integer) | bit width of weight data (bit) |
VTA | activation data width | 8 (integer) | bit width of input activation data (bit) |
VTA | accumulation width | 32 (integer) | bit width of output accumulation (bit) |
VTA | WBUF capacity | 16 – 256 (integer) | size of weight buffer (KB) |
VTA | IBUF capacity | 16 – 128 (integer) | size of input buffer (KB) |
VTA | OBUF capacity | 32 – 512 (integer) | size of output buffer (KB) |
VTA | off-chip bandwidth | 64 – 512 (integer) | total external bandwidth (bits/cycle) |
Axiline | benchmark\(^{3}\) | SVM, linear regression, logistic regression, recommender systems | ML algorithms |
Axiline | bitwidth | 8, 16 | bit width for computation units |
Axiline | input bitwidth | 4, 8 | bit width for initial inputs |
Axiline | size | 5 – 60 (integer) | dimension of inner product stage or SGD stage (both are the same) |
Axiline | num of cycles | 1 – 25 (integer) | number of cycles required for stages 1 or 3 to process one input vector |
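
As an illustration of how the integer ranges in the table above could be turned into a small set of sample configurations, the following is a minimal pure-Python sketch of Halton sampling over the VTA buffer and bandwidth ranges. The dictionary keys, the function names, and the choice of prime bases are assumptions made for this example; the sketch is not the framework's own sampling code, and a real flow may further restrict which integer values are legal.

```python
from typing import Dict, List, Tuple

# Inclusive integer ranges copied from the VTA rows of the table above.
# The key names are hypothetical labels chosen for this sketch.
VTA_SPACE: Dict[str, Tuple[int, int]] = {
    "wbuf_capacity_kb": (16, 256),
    "ibuf_capacity_kb": (16, 128),
    "obuf_capacity_kb": (32, 512),
    "offchip_bandwidth_bits": (64, 512),
}

# One prime base per dimension, as required by the Halton construction.
PRIMES = [2, 3, 5, 7]


def radical_inverse(index: int, base: int) -> float:
    """Return the van der Corput radical inverse of index, a value in [0, 1)."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_samples(space: Dict[str, Tuple[int, int]], n: int) -> List[Dict[str, int]]:
    """Draw n Halton points and map each coordinate onto its integer range."""
    names = list(space)
    samples = []
    for i in range(1, n + 1):  # index 0 maps to the all-zero point, so skip it
        point = {}
        for name, base in zip(names, PRIMES):
            lo, hi = space[name]
            u = radical_inverse(i, base)
            # Scale [0, 1) onto the hi - lo + 1 integers; clamp for safety.
            point[name] = min(hi, lo + int(u * (hi - lo + 1)))
        samples.append(point)
    return samples


if __name__ == "__main__":
    for config in halton_samples(VTA_SPACE, 4):
        print(config)
```

Because the sequence is deterministic, repeated runs return the same configurations, which makes a sampled dataset easy to regenerate.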
7.2 Dataset Separation
7.3 Model Training
Model | Parameter | Type | Range | Description |
---|---|---|---|---|
GBDT | n_estimator | integer | [20–500] | # gradient boosted trees |
GBDT | max_depth | integer | [2–20] | maximum tree depth |
RF | n_estimator | integer | [50–1000] | # decision trees in the forest |
RF | mtries | enum | [1–total feature count] | # features considered for best split |
RF | max_depth | integer | [5–100] | maximum tree depth |
ANN | num_layer | integer | [3–9] | # hidden layers |
ANN | num_node | enum | [8, 16, 32] | nodeCount input in Algorithm 2 |
ANN | act_func | enum | [Tanh, Rectifier, Maxout] | activation function |
GCN | conv_layer | enum | [GraphConv, GCNConv] | type of graph convolutional layer |
GCN | num_conv_layer | integer | [2–6] | # convolutional layers |
GCN | num_fc_layer | integer | [2–9] | # fully connected layers |
GCN | batch_size | integer | [16, 32, 64] | training batch size |
GCN | lr | float | [\(10^{-2}\)–\(10^{-5}\)] | learning rate |
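
To make the search ranges in the table above concrete, the sketch below draws random hyperparameter candidates for the GBDT and GCN rows. Random search and the log-uniform draw for `lr` are assumptions of this example, since the outline does not say which tuning strategy is used. The function names are hypothetical, and the dictionary keys simply mirror the parameter names in the table.

```python
import random
from typing import Any, Dict


def sample_gbdt(rng: random.Random) -> Dict[str, Any]:
    """Draw one GBDT candidate from the ranges listed in the table."""
    return {
        "n_estimator": rng.randint(20, 500),  # randint bounds are inclusive
        "max_depth": rng.randint(2, 20),
    }


def sample_gcn(rng: random.Random) -> Dict[str, Any]:
    """Draw one GCN candidate from the ranges listed in the table."""
    return {
        "conv_layer": rng.choice(["GraphConv", "GCNConv"]),
        "num_conv_layer": rng.randint(2, 6),
        "num_fc_layer": rng.randint(2, 9),
        "batch_size": rng.choice([16, 32, 64]),
        # Assumption: sample the learning rate log-uniformly in [1e-5, 1e-2].
        "lr": 10 ** rng.uniform(-5, -2),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the candidates are reproducible
    print(sample_gbdt(rng))
    print(sample_gcn(rng))
```

Each candidate would then be trained and scored on a validation split; that step is omitted here because it depends on the ML library in use.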
8 Experimental Results
8.1 Assessment of Sampling Methods and Sample Sizes
Sampling Method | Sample Size | ML Model | Power \(\mu APE\) | Power STD APE | Power MAPE | System-Energy \(\mu APE\) | System-Energy STD APE | System-Energy MAPE |
---|---|---|---|---|---|---|---|---|
LHS | 16 | GBDT | 20.37 | 13.14 | 84.16 | 32.65 | 25.68 | 88.69 |
LHS | 16 | RF | 17.63 | 10.01 | 57.00 | 36.58 | 23.28 | 78.84 |
LHS | 16 | ANN | 2.67 | 1.31 | 23.02 | 4.36 | 2.95 | 28.13 |
LHS | 16 | GCN | 2.96 | 0.45 | 15.89 | 3.06 | 0.99 | 13.92 |
LHS | 24 | GBDT | 13.20 | 8.07 | 58.63 | 15.25 | 9.92 | 65.02 |
LHS | 24 | RF | 14.76 | 12.29 | 84.38 | 14.38 | 12.07 | 68.17 |
LHS | 24 | ANN | 1.80 | 0.54 | 15.83 | 3.44 | 2.26 | 21.73 |
LHS | 24 | GCN | 3.00 | 0.91 | 15.38 | 2.71 | 0.74 | 16.21 |
LHS | 32 | GBDT | 13.61 | 6.07 | 59.70 | 9.75 | 6.69 | 48.34 |
LHS | 32 | RF | 12.06 | 6.73 | 44.10 | 21.88 | 20.00 | 84.68 |
LHS | 32 | ANN | 2.03 | 0.70 | 13.24 | 4.00 | 3.65 | 31.51 |
LHS | 32 | GCN | 2.57 | 0.70 | 15.99 | 2.20 | 0.66 | 19.92 |
Sobol | 16 | GBDT | 18.02 | 15.64 | 72.16 | 28.19 | 16.12 | 82.50 |
Sobol | 16 | RF | 22.58 | 15.07 | 75.50 | 39.63 | 15.15 | 85.97 |
Sobol | 16 | ANN | 3.18 | 1.52 | 20.67 | 5.16 | 2.15 | 24.11 |
Sobol | 16 | GCN | 3.24 | 0.94 | 18.03 | 3.32 | 0.87 | 22.39 |
Sobol | 24 | GBDT | 14.31 | 10.65 | 80.92 | 34.14 | 33.93 | 99.20 |
Sobol | 24 | RF | 18.32 | 13.78 | 73.92 | 29.41 | 19.15 | 89.63 |
Sobol | 24 | ANN | 2.70 | 1.31 | 27.40 | 5.19 | 2.45 | 22.21 |
Sobol | 24 | GCN | 2.51 | 1.05 | 15.89 | 2.62 | 0.69 | 15.85 |
Sobol | 32 | GBDT | 14.95 | 9.98 | 37.90 | 21.45 | 16.52 | 34.89 |
Sobol | 32 | RF | 16.06 | 12.74 | 33.85 | 25.84 | 27.19 | 46.02 |
Sobol | 32 | ANN | 2.39 | 1.00 | 25.07 | 2.59 | 1.46 | 21.31 |
Sobol | 32 | GCN | 2.58 | 0.72 | 18.03 | 2.14 | 0.72 | 18.87 |
Halton | 16 | GBDT | 19.28 | 15.32 | 80.90 | 48.54 | 43.57 | 85.08 |
Halton | 16 | RF | 21.46 | 18.24 | 88.02 | 49.52 | 79.62 | 85.01 |
Halton | 16 | ANN | 4.07 | 2.57 | 37.25 | 9.22 | 8.23 | 60.09 |
Halton | 16 | GCN | 3.81 | 1.46 | 18.38 | 4.31 | 1.30 | 18.38 |
Halton | 24 | GBDT | 12.80 | 7.80 | 69.04 | 26.27 | 16.46 | 79.35 |
Halton | 24 | RF | 13.15 | 10.63 | 49.89 | 20.57 | 12.01 | 61.67 |
Halton | 24 | ANN | 1.94 | 0.57 | 13.31 | 3.01 | 2.50 | 32.13 |
Halton | 24 | GCN | 2.65 | 0.40 | 17.07 | 2.51 | 0.71 | 16.38 |
Halton | 32 | GBDT | 8.88 | 3.80 | 32.64 | 27.48 | 24.46 | 95.76 |
Halton | 32 | RF | 11.46 | 7.65 | 40.66 | 21.48 | 13.39 | 58.25 |
Halton | 32 | ANN | 2.27 | 0.59 | 21.03 | 2.44 | 1.08 | 18.07 |
Halton | 32 | GCN | 2.74 | 0.58 | 19.73 | 2.71 | 0.84 | 15.98 |
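
The three error columns in the table above can be computed from per-sample absolute percentage errors (APE). The sketch below assumes that \(\mu APE\) is the mean APE, STD APE is its standard deviation, and MAPE is the maximum APE; the last reading is an assumption, though it is consistent with the MAPE columns being far larger than the \(\mu APE\) columns. The use of the population standard deviation and all function names are also assumptions of this example.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def ape(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Per-sample absolute percentage error, in percent."""
    return [100.0 * abs(p - a) / abs(a) for a, p in zip(actual, predicted)]


def summarize(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return the mean, standard deviation, and maximum of the APE values."""
    errors = ape(actual, predicted)
    return {
        "mu_ape": mean(errors),
        "std_ape": pstdev(errors),  # assumption: population standard deviation
        "mape": max(errors),        # assumption: MAPE read as maximum APE
    }


if __name__ == "__main__":
    # Hypothetical ground-truth and predicted power values, for illustration only.
    truth = [100.0, 200.0, 50.0]
    guess = [110.0, 190.0, 50.0]
    print(summarize(truth, guess))
```

A model with a low mean but a high maximum, as in several GBDT and RF rows, is accurate on average yet occasionally far off, which is why both columns are worth reporting.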
8.2 ML Model Assessment
Design | ML Model | Performance \(\mu APE\) | Performance MAPE | Power \(\mu APE\) | Power MAPE | Area \(\mu APE\) | Area MAPE | System-Energy \(\mu APE\) | System-Energy MAPE | System-Runtime \(\mu APE\) | System-Runtime MAPE |
---|---|---|---|---|---|---|---|---|---|---|---|
TABLA GF12 | Ensemble | 2.82 | 11.00 | 2.28 | 9.51 | 1.25 | 6.33 | 0.93 | 3.53 | 3.84 | 14.55 |
TABLA GF12 | GCN | 2.75 | 11.56 | 2.18 | 8.94 | 0.54 | 5.64 | 0.93 | 5.39 | 3.03 | 11.82 |
GeneSys GF12 | Ensemble | 8.38 | 24.36 | 6.45 | 22.02 | 1.00 | 3.04 | 1.80 | 5.37 | 6.45 | 17.86 |
GeneSys GF12 | GCN | 6.00 | 20.59 | 7.28 | 15.81 | 0.49 | 1.30 | 1.80 | 5.11 | 5.83 | 15.51 |
VTA GF12 | Ensemble | 2.79 | 14.57 | 2.67 | 11.94 | 1.15 | 4.35 | 2.36 | 10.43 | 4.07 | 12.04 |
VTA GF12 | GCN | 2.16 | 12.21 | 2.18 | 7.77 | 0.66 | 4.02 | 2.46 | 6.92 | 2.31 | 8.53 |
Axiline GF12 | Ensemble | 0.70 | 8.50 | 2.44 | 28.53 | 1.46 | 20.99 | 9.15 | 95.32 | 1.05 | 8.30 |
Axiline GF12 | GCN | 3.06 | 49.65 | 1.52 | 22.69 | 1.82 | 16.09 | 2.68 | 37.83 | 1.39 | 25.56 |
Axiline NG45 | Ensemble | 3.15 | 23.61 | 7.68 | 54.21 | 1.39 | 8.81 | 8.91 | 75.83 | 5.16 | 31.31 |
Axiline NG45 | GCN | 4.74 | 36.25 | 5.19 | 29.98 | 3.03 | 13.48 | 4.97 | 25.07 | 4.59 | 55.06 |
Design | ML Model | Performance \(\mu APE\) | Performance MAPE | Power \(\mu APE\) | Power MAPE | Area \(\mu APE\) | Area MAPE | System-Energy \(\mu APE\) | System-Energy MAPE | System-Runtime \(\mu APE\) | System-Runtime MAPE |
---|---|---|---|---|---|---|---|---|---|---|---|
TABLA GF12 | Ensemble | 3.68 | 32.51 | 4.11 | 17.11 | 3.99 | 16.05 | 4.62 | 18.63 | 6.03 | 24.10 |
TABLA GF12 | GCN | 5.79 | 21.74 | 5.34 | 14.00 | 3.76 | 12.81 | 3.93 | 13.80 | 5.20 | 23.63 |
GeneSys GF12 | Ensemble | 6.32 | 14.82 | 7.26 | 15.23 | 2.75 | 8.09 | 11.96 | 19.78 | 6.28 | 20.27 |
GeneSys GF12 | GCN | 6.97 | 13.11 | 5.39 | 15.10 | 2.12 | 4.08 | 4.32 | 8.88 | 7.65 | 17.81 |
VTA GF12 | Ensemble | 2.99 | 12.58 | 11.19 | 28.99 | 6.65 | 17.01 | 9.10 | 18.41 | 2.87 | 10.50 |
VTA GF12 | GCN | 2.60 | 9.67 | 2.85 | 12.90 | 2.15 | 9.51 | 4.07 | 13.76 | 3.67 | 9.87 |
Axiline GF12 | Ensemble | 0.61 | 6.48 | 2.55 | 22.45 | 1.31 | 5.68 | 7.17 | 47.20 | 1.29 | 7.74 |
Axiline GF12 | GCN | 2.92 | 28.74 | 2.86 | 29.34 | 1.88 | 9.82 | 2.34 | 29.21 | 2.98 | 2971 |
Axiline NG45 | Ensemble | 3.33 | 27.29 | 5.97 | 48.34 | 2.81 | 17.25 | 7.21 | 49.81 | 4.75 | 22.63 |
Axiline NG45 | GCN | 4.57 | 35.68 | 5.55 | 36.38 | 3.77 | 16.26 | 12.88 | 86.2 | 5.85 | 51.21 |
Design | ML Model | Performance \(\mu APE\) | Performance MAPE | Power \(\mu APE\) | Power MAPE | Area \(\mu APE\) | Area MAPE | System-Energy \(\mu APE\) | System-Energy MAPE | System-Runtime \(\mu APE\) | System-Runtime MAPE |
---|---|---|---|---|---|---|---|---|---|---|---|
TABLA GF12 | GBDT | 3.09 | 14.02 | 2.88 | 11.94 | 0.78 | 3.02 | 1.22 | 5.15 | 3.44 | 13.28 |
TABLA GF12 | RF | 6.13 | 29.43 | 3.58 | 12.15 | 2.59 | 10.55 | 1.83 | 5.08 | 6.17 | 20.39 |
TABLA GF12 | ANN | 3.21 | 11.05 | 2.88 | 13.36 | 0.24 | 0.93 | 0.85 | 2.73 | 3.14 | 17.15 |
TABLA GF12 | Ensemble | 2.82 | 11.00 | 2.28 | 9.51 | 1.25 | 6.33 | 0.93 | 3.53 | 3.84 | 14.55 |
TABLA GF12 | GCN | 2.75 | 11.56 | 2.18 | 8.94 | 0.54 | 5.64 | 0.93 | 5.39 | 3.03 | 11.82 |
GeneSys GF12 | GBDT | 7.16 | 22.74 | 8.57 | 26.41 | 3.93 | 12.82 | 1.50 | 7.03 | 6.76 | 20.15 |
GeneSys GF12 | RF | 11.04 | 23.52 | 8.48 | 18.29 | 3.25 | 7.46 | 1.56 | 7.60 | 8.99 | 34.54 |
GeneSys GF12 | ANN | 6.50 | 15.49 | 5.26 | 17.93 | 0.84 | 2.27 | 2.80 | 7.80 | 6.40 | 18.46 |
GeneSys GF12 | Ensemble | 8.38 | 24.36 | 6.45 | 22.02 | 1.00 | 3.04 | 1.80 | 5.37 | 6.45 | 17.86 |
GeneSys GF12 | GCN | 6.00 | 20.59 | 7.28 | 15.81 | 0.49 | 1.30 | 1.80 | 5.11 | 5.83 | 15.51 |
VTA GF12 | GBDT | 2.75 | 13.38 | 2.84 | 12.75 | 1.90 | 13.14 | 1.89 | 8.27 | 2.84 | 12.18 |
VTA GF12 | RF | 5.67 | 35.31 | 4.57 | 28.02 | 2.66 | 12.64 | 2.07 | 7.58 | 4.68 | 24.74 |
VTA GF12 | ANN | 2.29 | 14.00 | 2.05 | 8.59 | 0.89 | 3.85 | 7.29 | 24.37 | 2.47 | 10.54 |
VTA GF12 | Ensemble | 2.79 | 14.57 | 2.67 | 11.94 | 1.15 | 4.35 | 2.36 | 10.43 | 4.07 | 12.04 |
VTA GF12 | GCN | 2.16 | 12.21 | 2.18 | 7.77 | 0.66 | 4.02 | 2.46 | 6.92 | 2.31 | 8.53 |
Axiline GF12 | GBDT | 0.77 | 5.24 | 2.20 | 14.70 | 2.74 | 13.59 | 1.34 | 12.31 | 1.15 | 8.87 |
Axiline GF12 | RF | 6.55 | 36.06 | 3.79 | 29.70 | 3.50 | 16.94 | 1.32 | 13.00 | 7.53 | 91.34 |
Axiline GF12 | ANN | 0.78 | 8.69 | 2.78 | 28.19 | 2.21 | 53.32 | 4.46 | 77.34 | 1.29 | 13.16 |
Axiline GF12 | Ensemble | 0.70 | 8.50 | 2.44 | 28.53 | 1.46 | 20.99 | 9.15 | 95.32 | 1.05 | 8.30 |
Axiline GF12 | GCN | 3.06 | 49.65 | 1.52 | 22.69 | 1.82 | 16.09 | 2.68 | 37.83 | 1.39 | 25.56 |
Axiline NG45 | GBDT | 3.56 | 22.73 | 7.01 | 33.61 | 2.60 | 12.62 | 6.43 | 51.06 | 3.95 | 36.78 |
Axiline NG45 | RF | 4.56 | 30.57 | 9.38 | 45.35 | 3.70 | 13.38 | 6.92 | 41.56 | 4.21 | 44.36 |
Axiline NG45 | ANN | 3.48 | 25.40 | 8.48 | 83.22 | 1.93 | 25.85 | 7.04 | 51.25 | 6.60 | 45.50 |
Axiline NG45 | Ensemble | 3.15 | 23.61 | 7.68 | 54.21 | 1.39 | 8.81 | 8.91 | 75.83 | 5.16 | 31.31 |
Axiline NG45 | GCN | 4.74 | 36.25 | 5.19 | 29.98 | 3.03 | 13.48 | 4.97 | 25.07 | 4.59 | 55.06 |
Design | ML Model | Performance \(\mu APE\) | Performance MAPE | Power \(\mu APE\) | Power MAPE | Area \(\mu APE\) | Area MAPE | System-Energy \(\mu APE\) | System-Energy MAPE | System-Runtime \(\mu APE\) | System-Runtime MAPE |
---|---|---|---|---|---|---|---|---|---|---|---|
TABLA GF12 | GBDT | 3.24 | 33.06 | 3.87 | 19.03 | 3.42 | 9.01 | 8.86 | 19.15 | 3.83 | 18.99 |
TABLA GF12 | RF | 5.16 | 38.01 | 9.76 | 46.62 | 11.25 | 40.13 | 10.59 | 21.30 | 5.67 | 27.41 |
TABLA GF12 | ANN | 5.78 | 32.70 | 5.22 | 20.02 | 2.30 | 5.70 | 2.97 | 10.25 | 6.02 | 24.94 |
TABLA GF12 | Ensemble | 3.68 | 32.51 | 4.11 | 17.11 | 3.99 | 16.05 | 4.62 | 18.63 | 6.03 | 24.10 |
TABLA GF12 | GCN | 5.79 | 21.74 | 5.34 | 14.00 | 3.76 | 12.81 | 3.93 | 13.80 | 5.20 | 23.63 |
GeneSys GF12 | GBDT | 6.54 | 20.86 | 5.82 | 15.99 | 2.94 | 8.52 | 6.37 | 11.65 | 12.23 | 28.41 |
GeneSys GF12 | RF | 8.73 | 22.57 | 9.34 | 18.66 | 3.69 | 9.57 | 14.89 | 22.00 | 17.37 | 34.38 |
GeneSys GF12 | ANN | 6.55 | 19.86 | 3.82 | 15.36 | 2.06 | 3.80 | 3.47 | 9.11 | 10.73 | 28.90 |
GeneSys GF12 | Ensemble | 6.32 | 14.82 | 7.26 | 15.23 | 2.75 | 8.09 | 11.96 | 19.78 | 6.28 | 20.27 |
GeneSys GF12 | GCN | 6.97 | 13.11 | 5.39 | 15.10 | 2.12 | 4.08 | 4.32 | 8.88 | 7.65 | 17.81 |
VTA GF12 | GBDT | 4.96 | 19.31 | 4.04 | 11.83 | 24.74 | 38.07 | 7.58 | 18.79 | 6.94 | 20.39 |
VTA GF12 | RF | 3.00 | 9.81 | 12.96 | 33.34 | 18.05 | 53.58 | 7.15 | 16.43 | 5.67 | 13.52 |
VTA GF12 | ANN | 2.52 | 14.09 | 3.08 | 11.84 | 2.19 | 6.66 | 10.61 | 22.16 | 4.39 | 12.70 |
VTA GF12 | Ensemble | 2.99 | 12.58 | 11.19 | 28.99 | 6.65 | 17.01 | 9.10 | 18.41 | 2.87 | 10.50 |
VTA GF12 | GCN | 2.60 | 9.67 | 2.85 | 12.90 | 2.15 | 9.51 | 4.07 | 13.76 | 3.67 | 9.87 |
Axiline GF12 | GBDT | 0.62 | 7.18 | 11.53 | 74.19 | 10.29 | 41.78 | 16.61 | 82.95 | 2.19 | 19.83 |
Axiline GF12 | RF | 0.63 | 5.41 | 15.95 | 77.38 | 13.24 | 57.18 | 21.8 | 90.34 | 2.12 | 12.86 |
Axiline GF12 | ANN | 0.72 | 8.64 | 2.24 | 21.98 | 1.20 | 7.85 | 4.24 | 29.85 | 1.08 | 9.72 |
Axiline GF12 | Ensemble | 0.61 | 6.48 | 2.55 | 22.45 | 1.31 | 5.68 | 7.17 | 47.20 | 1.29 | 7.74 |
Axiline GF12 | GCN | 2.92 | 28.74 | 2.86 | 29.34 | 1.88 | 9.82 | 2.34 | 29.21 | 2.98 | 2971 |
Axiline NG45 | GBDT | 3.45 | 27.48 | 5.98 | 56.62 | 2.79 | 13.40 | 23.33 | 90.79 | 5.54 | 30.13 |
Axiline NG45 | RF | 3.18 | 26.96 | 6.30 | 41.31 | 3.00 | 16.90 | 21.54 | 93.74 | 5.77 | 35.33 |
Axiline NG45 | ANN | 3.37 | 27.19 | 6.57 | 59.76 | 1.86 | 13.81 | 9.30 | 77.10 | 5.04 | 25.56 |
Axiline NG45 | Ensemble | 3.33 | 27.29 | 5.97 | 48.34 | 2.81 | 17.25 | 7.21 | 49.81 | 4.75 | 22.63 |
Axiline NG45 | GCN | 4.57 | 35.68 | 5.55 | 36.38 | 3.77 | 16.26 | 12.88 | 86.2 | 5.85 | 51.21 |
8.3 Effect of Limited Training Dataset
8.4 DSE with Trained ML Models
9 Conclusion
Acknowledgments
Footnotes
Appendix
A ML Model Assessment
References
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Research-article
Funding Sources
- NSF (National Science Foundation)
- Defense Advanced Research Projects Agency