research-article

Deep Learning Inferencing with High-performance Hardware Accelerators

Authors:

Luke Kljucaric,

Alan D. GeorgeAuthors Info & Claims

ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 4

Article No.: 68, Pages 1 - 25

https://doi.org/10.1145/3594221

Published: 15 June 2023 Publication History

Get Access

Abstract

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, there are many apps that are not well represented by these standards that require different workloads, such as ML models and datasets, to achieve similar goals. Additionally, many apps, like real-time video processing, are focused on latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Considering that these models are composed of fundamental neural network operations yet architecturally different from each other, these models can stress devices in different yet insightful ways that generalizations of the performance of other models can be drawn from. Many devices featuring ML-specific hardware and optimizations are analyzed including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, ML-oriented hardware added to the Intel Xeon CPUs helps to boost throughput by 3.7× and to reduce latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in terms of throughput due to large data transfer times and not competitive in terms of latency. The FPGA frameworks showcase the lowest latency on the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs are able to keep critical model data closer to processing elements for lower latency. The massively parallel and high-memory GPU devices with Tensor Core accelerators achieve the best throughput. The NVIDIA Tesla A100 GPU showcases the highest throughput at 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively.

References

[1]

T. Aarrestad, V. Loncar, et al. 2021. Fast convolutional neural networks on FPGAs with hls4ml. Machine Learning: Science and Technology 2, 4 (2021), 1–25.

Abstract

References

Cited By

Index Terms

Recommendations

Modeling and predicting performance of high performance computing applications on hardware accelerators

Comparing Hardware Accelerators in Scientific Applications: A Case Study

Modeling and Predicting Performance of High Performance Computing Applications on Hardware Accelerators

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Get Access

Login options

Full Access

View options

PDF

eReader

Full Text

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations