DOI: 10.1145/3466752.3480057

Equinox: Training (for Free) on a Custom Inference Accelerator

Published: 17 October 2021

Abstract

DNN inference accelerators executing online services exhibit low average loads because of service demand variability, leading to poor resource utilization. Unfortunately, reclaiming idle inference cycles is difficult because other workloads cannot execute on a custom accelerator. With recent proposals for the use of fixed-point arithmetic in training, there are opportunities for training services to piggyback on inference accelerators. We observe that a key challenge in doing so is maintaining service-level latency constraints for inference. We show that relaxing latency constraints in an inference accelerator with batching-optimized ALU arrays achieves near-optimal throughput for a given area and power envelope while meeting inference services’ tail latency goals.
We present Equinox, a custom inference accelerator designed to piggyback training. Equinox employs a uniform arithmetic encoding to accommodate both inference and training, and a priority hardware scheduler with adaptive batching that interleaves training during idle inference cycles. For a 500 μs inference service time constraint, Equinox achieves 6.67× higher throughput than a latency-optimal inference accelerator. Despite not being optimized for training services, Equinox achieves up to 78% of the throughput of a dedicated training accelerator that saturates the available compute resources and DRAM bandwidth. Finally, Equinox’s controller logic incurs less than 1% power and area overhead, while the uniform encoding (to enable training) incurs 13% power and 4% area overhead compared to a fixed-point inference accelerator.
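
To make the two mechanisms the abstract names concrete, the sketch below shows adaptive batching under a 500 μs budget and inference-priority scheduling that fills idle cycles with training work. This is a minimal software analogue, not the paper’s hardware design: the linear latency model, the constants, and all function and queue names are assumptions made purely for illustration.

from collections import deque

SLO_US = 500.0        # inference service-time constraint from the abstract
FIXED_COST_US = 50.0  # assumed per-batch setup cost
PER_ITEM_US = 8.0     # assumed per-request compute cost

def batch_latency_us(batch_size):
    """Assumed linear latency model for one batched inference pass."""
    return FIXED_COST_US + PER_ITEM_US * batch_size

def pick_batch_size(inference_q, now_us, max_batch=64):
    """Adaptive batching: the largest batch whose completion still meets
    the SLO of the oldest (most urgent) queued request."""
    if not inference_q:
        return 0
    oldest_arrival_us, _ = inference_q[0]  # queue holds (arrival, request)
    budget_us = SLO_US - (now_us - oldest_arrival_us)
    best = 0
    for b in range(1, min(max_batch, len(inference_q)) + 1):
        if batch_latency_us(b) <= budget_us:
            best = b
    return best

def schedule_step(inference_q, training_q, now_us):
    """Priority scheduling: inference always wins; a training minibatch
    is dispatched only when no inference work is pending."""
    batch = pick_batch_size(inference_q, now_us)
    if batch == 0 and inference_q:
        batch = 1  # assumed policy: a request past its budget still runs
    if batch > 0:
        requests = [inference_q.popleft()[1] for _ in range(batch)]
        return ("inference", requests)
    if training_q:
        return ("training", training_q.popleft())
    return ("idle", None)

The point the sketch mirrors is that training never delays inference: a training step is issued only when the inference queue is empty, which is how idle cycles are reclaimed without violating tail latency goals.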




Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. DNN accelerators
  2. DNN inference
  3. systolic arrays

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Swiss National Science Foundation

Conference

MICRO '21

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%



Cited By

  • (2024) Reinforcement Learning for Selecting Custom Instructions Under Area Constraint. IEEE Transactions on Artificial Intelligence 5(4), 1882–1894. https://doi.org/10.1109/TAI.2023.3308099 (Apr 2024)
  • (2024) Guarding Deep Learning Systems With Boosted Evasion Attack Detection and Model Update. IEEE Internet of Things Journal 11(6), 9382–9391. https://doi.org/10.1109/JIOT.2023.3324568 (15 Mar 2024)
  • (2024) Automating application-driven customization of ASIPs. Journal of Systems Architecture 148. https://doi.org/10.1016/j.sysarc.2024.103080 (2 Jul 2024)
  • (2023) Scale-out Systolic Arrays. ACM Transactions on Architecture and Code Optimization 20(2), 1–25. https://doi.org/10.1145/3572917 (1 Mar 2023)
  • (2023) GERALT: Real-time Detection of Evasion Attacks in Deep Learning Systems. 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5. https://doi.org/10.1109/ISCAS46773.2023.10181921 (21 May 2023)
  • (2023) INCA: Input-stationary Dataflow at Outside-the-box Thinking about Deep Learning Accelerators. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 29–41. https://doi.org/10.1109/HPCA56546.2023.10070992 (Feb 2023)
  • (2023) Nacc-Guard: a lightweight DNN accelerator architecture for secure deep learning. The Journal of Supercomputing 80(5), 5815–5831. https://doi.org/10.1007/s11227-023-05671-9 (7 Oct 2023)
