DOI: 10.1145/3466752.3480057

Equinox: Training (for Free) on a Custom Inference Accelerator

Published: 17 October 2021

Abstract

DNN inference accelerators executing online services exhibit low average loads because of service demand variability, leading to poor resource utilization. Unfortunately, reclaiming idle inference cycles is difficult because other workloads cannot execute on a custom accelerator. With recent proposals for the use of fixed-point arithmetic in training, there are opportunities for training services to piggyback on inference accelerators. We observe that a key challenge in doing so is maintaining service-level latency constraints for inference. We show that relaxing latency constraints in an inference accelerator with batching-optimized ALU arrays achieves near-optimal throughput for a given area and power envelope while meeting inference services’ tail latency goals.
We present Equinox, a custom inference accelerator designed to piggyback training. Equinox employs a uniform arithmetic encoding to accommodate both inference and training, and a priority hardware scheduler with adaptive batching that interleaves training during idle inference cycles. For a 500 μs inference service time constraint, Equinox achieves 6.67× higher throughput than a latency-optimal inference accelerator. Despite not being optimized for training services, Equinox achieves up to 78% of the throughput of a dedicated training accelerator that saturates the available compute resources and DRAM bandwidth. Finally, Equinox’s controller logic incurs less than 1% power and area overhead, while the uniform encoding (to enable training) incurs 13% power and 4% area overhead compared to a fixed-point inference accelerator.
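
To make the two mechanisms the abstract names concrete, the sketch below shows adaptive batching under a 500 μs budget and inference-priority scheduling that fills idle cycles with training work. This is a minimal software analogue, not the paper’s hardware design: the linear latency model, the constants, and all function and queue names are assumptions made purely for illustration.

from collections import deque

SLO_US = 500.0        # inference service-time constraint from the abstract
FIXED_COST_US = 50.0  # assumed per-batch setup cost
PER_ITEM_US = 8.0     # assumed per-request compute cost

def batch_latency_us(batch_size):
    """Assumed linear latency model for one batched inference pass."""
    return FIXED_COST_US + PER_ITEM_US * batch_size

def pick_batch_size(inference_q, now_us, max_batch=64):
    """Adaptive batching: the largest batch whose completion still meets
    the SLO of the oldest (most urgent) queued request."""
    if not inference_q:
        return 0
    oldest_arrival_us, _ = inference_q[0]  # queue holds (arrival, request)
    budget_us = SLO_US - (now_us - oldest_arrival_us)
    best = 0
    for b in range(1, min(max_batch, len(inference_q)) + 1):
        if batch_latency_us(b) <= budget_us:
            best = b
    return best

def schedule_step(inference_q, training_q, now_us):
    """Priority scheduling: inference always wins; a training minibatch
    is dispatched only when no inference work is pending."""
    batch = pick_batch_size(inference_q, now_us)
    if batch == 0 and inference_q:
        batch = 1  # assumed policy: a request past its budget still runs
    if batch > 0:
        requests = [inference_q.popleft()[1] for _ in range(batch)]
        return ("inference", requests)
    if training_q:
        return ("training", training_q.popleft())
    return ("idle", None)

The point the sketch mirrors is that training never delays inference: a training step is issued only when the inference queue is empty, which is how idle cycles are reclaimed without violating tail latency goals.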




Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. DNN accelerators
  2. DNN inference
  3. systolic arrays

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Swiss National Science Foundation

Conference

MICRO '21

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%



Cited By

  • (2024) Reinforcement Learning for Selecting Custom Instructions Under Area Constraint. IEEE Transactions on Artificial Intelligence 5(4), 1882–1894. https://doi.org/10.1109/TAI.2023.3308099 (Apr 2024)
  • (2024) Guarding Deep Learning Systems With Boosted Evasion Attack Detection and Model Update. IEEE Internet of Things Journal 11(6), 9382–9391. https://doi.org/10.1109/JIOT.2023.3324568 (15 Mar 2024)
  • (2024) Automating application-driven customization of ASIPs. Journal of Systems Architecture 148. https://doi.org/10.1016/j.sysarc.2024.103080 (2 Jul 2024)
  • (2023) Scale-out Systolic Arrays. ACM Transactions on Architecture and Code Optimization 20(2), 1–25. https://doi.org/10.1145/3572917 (1 Mar 2023)
  • (2023) GERALT: Real-time Detection of Evasion Attacks in Deep Learning Systems. 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5. https://doi.org/10.1109/ISCAS46773.2023.10181921 (21 May 2023)
  • (2023) INCA: Input-stationary Dataflow at Outside-the-box Thinking about Deep Learning Accelerators. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 29–41. https://doi.org/10.1109/HPCA56546.2023.10070992 (Feb 2023)
  • (2023) Nacc-Guard: a lightweight DNN accelerator architecture for secure deep learning. The Journal of Supercomputing 80(5), 5815–5831. https://doi.org/10.1007/s11227-023-05671-9 (7 Oct 2023)
