
Simba: scaling deep-learning inference with chiplet-based architecture

Published: 24 May 2021

Abstract

Package-level integration using multi-chip modules (MCMs) is a promising approach to building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of large, coarse-grained chiplets because of the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep-learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS of peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve a speedup of up to 16% over the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
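To make the flexible layer mapping concrete, here is a minimal sketch of the kind of spatial tiling the abstract describes: one convolution layer's work is partitioned across a 6x6 chiplet grid by splitting output channels across grid rows and feature-map columns across grid columns. This is an illustration only, not the authors' mapper; the ConvLayer and tile_layer names, the partitioning scheme, and the example layer shape are assumptions.

```python
# Hypothetical sketch of tiling one DNN layer across a 6x6 chiplet grid
# (illustrative only; not Simba's actual mapping software).
from dataclasses import dataclass
from itertools import product

@dataclass
class ConvLayer:
    in_channels: int
    out_channels: int
    height: int
    width: int

def tile_layer(layer: ConvLayer, grid_rows: int = 6, grid_cols: int = 6):
    """Assign each chiplet (r, c) a slice of output channels and a slice of
    feature-map columns. A slice may be empty if a dimension is smaller than
    the grid, leaving that chiplet idle for this layer."""
    k_per_row = -(-layer.out_channels // grid_rows)  # ceiling division
    w_per_col = -(-layer.width // grid_cols)
    tiles = {}
    for r, c in product(range(grid_rows), range(grid_cols)):
        k_lo = min(r * k_per_row, layer.out_channels)
        k_hi = min(k_lo + k_per_row, layer.out_channels)
        w_lo = min(c * w_per_col, layer.width)
        w_hi = min(w_lo + w_per_col, layer.width)
        tiles[(r, c)] = {"out_channels": (k_lo, k_hi), "columns": (w_lo, w_hi)}
    return tiles

if __name__ == "__main__":
    # A ResNet-50-style layer: 256 output channels over a 14x14 feature map.
    layer = ConvLayer(in_channels=256, out_channels=256, height=14, width=14)
    tiles = tile_layer(layer)
    print(len(tiles), "chiplet slices; chiplet (0, 0) gets", tiles[(0, 0)])
```

Changing how each loop dimension is split across the grid trades replicated input traffic against inter-chiplet partial-sum traffic, which is the kind of data-locality trade-off the paper's tiling optimizations target.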

        Published In

        Communications of the ACM  Volume 64, Issue 6
        June 2021
        106 pages
        ISSN:0001-0782
        EISSN:1557-7317
        DOI:10.1145/3467845
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Funding Sources

        • DARPA
