
Simba: scaling deep-learning inference with chiplet-based architecture

Published: 24 May 2021

Abstract

Package-level integration using multi-chip modules (MCMs) is a promising approach to building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of large, coarse-grained chiplets because of the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep-learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS of peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve a speedup of up to 16% over the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
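To make the flexible layer mapping concrete, here is a minimal sketch of the kind of spatial tiling the abstract describes: one convolution layer's work is partitioned across a 6x6 chiplet grid by splitting output channels across grid rows and feature-map columns across grid columns. This is an illustration only, not the authors' mapper; the ConvLayer and tile_layer names, the partitioning scheme, and the example layer shape are assumptions.

```python
# Hypothetical sketch of tiling one DNN layer across a 6x6 chiplet grid
# (illustrative only; not Simba's actual mapping software).
from dataclasses import dataclass
from itertools import product

@dataclass
class ConvLayer:
    in_channels: int
    out_channels: int
    height: int
    width: int

def tile_layer(layer: ConvLayer, grid_rows: int = 6, grid_cols: int = 6):
    """Assign each chiplet (r, c) a slice of output channels and a slice of
    feature-map columns. A slice may be empty if a dimension is smaller than
    the grid, leaving that chiplet idle for this layer."""
    k_per_row = -(-layer.out_channels // grid_rows)  # ceiling division
    w_per_col = -(-layer.width // grid_cols)
    tiles = {}
    for r, c in product(range(grid_rows), range(grid_cols)):
        k_lo = min(r * k_per_row, layer.out_channels)
        k_hi = min(k_lo + k_per_row, layer.out_channels)
        w_lo = min(c * w_per_col, layer.width)
        w_hi = min(w_lo + w_per_col, layer.width)
        tiles[(r, c)] = {"out_channels": (k_lo, k_hi), "columns": (w_lo, w_hi)}
    return tiles

if __name__ == "__main__":
    # A ResNet-50-style layer: 256 output channels over a 14x14 feature map.
    layer = ConvLayer(in_channels=256, out_channels=256, height=14, width=14)
    tiles = tile_layer(layer)
    print(len(tiles), "chiplet slices; chiplet (0, 0) gets", tiles[(0, 0)])
```

Changing how each loop dimension is split across the grid trades replicated input traffic against inter-chiplet partial-sum traffic, which is the kind of data-locality trade-off the paper's tiling optimizations target.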

        Published In

        Communications of the ACM  Volume 64, Issue 6
        June 2021
        106 pages
        ISSN:0001-0782
        EISSN:1557-7317
        DOI:10.1145/3467845
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Funding Sources

        • DARPA
