
SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks

Published: 17 September 2021

Abstract

In-memory computing (IMC) on a monolithic chip for deep learning faces severe challenges in area, yield, and on-chip interconnection cost due to ever-increasing model sizes. 2.5D integration, or chiplet-based architecture, interconnects multiple small chips (i.e., chiplets) to form a large computing system, presenting a feasible path beyond monolithic IMC architectures for accelerating large deep learning models. This paper presents a new benchmarking simulator, SIAM, to evaluate the performance of chiplet-based IMC architectures and explore the potential of such a paradigm shift in IMC architecture design. SIAM integrates device, circuit, architecture, network-on-chip (NoC), network-on-package (NoP), and DRAM access models to realize an end-to-end system. SIAM is scalable in its support of a wide range of deep neural networks (DNNs), customizable to various network structures and configurations, and capable of efficient design space exploration. We demonstrate the flexibility, scalability, and simulation speed of SIAM by benchmarking state-of-the-art DNNs on the CIFAR-10, CIFAR-100, and ImageNet datasets. We further calibrate the simulation results against a published silicon result, SIMBA. The chiplet-based IMC architecture obtained through SIAM shows 130× and 72× improvements in energy efficiency for ResNet-50 on the ImageNet dataset compared to Nvidia V100 and T4 GPUs, respectively.
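The yield argument in the abstract can be illustrated with the standard Poisson die-yield model, Y = exp(-A·D0): a defect kills only one small chiplet rather than an entire monolithic die. This is a minimal sketch; the defect density and die areas below are illustrative assumptions, not figures from SIAM.

```python
import math

def die_yield(area_mm2: float, defect_density_per_mm2: float) -> float:
    """Poisson yield model: probability that a die of the given area is defect-free."""
    return math.exp(-area_mm2 * defect_density_per_mm2)

D0 = 0.001  # defects per mm^2 (assumed, for illustration only)

# Compare one monolithic 800 mm^2 IMC die with sixteen 50 mm^2 chiplets.
monolithic = die_yield(800, D0)
chiplet = die_yield(50, D0)

print(f"monolithic die yield: {monolithic:.3f}")  # ~0.449
print(f"per-chiplet yield:    {chiplet:.3f}")     # ~0.951
```

Under this model, good silicon per wafer improves sharply as the die is disaggregated, which is one reason chiplet-based integration scales better than a monolithic IMC chip as model sizes grow.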




Published In

ACM Transactions on Embedded Computing Systems, Volume 20, Issue 5s
Special Issue ESWEEK 2021, CASES 2021, CODES+ISSS 2021 and EMSOFT 2021
October 2021, 1367 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3481713
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 17 September 2021
Accepted: 01 July 2021
Revised: 01 June 2021
Received: 01 April 2021
Published in TECS Volume 20, Issue 5s


Author Tags

  1. Chiplet architecture
  2. in-memory compute
  3. DNN acceleration
  4. IMC benchmarking
  5. network-on-chip
  6. network-on-package

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Semiconductor Research Corporation (SRC)


Cited By

  • (2024) Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1–6. DOI: 10.23919/DATE58400.2024.10546730
  • (2024) TEFLON: Thermally Efficient Dataflow-aware 3D NoC for Accelerating CNN Inferencing on Manycore PIM Architectures. ACM Transactions on Embedded Computing Systems 23, 5, 1–23. DOI: 10.1145/3665279
  • (2024) Thermal Modeling and Management Challenges in Heterogenous Integration: 2.5D Chiplet Platforms and Beyond. 2024 IEEE 42nd VLSI Test Symposium (VTS), 1–4. DOI: 10.1109/VTS60656.2024.10538578
  • (2024) INDM: Chiplet-Based Interconnect Network and Dataflow Mapping for DNN Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 4, 1107–1120. DOI: 10.1109/TCAD.2023.3332832
  • (2024) SHIFFT: A Scalable Hybrid In-Memory Computing FFT Accelerator. 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 130–135. DOI: 10.1109/ISVLSI61997.2024.00034
  • (2024) Challenges and Opportunities to Enable Large-Scale Computing via Heterogeneous Chiplets. Proceedings of the 29th Asia and South Pacific Design Automation Conference, 765–770. DOI: 10.1109/ASP-DAC58780.2024.10473961
  • (2024) Neural architecture search for in-memory computing-based deep learning accelerators. Nature Reviews Electrical Engineering 1, 6, 374–390. DOI: 10.1038/s44287-024-00052-7
  • (2024) Review of chiplet-based design: system architecture and interconnection. Science China Information Sciences 67, 10. DOI: 10.1007/s11432-023-3926-8
  • (2023) End-to-End Benchmarking of Chiplet-Based In-Memory Computing. Neuromorphic Computing. DOI: 10.5772/intechopen.111926
  • (2023) Investigation of the Temperature Dependence of Volt-Ampere Characteristics of a Thin-Film Si3N4 Memristor. Crystals 13, 2, 323. DOI: 10.3390/cryst13020323
