DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems

Published: 19 March 2023

Abstract

Convolutional neural networks (CNNs) are widely used in image processing, natural language processing, and many other fields. The large volume of memory accesses required by CNNs is a major concern in CNN accelerator design, as it strongly influences performance and energy efficiency. With fast, low-cost memory access, Processing-In-Memory (PIM) systems are a feasible way to alleviate this memory concern. However, the distributed manner in which PIM systems store data conflicts with the large amount of data reuse in CNN layers: PIM nodes may need to share data with one another before processing a CNN layer, which introduces extra communication overhead. In this article, we propose DDAM, which maps CNNs onto PIM systems with reduced communication overhead. First, a data transfer strategy handles the data-sharing requirement among PIM nodes by formulating it as a Traveling Salesman Problem (TSP). To improve data locality, a dynamic programming algorithm partitions the CNN and allocates a number of nodes to each part. Finally, an integer linear programming (ILP)-based mapping algorithm maps the partitioned CNN onto the PIM system. Experimental results show that, compared to the baselines, DDAM achieves 2.0× higher throughput while reducing energy cost by 37% on average.
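
The abstract names three optimization steps (a TSP-based transfer order, dynamic-programming partitioning, and ILP mapping) without detail. As a rough, non-authoritative illustration of the first step only, the Python sketch below orders inter-node data forwarding as a tour over PIM nodes; the 4-node cost matrix and the nearest-neighbor heuristic are assumptions made for illustration, not the authors' formulation or solver.

    # Illustrative sketch: ordering inter-node data forwarding as a TSP tour.
    # The cost matrix and the greedy heuristic below are assumptions for
    # illustration only; they do not reproduce the paper's actual method.

    def nearest_neighbor_tour(cost, start=0):
        """Greedy TSP heuristic: always visit the cheapest unvisited node next."""
        n = len(cost)
        tour = [start]
        unvisited = set(range(n)) - {start}
        while unvisited:
            last = tour[-1]
            nxt = min(unvisited, key=lambda j: cost[last][j])
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    # Hypothetical 4-node PIM system: cost[i][j] is the volume of data that
    # node i would need to forward to node j before a layer starts.
    cost = [
        [0, 3, 8, 5],
        [3, 0, 2, 7],
        [8, 2, 0, 4],
        [5, 7, 4, 0],
    ]
    print(nearest_neighbor_tour(cost))  # one possible visiting order, e.g. [0, 1, 2, 3]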

Cited By

  • (2024) Load Balanced PIM-Based Graph Processing. ACM Transactions on Design Automation of Electronic Systems 29, 4 (2024), 1–22. DOI: 10.1145/3659951. Online publication date: 21-Jun-2024.
  • (2024) NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators With 3-D Stacked-DRAM. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 5 (2024), 1456–1469. DOI: 10.1109/TCAD.2023.3342605. Online publication date: May-2024.
  • (2024) ILP-based Multi-Branch CNNs Mapping on Processing-in-Memory Architecture. In 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), 179–183. DOI: 10.1109/AICAS59952.2024.10595921. Online publication date: 22-Apr-2024.
  • (2023) PIM-trie: A Skew-resistant Trie for Processing-in-Memory. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, 1–14. DOI: 10.1145/3558481.3591070. Online publication date: 17-Jun-2023.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 March 2023
    Online AM: 15 December 2022
    Accepted: 30 November 2022
    Revised: 13 September 2022
    Received: 13 May 2022
    Published in TODAES Volume 28, Issue 3

    Author Tags

    1. Convolutional neural networks
    2. Processing-In-Memory

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China (NSFC)
    • CAS Project for Young Scientists in Basic Research
    • Strategic Priority Research Program of Chinese Academy of Sciences
