DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems

Published: 19 March 2023

Abstract

Convolutional neural networks (CNNs) are widely used in image processing, natural language processing, and many other fields. The large volume of memory accesses required by CNNs is a major concern in CNN accelerator design, as it strongly influences performance and energy efficiency. With fast, low-cost memory access, Processing-In-Memory (PIM) systems are a feasible way to alleviate this memory concern. However, the distributed manner in which PIM systems store data conflicts with the large amount of data reuse in CNN layers: PIM nodes may need to share data with one another before processing a CNN layer, which introduces extra communication overhead. In this article, we propose DDAM, which maps CNNs onto PIM systems with reduced communication overhead. First, a data transfer strategy handles the data-sharing requirement among PIM nodes by formulating it as a Traveling Salesman Problem (TSP). To improve data locality, a dynamic programming algorithm partitions the CNN and allocates a number of nodes to each part. Finally, an integer linear programming (ILP)-based mapping algorithm maps the partitioned CNN onto the PIM system. Experimental results show that, compared to the baselines, DDAM achieves 2.0× higher throughput while reducing energy cost by 37% on average.
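
The abstract names three optimization steps (a TSP-based transfer order, dynamic-programming partitioning, and ILP mapping) without detail. As a rough, non-authoritative illustration of the first step only, the Python sketch below orders inter-node data forwarding as a tour over PIM nodes; the 4-node cost matrix and the nearest-neighbor heuristic are assumptions made for illustration, not the authors' formulation or solver.

    # Illustrative sketch: ordering inter-node data forwarding as a TSP tour.
    # The cost matrix and the greedy heuristic below are assumptions for
    # illustration only; they do not reproduce the paper's actual method.

    def nearest_neighbor_tour(cost, start=0):
        """Greedy TSP heuristic: always visit the cheapest unvisited node next."""
        n = len(cost)
        tour = [start]
        unvisited = set(range(n)) - {start}
        while unvisited:
            last = tour[-1]
            nxt = min(unvisited, key=lambda j: cost[last][j])
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    # Hypothetical 4-node PIM system: cost[i][j] is the volume of data that
    # node i would need to forward to node j before a layer starts.
    cost = [
        [0, 3, 8, 5],
        [3, 0, 2, 7],
        [8, 2, 0, 4],
        [5, 7, 4, 0],
    ]
    print(nearest_neighbor_tour(cost))  # one possible visiting order, e.g. [0, 1, 2, 3]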

Cited By

  • (2024) Load Balanced PIM-Based Graph Processing. ACM Transactions on Design Automation of Electronic Systems 29, 4 (2024), 1–22. DOI: 10.1145/3659951. Online publication date: 21-Jun-2024.
  • (2024) NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators With 3-D Stacked-DRAM. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 5 (2024), 1456–1469. DOI: 10.1109/TCAD.2023.3342605. Online publication date: May-2024.
  • (2024) ILP-based Multi-Branch CNNs Mapping on Processing-in-Memory Architecture. In 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), 179–183. DOI: 10.1109/AICAS59952.2024.10595921. Online publication date: 22-Apr-2024.
  • (2023) PIM-trie: A Skew-resistant Trie for Processing-in-Memory. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, 1–14. DOI: 10.1145/3558481.3591070. Online publication date: 17-Jun-2023.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 March 2023
    Online AM: 15 December 2022
    Accepted: 30 November 2022
    Revised: 13 September 2022
    Received: 13 May 2022
    Published in TODAES Volume 28, Issue 3

    Author Tags

    1. Convolutional neural networks
    2. Processing-In-Memory

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China (NSFC)
    • CAS Project for Young Scientists in Basic Research
    • Strategic Priority Research Program of Chinese Academy of Sciences
