DOI: 10.1145/3695794.3695795

PIM-Potential: Broadening the Acceleration Reach of PIM Architectures

Published: 11 December 2024

Abstract

Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in or near memory. As such, memory vendors have recently proposed commercially viable PIM designs. While efficient PIM orchestration requires navigating new constraints across the compute stack, these constraints can often be hidden from software and microarchitecture for highly regular workloads (e.g., common machine learning, or ML, primitives). However, they are not as easy to hide for workloads that exhibit certain types of irregularity. To extend PIM's reach to a broader range of workloads, these new constraints must be navigated at all levels of the compute stack.
In this work, we analyze the capabilities and constraints of a promising new class of commercial PIM architectures, and we describe the properties that make a workload amenable to acceleration on such a system. Next, we explore how limitations of these PIM designs, such as row activation overheads, lack of reuse benefit, and limited command bandwidth, can expose novel bottlenecks for some workloads. These workloads, termed PIM-potential workloads, have properties that deviate in limited ways from the identified amenability characteristics, yet they enjoy only minor performance gains from PIM. The exposed bottlenecks motivate targeted hardware and software optimizations (eager activation, increased register storage, selective PIM command issue, and increased command bandwidth) that mitigate these performance constraints and enable PIM acceleration for a wider range of workloads. We evaluate their impact on PIM-potential primitive acceleration and demonstrate that PIM can be applied more broadly than previously described if the emergent PIM bottlenecks are addressed. We argue that emerging PIM architectures and programming models should account for these novel PIM bottlenecks and the corresponding optimizations in order to broaden the scope of PIM acceleration.
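To build intuition for the bottlenecks the abstract names, the following is a hypothetical back-of-envelope cost model (not taken from the paper; all timing parameters and the overlap factor are illustrative assumptions). It treats a PIM kernel's runtime as bounded by either host-side command issue bandwidth or row activation time, and shows how an eager-activation-style optimization, modeled here simply as overlapping most activation latency with compute, can shift which resource is the limiter:

```python
# Hypothetical back-of-envelope model (not from the paper) of how row
# activation overhead and command issue bandwidth can cap PIM speedup.
# All timing parameters below are illustrative placeholders, not real
# DRAM datasheet values.

def pim_exec_time_ns(n_commands, n_activations,
                     cmd_issue_ns=1.0,   # assumed host-side PIM command issue cost
                     t_rc_ns=45.0,       # assumed DRAM row cycle (activation) time
                     eager_activation=False):
    """Estimate PIM kernel time as the max of command-issue traffic and
    row activation work; eager activation hides most of the row overhead."""
    issue_time = n_commands * cmd_issue_ns
    act_time = n_activations * t_rc_ns
    if eager_activation:
        # assume activations are issued ahead of use, so only ~20% of the
        # activation latency remains exposed (illustrative factor)
        act_time *= 0.2
    return max(issue_time, act_time)

# Activation-bound kernel: eager activation helps until command issue
# bandwidth becomes the new limiter.
lazy = pim_exec_time_ns(n_commands=10_000, n_activations=2_000)
eager = pim_exec_time_ns(n_commands=10_000, n_activations=2_000,
                         eager_activation=True)
print(lazy, eager)  # 90000.0 18000.0
```

In this toy model the lazy case is activation-bound (90 µs vs. 10 µs of command issue), and eager activation cuts it to 18 µs; pushing further would require raising command bandwidth, mirroring the abstract's point that the optimizations address distinct, successively exposed bottlenecks.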


Information & Contributors

Information

Published In

MEMSYS '24: Proceedings of the International Symposium on Memory Systems
September 2024
289 pages
ISBN: 9798400710919
DOI: 10.1145/3695794

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

MEMSYS '24: The International Symposium on Memory Systems
September 30 - October 3, 2024
Washington, DC, USA
