Research Article | Open Access

Scalability Limitations of Processing-in-Memory using Real System Evaluations

Published: 21 February 2024

Abstract

    Processing-in-memory (PIM), where compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, memory vendors have announced several PIM-based systems aimed at minimizing data movement and improving both performance and energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes", the compute units placed near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while PIM offers benefits, it also faces scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how the collective communications commonly found in many kernels and workloads can be problematic for PIM systems. To evaluate the impact of collective communication on PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that exercise representative collective communication patterns: AllReduce and All-to-All. Specifically, we evaluate (1) the embedding tables commonly used in recommendation systems, which require AllReduce, and (2) the Number Theoretic Transform (NTT) kernel, a critical component of Fully Homomorphic Encryption (FHE), which requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, whenever communication between PIM nodes is necessary (i.e., remote data is needed), the data must be routed through the host CPU, severely hampering application performance. To increase the scalability (and applicability) of PIM for future workloads, we make the case that future PIM architectures need efficient communication or interconnection networks between the PIM nodes, requiring both hardware and software support.
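
    To make the bottleneck concrete, below is a minimal sketch, in plain C, of a host-mediated AllReduce across PIM nodes. It is a hypothetical model, not the authors' code and not the UPMEM SDK API: because each node can only access its own local memory, every partial result must be copied to the host, reduced on the host CPU, and broadcast back. The names `pim_node_t`, `copy_to_host`, and `broadcast_to_nodes` are illustrative stand-ins for vendor transfer primitives.

```c
/*
 * Hypothetical sketch of a host-orchestrated AllReduce across PIM nodes.
 * Each "node" models a PIM compute unit that can only read/write its own
 * local memory; any inter-node exchange must round-trip through the host.
 * pim_node_t, copy_to_host(), and broadcast_to_nodes() are illustrative
 * stand-ins, not UPMEM SDK calls.
 */
#include <stdio.h>
#include <string.h>

#define NUM_NODES 8   /* number of PIM compute units */
#define VEC_LEN   4   /* elements of the partial result per node */

typedef struct {
    long local[VEC_LEN];   /* node-private memory holding partial sums */
} pim_node_t;

/* Models a DMA transfer from one node's local memory to a host buffer. */
static void copy_to_host(const pim_node_t *node, long *host_buf) {
    memcpy(host_buf, node->local, sizeof(node->local));
}

/* Models pushing the reduced vector from the host back to every node. */
static void broadcast_to_nodes(pim_node_t *nodes, const long *reduced) {
    for (int n = 0; n < NUM_NODES; n++)
        memcpy(nodes[n].local, reduced, VEC_LEN * sizeof(long));
}

int main(void) {
    pim_node_t nodes[NUM_NODES];

    /* Each node holds a different partial result,
     * e.g., per-node partial sums of embedding-table lookups. */
    for (int n = 0; n < NUM_NODES; n++)
        for (int i = 0; i < VEC_LEN; i++)
            nodes[n].local[i] = n + i;

    /* Host-mediated AllReduce: O(NUM_NODES) transfers over the host link,
     * since the PIM nodes cannot exchange data directly. */
    long staging[VEC_LEN], reduced[VEC_LEN] = {0};
    for (int n = 0; n < NUM_NODES; n++) {
        copy_to_host(&nodes[n], staging);          /* node -> host */
        for (int i = 0; i < VEC_LEN; i++)
            reduced[i] += staging[i];              /* reduce on host CPU */
    }
    broadcast_to_nodes(nodes, reduced);            /* host -> all nodes */

    for (int i = 0; i < VEC_LEN; i++)
        printf("reduced[%d] = %ld\n", i, reduced[i]);
    return 0;
}
```

    Under these assumptions, the number of host-link crossings grows linearly with the node count, and an All-to-All exchange behaves analogously since every element that changes nodes must cross the host link; a direct inter-node interconnect could instead complete the same reduction in O(log N) steps without the host in the loop, which is the case the paper makes for future PIM architectures.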

    Cited By

    • SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 217-229. Online publication date: 5-May-2024. https://doi.org/10.1109/ISPASS61541.2024.00029
    • NeuraChip: Accelerating GNN Computations with a Hash-based Decoupled Spatial Accelerator. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 946-960. Online publication date: 29-Jun-2024. https://doi.org/10.1109/ISCA59077.2024.00073


      Published In

      Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 8, Issue 1
      March 2024, 494 pages
      EISSN: 2476-1249
      DOI: 10.1145/3649331
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 21 February 2024
      Published in POMACS Volume 8, Issue 1

      Author Tags

      1. collective communication
      2. interconnection networks
      3. processing-in-memory

      Article Metrics

      • Downloads (Last 12 months)670
      • Downloads (Last 6 weeks)115
      Reflects downloads up to 12 Aug 2024

