research-article

Design and Analysis of a Processing-in-DIMM Join Algorithm: A Case Study with UPMEM DIMMs

Published: 20 June 2023

Abstract

Modern dual in-line memory modules (DIMMs) support processing-in-memory (PIM) by implementing in-DIMM processors (IDPs) located near memory banks. PIM can greatly accelerate in-memory join, whose performance is frequently bounded by main-memory accesses, by offloading join operations from host central processing units (CPUs) to the IDPs. Because real PIM hardware became available only very recently, prior PIM-assisted join algorithms have relied on PIM hardware simulators that assume fast shared memory between the IDPs and fast inter-IDP communication. On commodity PIM-enabled DIMMs, however, the IDPs do not share memory and require the CPUs to mediate inter-IDP communication. These architectural discrepancies make the prior algorithms incompatible with the DIMMs. Thus, to exploit the high potential of PIM on commodity PIM-enabled DIMMs, we need a new join algorithm designed and optimized for the DIMMs and their architectural characteristics.
In this paper, we design and analyze Processing-In-DIMM Join (PID-Join), a fast in-memory join algorithm that exploits UPMEM DIMMs, currently the only publicly available PIM-enabled DIMMs. These DIMMs pose several key challenges to efficient join acceleration, including the shared-nothing nature and limited compute capabilities of the IDPs, the lack of hardware support for fast inter-IDP communication, and the slow IDP-wise data transfers between the IDPs and the main memory. PID-Join overcomes these challenges by prototyping and evaluating hash, sort-merge, and nested-loop algorithms optimized for the IDPs, enabling fast inter-IDP communication using host CPU cache streaming and vector instructions, and facilitating fast rank-wise data transfers between the IDPs and the main memory. Our evaluation on a real system equipped with eight UPMEM DIMMs and 1,024 IDPs shows that PID-Join greatly improves the performance of in-memory join over various CPU-based in-memory join algorithms.
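
The shared-nothing constraint described above can be illustrated with a minimal host-side simulation in Python. This is only a sketch under stated assumptions, not the PID-Join implementation: the function names (`partition`, `local_hash_join`, `partitioned_join`), the constant `NUM_IDPS`, and the use of Python's built-in `hash` are all hypothetical choices made for illustration. The sketch shows the general idea that, because IDPs cannot read each other's memory, the host first routes tuples with equal keys to the same IDP, after which each IDP can join its partition independently.

```python
from collections import defaultdict

NUM_IDPS = 4  # hypothetical small value; the evaluated system has 1,024 IDPs


def partition(tuples, num_idps):
    """Host-mediated step: route each (key, payload) tuple to one IDP by key hash."""
    parts = [[] for _ in range(num_idps)]
    for key, payload in tuples:
        parts[hash(key) % num_idps].append((key, payload))
    return parts


def local_hash_join(r_part, s_part):
    """Per-IDP step: build a hash table on R's partition, then probe it with S's."""
    table = defaultdict(list)
    for key, payload in r_part:
        table[key].append(payload)
    return [
        (key, r_payload, s_payload)
        for key, s_payload in s_part
        for r_payload in table.get(key, [])
    ]


def partitioned_join(r, s, num_idps=NUM_IDPS):
    """Simulate the whole join: partition both relations, then join each pair locally."""
    r_parts = partition(r, num_idps)
    s_parts = partition(s, num_idps)
    out = []
    # Each iteration stands in for one IDP working only on its own partition.
    for r_part, s_part in zip(r_parts, s_parts):
        out.extend(local_hash_join(r_part, s_part))
    return out
```

Because matching keys always land in the same partition, no simulated IDP needs data held by another one during the local join, which mirrors the shared-nothing property the abstract describes. The sketch does not model the CPU cache streaming, vector instructions, or rank-wise transfers that the abstract names.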

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. in-memory join
  2. processing-in-DIMM
  3. processing-in-memory

Qualifiers

  • Research-article
