Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects

Published: 31 May 2020

Abstract

GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speed-ups of up to 18x over PCI-e 3.0 and up to 7.3x over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.
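To make the approach concrete, the sketch below shows a no-partitioning hash join whose probe relation stays in CPU main memory, so the GPU dereferences probe tuples directly over the interconnect instead of staging bulk transfers. This is a minimal illustration, not the authors' implementation: it uses pinned, mapped host memory as a portable stand-in for NVLink 2.0's coherent access to system memory, a simple open-addressing table with linear probing, and illustrative names (build_kernel, probe_kernel).

```cuda
// Minimal sketch (not the paper's implementation): a no-partitioning hash join
// that builds the hash table in GPU memory and probes it with keys that remain
// resident in CPU main memory. Pinned, mapped host memory stands in for
// NVLink 2.0's direct, coherent access to system memory.
#include <cstdio>
#include <cuda_runtime.h>

constexpr unsigned int EMPTY = 0xFFFFFFFFu;   // sentinel for an empty slot

// Insert build-side keys into an open-addressing table (linear probing).
__global__ void build_kernel(const unsigned int *build_keys, size_t n,
                             unsigned int *table, size_t table_size) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key = build_keys[i];
    size_t slot = key % table_size;                       // simple modulo hash
    while (atomicCAS(&table[slot], EMPTY, key) != EMPTY)  // claim a free slot
        slot = (slot + 1) % table_size;
}

// Probe with keys that the GPU dereferences directly in host memory; each
// load travels over the interconnect on demand instead of via a staged copy.
__global__ void probe_kernel(const unsigned int *probe_keys, size_t n,
                             const unsigned int *table, size_t table_size,
                             unsigned long long *matches) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key = probe_keys[i];
    size_t slot = key % table_size;
    while (table[slot] != EMPTY) {
        if (table[slot] == key) { atomicAdd(matches, 1ULL); break; }
        slot = (slot + 1) % table_size;
    }
}

int main() {
    const size_t build_n = 1u << 20, probe_n = 1u << 24;
    const size_t table_size = 2 * build_n;

    // Probe relation stays in pinned host memory; under unified virtual
    // addressing the host pointer is directly usable inside the kernel.
    unsigned int *probe_keys;
    cudaHostAlloc(&probe_keys, probe_n * sizeof(unsigned int), cudaHostAllocMapped);
    for (size_t i = 0; i < probe_n; ++i) probe_keys[i] = (unsigned int)(i % build_n);

    // Build relation and hash table live in GPU memory.
    unsigned int *build_keys_d, *table_d;
    cudaMalloc(&build_keys_d, build_n * sizeof(unsigned int));
    cudaMalloc(&table_d, table_size * sizeof(unsigned int));
    cudaMemset(table_d, 0xFF, table_size * sizeof(unsigned int));  // all EMPTY

    unsigned int *build_keys_h = new unsigned int[build_n];
    for (size_t i = 0; i < build_n; ++i) build_keys_h[i] = (unsigned int)i;
    cudaMemcpy(build_keys_d, build_keys_h, build_n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    unsigned long long *matches_d;
    cudaMalloc(&matches_d, sizeof(unsigned long long));
    cudaMemset(matches_d, 0, sizeof(unsigned long long));

    build_kernel<<<(build_n + 255) / 256, 256>>>(build_keys_d, build_n,
                                                 table_d, table_size);
    probe_kernel<<<(probe_n + 255) / 256, 256>>>(probe_keys, probe_n,
                                                 table_d, table_size, matches_d);

    unsigned long long matches = 0;
    cudaMemcpy(&matches, matches_d, sizeof(matches), cudaMemcpyDeviceToHost);
    printf("join matches: %llu\n", matches);

    cudaFree(matches_d); cudaFree(table_d); cudaFree(build_keys_d);
    cudaFreeHost(probe_keys); delete[] build_keys_h;
    return 0;
}
```

The bandwidth of the interconnect determines how fast the on-demand loads in probe_kernel complete: on a fast interconnect such as NVLink 2.0 the probe side can exceed GPU memory capacity without a transfer bottleneck, whereas over PCI-e 3.0 the same access pattern is throttled by the bottleneck described in the abstract.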

Supplementary Material

MP4 File (3318464.3389705.mp4): Presentation Video



Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020, 2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Publisher

Association for Computing Machinery, New York, NY, United States
