Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects

Published: 31 May 2020

Abstract

GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the on-board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speed-ups of up to 18x over PCI-e 3.0 and up to 7.3x over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.
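To make the approach concrete, the sketch below shows a no-partitioning hash join whose probe relation stays in CPU main memory, so the GPU dereferences probe tuples directly over the interconnect instead of staging bulk transfers. This is a minimal illustration, not the authors' implementation: it uses pinned, mapped host memory as a portable stand-in for NVLink 2.0's coherent access to system memory, a simple open-addressing table with linear probing, and illustrative names (build_kernel, probe_kernel).

```cuda
// Minimal sketch (not the paper's implementation): a no-partitioning hash join
// that builds the hash table in GPU memory and probes it with keys that remain
// resident in CPU main memory. Pinned, mapped host memory stands in for
// NVLink 2.0's direct, coherent access to system memory.
#include <cstdio>
#include <cuda_runtime.h>

constexpr unsigned int EMPTY = 0xFFFFFFFFu;   // sentinel for an empty slot

// Insert build-side keys into an open-addressing table (linear probing).
__global__ void build_kernel(const unsigned int *build_keys, size_t n,
                             unsigned int *table, size_t table_size) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key = build_keys[i];
    size_t slot = key % table_size;                       // simple modulo hash
    while (atomicCAS(&table[slot], EMPTY, key) != EMPTY)  // claim a free slot
        slot = (slot + 1) % table_size;
}

// Probe with keys that the GPU dereferences directly in host memory; each
// load travels over the interconnect on demand instead of via a staged copy.
__global__ void probe_kernel(const unsigned int *probe_keys, size_t n,
                             const unsigned int *table, size_t table_size,
                             unsigned long long *matches) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int key = probe_keys[i];
    size_t slot = key % table_size;
    while (table[slot] != EMPTY) {
        if (table[slot] == key) { atomicAdd(matches, 1ULL); break; }
        slot = (slot + 1) % table_size;
    }
}

int main() {
    const size_t build_n = 1u << 20, probe_n = 1u << 24;
    const size_t table_size = 2 * build_n;

    // Probe relation stays in pinned host memory; under unified virtual
    // addressing the host pointer is directly usable inside the kernel.
    unsigned int *probe_keys;
    cudaHostAlloc(&probe_keys, probe_n * sizeof(unsigned int), cudaHostAllocMapped);
    for (size_t i = 0; i < probe_n; ++i) probe_keys[i] = (unsigned int)(i % build_n);

    // Build relation and hash table live in GPU memory.
    unsigned int *build_keys_d, *table_d;
    cudaMalloc(&build_keys_d, build_n * sizeof(unsigned int));
    cudaMalloc(&table_d, table_size * sizeof(unsigned int));
    cudaMemset(table_d, 0xFF, table_size * sizeof(unsigned int));  // all EMPTY

    unsigned int *build_keys_h = new unsigned int[build_n];
    for (size_t i = 0; i < build_n; ++i) build_keys_h[i] = (unsigned int)i;
    cudaMemcpy(build_keys_d, build_keys_h, build_n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    unsigned long long *matches_d;
    cudaMalloc(&matches_d, sizeof(unsigned long long));
    cudaMemset(matches_d, 0, sizeof(unsigned long long));

    build_kernel<<<(build_n + 255) / 256, 256>>>(build_keys_d, build_n,
                                                 table_d, table_size);
    probe_kernel<<<(probe_n + 255) / 256, 256>>>(probe_keys, probe_n,
                                                 table_d, table_size, matches_d);

    unsigned long long matches = 0;
    cudaMemcpy(&matches, matches_d, sizeof(matches), cudaMemcpyDeviceToHost);
    printf("join matches: %llu\n", matches);

    cudaFree(matches_d); cudaFree(table_d); cudaFree(build_keys_d);
    cudaFreeHost(probe_keys); delete[] build_keys_h;
    return 0;
}
```

The bandwidth of the interconnect determines how fast the on-demand loads in probe_kernel complete: on a fast interconnect such as NVLink 2.0 the probe side can exceed GPU memory capacity without a transfer bottleneck, whereas over PCI-e 3.0 the same access pattern is throttled by the bottleneck described in the abstract.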

Supplementary Material

MP4 File (3318464.3389705.mp4): Presentation Video



Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020, 2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Publisher

Association for Computing Machinery, New York, NY, United States
