survey

Query Processing on Heterogeneous CPU/GPU Systems

Authors: Viktor Rosenfeld, Sebastian Breß, Volker Markl

ACM Computing Surveys (CSUR), Volume 55, Issue 1

Article No.: 11, Pages 1 - 38

https://doi.org/10.1145/3485126

Published: 17 January 2022

Abstract

Due to their high computational power and internal memory bandwidth, graphics processing units (GPUs) have been extensively studied by the database systems research community. A heterogeneous query processing system that employs CPUs and GPUs at the same time has to solve many challenges, including how to distribute the workload across processors with different capabilities, how to overcome the data transfer bottleneck, and how to support implementations for multiple processors efficiently. In this survey, we devise a classification scheme to categorize the techniques developed to address these challenges. Based on this scheme, we categorize query processing systems on heterogeneous CPU/GPU systems and identify open research problems.

Supplementary Material

rosenfeld (rosenfeld.zip)

Supplemental movie, appendix, image, and software files for Query Processing on Heterogeneous CPU/GPU Systems

Information & Contributors

Information

Published In

ACM Computing Surveys Volume 55, Issue 1

January 2023

860 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3492451

Editor:
Albert Zomaya
University of Sydney, Australia

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 January 2022

Accepted: 01 August 2021

Revised: 01 June 2021

Received: 01 December 2020

Published in CSUR Volume 55, Issue 1

Qualifiers

Survey
Refereed

Funding Sources

EU Horizon 2020 program as E2Data
DFG
German Ministry for Education and Research as BIFOLD-BBDC

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 17
Total Downloads: 3,723
Downloads (Last 12 months): 1,266
Downloads (Last 6 weeks): 179

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Carvalho M, Simitsis A, Queralt A, Romero O. (2024). Workload Placement on Heterogeneous CPU-GPU Systems. Proceedings of the VLDB Endowment 17:12 (4241-4244). Online publication date: 8-Nov-2024.
https://doi.org/10.14778/3685800.3685845
Justen D, Ritter D, Fraser C, Lamb A, Lee A, Bodner T, Haddad M, Zeuch S, Markl V, Boehm M. (2024). POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance. Proceedings of the VLDB Endowment 17:6 (1350-1363). Online publication date: 3-May-2024.
https://dl.acm.org/doi/10.14778/3648160.3648175
Kroviakov A, Kurapov P, Anneser C, Giceva J. (2024). Heterogeneous Intra-Pipeline Device-Parallel Aggregations. Proceedings of the 20th International Workshop on Data Management on New Hardware (1-10). Online publication date: 10-Jun-2024.
https://dl.acm.org/doi/10.1145/3662010.3663441
Chen W, Tsai H, Ling J. (2024). Parallel Computation of Dominance Scores for Multidimensional Datasets on GPUs. IEEE Transactions on Parallel and Distributed Systems 35:6 (919-931). Online publication date: Jun-2024.
https://doi.org/10.1109/TPDS.2024.3382119
Guo X, He M, Qin S, Wang D, Li C. (2024). PS-Based Heterogeneous Computing Framework. 2024 Sixth International Conference on Next Generation Data-driven Networks (NGDN) (6-10). Online publication date: 26-Apr-2024.
https://doi.org/10.1109/NGDN61651.2024.10744160
Wei J, Gu Y, Li T, Qi J, Li C, Zhang Y, Jensen C, Yu G. (2024). LTPG: Large-Batch Transaction Processing on GPUs with Deterministic Concurrency Control. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3865-3877). Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00296
Oh S, Moon G, Park S. (2024). ML-Based Dynamic Operator-Level Query Mapping for Stream Processing Systems in Heterogeneous Computing Environments. 2024 IEEE International Conference on Cluster Computing (CLUSTER) (226-237). Online publication date: 24-Sep-2024.
https://doi.org/10.1109/CLUSTER59578.2024.00027
Gill S, Wu H, Patros P, Ottaviani C, Arora P, Pujol V, Haunschild D, Parlikad A, Cetinkaya O, Lutfiyya H, Stankovski V, Li R, Ding Y, Qadir J, Abraham A, Ghosh S, Song H, Sakellariou R, Rana O, Rodrigues J, Kanhere S, Dustdar S, Uhlig S, Ramamohanarao K, Buyya R. (2024). Modern computing: Vision and challenges. Telematics and Informatics Reports 13 (100116). Online publication date: Mar-2024.
https://doi.org/10.1016/j.teler.2024.100116
Mencagli G, Torquati M, Griebler D, Fais A, Danelutto M. (2024). General-purpose data stream processing on heterogeneous architectures with WindFlow. Journal of Parallel and Distributed Computing 184:C. Online publication date: 27-Feb-2024.
https://dl.acm.org/doi/10.1016/j.jpdc.2023.104782
Alaei M, Yazdanpanah F. (2024). A Survey on Heterogeneous CPU–GPU Architectures and Simulators. Concurrency and Computation: Practice and Experience 37:1. Online publication date: 30-Oct-2024.
https://doi.org/10.1002/cpe.8318