DOI:10.1145/2723372.2750545
research-article

From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System

Published: 27 May 2015

Abstract

Big data analytics often requires processing complex queries using massive parallelism, where the main performance metric is the communication cost incurred during data reshuffling. In this paper, we describe a system that can efficiently compute complex join queries, including queries with cyclic joins, on a massively parallel architecture. We build on two independent lines of work for multi-join query evaluation: a communication-optimal algorithm for distributed evaluation, and a worst-case optimal algorithm for sequential evaluation. We evaluate these algorithms together, then describe novel, practical optimizations for both algorithms.
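
To make the idea of a single-round reshuffle for a cyclic join concrete, the following is a minimal sketch, not the system described in the paper. It assumes the triangle query R(a,b), S(b,c), T(c,a), integer attribute values, servers arranged in a three-dimensional grid, and a modulo operation standing in for a hash function. All function names are hypothetical. The local join is a plain hash-based join chosen for brevity; it is not a worst-case optimal algorithm.

```python
from itertools import product

def h(value, buckets):
    # Stand-in for a hash function (assumption: integer attribute values).
    return value % buckets

def grid_shuffle(R, S, T, shares):
    """Route tuples of R(a,b), S(b,c), T(c,a) to servers indexed (ia, ib, ic).

    Each tuple fixes two grid coordinates and is replicated along the third,
    so the three tuples of any triangle meet on exactly one server.
    """
    pa, pb, pc = shares
    servers = {cell: ([], [], [])
               for cell in product(range(pa), range(pb), range(pc))}
    for a, b in R:  # replicate along the c dimension
        for ic in range(pc):
            servers[(h(a, pa), h(b, pb), ic)][0].append((a, b))
    for b, c in S:  # replicate along the a dimension
        for ia in range(pa):
            servers[(ia, h(b, pb), h(c, pc))][1].append((b, c))
    for c, a in T:  # replicate along the b dimension
        for ib in range(pb):
            servers[(h(a, pa), ib, h(c, pc))][2].append((c, a))
    return servers

def local_triangles(R, S, T):
    # Simple hash-based local join (not worst-case optimal).
    s_by_b = {}
    for b, c in S:
        s_by_b.setdefault(b, []).append(c)
    t_set = set(T)
    return {(a, b, c)
            for a, b in R
            for c in s_by_b.get(b, [])
            if (c, a) in t_set}

def parallel_triangles(R, S, T, shares=(2, 2, 2)):
    # One communication round, then an independent local join per server.
    result = set()
    for r, s, t in grid_shuffle(R, S, T, shares).values():
        result |= local_triangles(r, s, t)
    return result

edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 1)]
triangles = parallel_triangles(edges, edges, edges)
```

In this sketch the communication cost is the total number of tuple copies sent: each relation is replicated along one grid dimension, so the choice of `shares` trades replication against the amount of work per server.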


Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. join query evaluation
  2. parallel database system

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
