DOI: 10.1145/3183713.3197388
research-article

Algorithmic Aspects of Parallel Query Processing

Published: 27 May 2018

Abstract

In the last decade we have witnessed a growing interest in processing large data sets on large-scale distributed clusters. A big part of the complex data analysis pipelines performed by these systems consists of a sequence of relatively simple query operations, such as joining two or more tables, or sorting. This tutorial discusses several recent algorithmic developments for data processing in such large distributed clusters. It uses as a model of computation the Massively Parallel Computation (MPC) model, a simplification of the BSP model, where the only cost is given by the amount of communication and the number of communication rounds. Based on the MPC model, we study and analyze several algorithms for three core data processing tasks: multiway join queries, sorting and matrix multiplication. We discuss the common algorithmic techniques across all tasks, relate the algorithms to what is used in practical systems, and finally present open problems for future research.
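
To make the cost measure concrete, the following is a minimal sketch of how communication could be accounted for in a single round. It is an illustration, not an algorithm taken from the tutorial: it simulates a plain hash-partitioned join of R(a, b) and S(b, c) on p servers and reports the largest number of tuples any one server receives. The function name `one_round_hash_join`, the tuple layout, and the use of Python's built-in `hash` for routing are all assumptions made for this example.

```python
from collections import defaultdict


def one_round_hash_join(R, S, p):
    """Simulate a one-round hash join of R(a, b) and S(b, c) on p servers.

    Returns (sorted output tuples, max load), where the load of a server
    is the number of tuples it receives in the single communication round.
    All names here are hypothetical and chosen for illustration only.
    """
    # Communication round: route every tuple to server hash(b) % p.
    inbox_R = defaultdict(list)
    inbox_S = defaultdict(list)
    for a, b in R:
        inbox_R[hash(b) % p].append((a, b))
    for b, c in S:
        inbox_S[hash(b) % p].append((b, c))

    # Cost counted in this sketch: the most tuples received by any server.
    max_load = max(len(inbox_R[s]) + len(inbox_S[s]) for s in range(p))

    # Local computation (not charged here): join the fragments per server.
    output = []
    for s in range(p):
        by_b = defaultdict(list)
        for a, b in inbox_R[s]:
            by_b[b].append(a)
        for b, c in inbox_S[s]:
            for a in by_b.get(b, []):
                output.append((a, b, c))
    return sorted(output), max_load


if __name__ == "__main__":
    # Evenly spread join keys: the load is balanced across both servers.
    R = [(1, 0), (2, 1), (3, 2), (4, 3)]
    S = [(0, "x"), (1, "y"), (2, "z"), (3, "w")]
    print(one_round_hash_join(R, S, p=2))

    # One heavy join key: every tuple lands on the same server.
    R_skew = [(i, 7) for i in range(4)]
    S_skew = [(7, "c")]
    print(one_round_hash_join(R_skew, S_skew, p=4))
```

In the second call all five input tuples are routed to one server, so adding servers does not reduce the maximum load. This is the kind of imbalance that a load-based cost measure exposes.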

    Published In

    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018
    1874 pages
    ISBN:9781450347037
    DOI:10.1145/3183713

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bulk synchronous parallel model
    2. distributed query evaluation

    Funding Sources

    • NSF AiTF
    • NSF III

    Conference

    SIGMOD/PODS '18

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
