Advanced partitioning techniques for massively distributed computation

Published: 20 May 2012

Abstract

An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.
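
To make the cost of data shuffling and the value of careful repartitioning concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Scope implementation: the function names (`hash_partition`, `needs_repartition`, `groups_are_colocated`) and the dictionary-based row format are hypothetical. It relies on one general property of hash partitioning: data partitioned on a non-empty subset of the grouping columns already keeps each group inside a single partition, so an optimizer may be able to skip a second shuffle.

```python
from collections import defaultdict


def hash_partition(rows, keys, num_partitions):
    """Shuffle rows into buckets by hashing the given key columns (hypothetical helper)."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in keys)
        partitions[hash(key) % num_partitions].append(row)
    return dict(partitions)


def needs_repartition(current_keys, required_keys):
    """Return True when a shuffle is needed before grouping on required_keys.

    Assumption for this sketch: the data is hash-partitioned on current_keys.
    If current_keys is a non-empty subset of required_keys, rows that agree on
    required_keys also agree on current_keys and therefore share a partition.
    """
    if not current_keys:
        return True
    return not set(current_keys) <= set(required_keys)


def groups_are_colocated(partitions, group_keys):
    """Check that every distinct group appears in exactly one partition."""
    owner = {}
    for pid, rows in partitions.items():
        for row in rows:
            group = tuple(row[k] for k in group_keys)
            if owner.setdefault(group, pid) != pid:
                return False
    return True


# Example data: 30 rows with two grouping columns and a value column.
rows = [{"a": i % 5, "b": i % 3, "v": i} for i in range(30)]

# Partition once on column "a" only.
parts = hash_partition(rows, ["a"], num_partitions=4)

# Grouping on ("a", "b") can reuse that partitioning without another shuffle.
reuse_ok = not needs_repartition(["a"], ["a", "b"])
colocated = groups_are_colocated(parts, ["a", "b"])
```

In this sketch, `needs_repartition(["a"], ["a", "b"])` is false, so a plan that groups on `("a", "b")` after partitioning on `"a"` would not need an additional repartitioning operator. A real optimizer would also weigh data skew and the degree of parallelism, which this example ignores.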

    Published In

    SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
    May 2012
    886 pages
    ISBN:9781450312479
    DOI:10.1145/2213836

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. distributed computation
    2. partitioning
    3. query optimization
    4. scope

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '12

    Acceptance Rates

    SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 19
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 20 Feb 2025

    Cited By

    • (2023) Runtime Variation in Big Data Analytics. Proceedings of the ACM on Management of Data 1:1 (1-20). DOI: 10.1145/3588921. Online publication date: 30-May-2023
    • (2023) Distributed Consistency Beyond Queries. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (47-58). DOI: 10.1145/3584372.3588657. Online publication date: 18-Jun-2023
    • (2023) SASPAR: Shared Adaptive Stream Partitioning. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (922-935). DOI: 10.1109/ICDE55515.2023.00076. Online publication date: Apr-2023
    • (2023) SAT: sampling acceleration tree for adaptive database repartition. World Wide Web 26:5 (3503-3533). DOI: 10.1007/s11280-023-01199-3. Online publication date: 3-Aug-2023
    • (2022) The Time Machine in Columnar NoSQL Databases: The Case of Apache HBase. Future Internet 14:3 (92). DOI: 10.3390/fi14030092. Online publication date: 15-Mar-2022
    • (2022) Parallel Query Processing: To Separate Communication from Computation. Proceedings of the 2022 International Conference on Management of Data (1447-1461). DOI: 10.1145/3514221.3526164. Online publication date: 10-Jun-2022
    • (2022) PAW: Data Partitioning Meets Workload Variance. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (123-135). DOI: 10.1109/ICDE53745.2022.00014. Online publication date: May-2022
    • (2022) Region-based Sub-Snapshot (RegSnap): Enhanced Fault Tolerance in Distributed Stream Processing with Partial Snapshot. 2022 IEEE International Conference on Big Data (Big Data) (3374-3382). DOI: 10.1109/BigData55660.2022.10020607. Online publication date: 17-Dec-2022
    • (2021) The cosmos big data platform at Microsoft. Proceedings of the VLDB Endowment 14:12 (3148-3161). DOI: 10.14778/3476311.3476390. Online publication date: 28-Oct-2021
    • (2021) Lachesis. Proceedings of the VLDB Endowment 14:8 (1262-1275). DOI: 10.14778/3457390.3457392. Online publication date: 21-Oct-2021