DOI: 10.1145/3299869.3314045
research-article

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Published: 25 June 2019

    Abstract

    Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.


    Published In

    SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
    June 2019
    2106 pages
    ISBN: 9781450356435
    DOI: 10.1145/3299869
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data warehouses
    2. databases
    3. hadoop
    4. hive

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '19
    SIGMOD/PODS '19: International Conference on Management of Data
    June 30 - July 5, 2019
    Amsterdam, Netherlands

    Acceptance Rates

    SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Cloud-Based Infrastructure and DevOps for Energy Fault Detection in Smart Buildings. Computers 13:1 (23). DOI: 10.3390/computers13010023. Online publication date: 16-Jan-2024
    • (2024) A Model for Enhancing Unstructured Big Data Warehouse Execution Time. Big Data and Cognitive Computing 8:2 (17). DOI: 10.3390/bdcc8020017. Online publication date: 6-Feb-2024
    • (2024) Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine. Companion of the 2024 International Conference on Management of Data (5-17). DOI: 10.1145/3626246.3653368. Online publication date: 9-Jun-2024
    • (2024) Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan. IEEE Transactions on Cloud Computing 12:1 (84-99). DOI: 10.1109/TCC.2023.3339208. Online publication date: Jan-2024
    • (2024) Cloud-Based Analysis of Large-Scale Hyperspectral Imagery for Oil Spill Detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 17 (2461-2474). DOI: 10.1109/JSTARS.2023.3344022. Online publication date: 2024
    • (2024) QHB+: Accelerated Configuration Optimization for Automated Performance Tuning of Spark SQL Applications. IEEE Access 12 (60138-60148). DOI: 10.1109/ACCESS.2024.3391333. Online publication date: 2024
    • (2023) InftyDedup. Proceedings of the 21st USENIX Conference on File and Storage Technologies (33-48). DOI: 10.5555/3585938.3585941. Online publication date: 21-Feb-2023
    • (2023) ADOps: An Anomaly Detection Pipeline in Structured Logs. Proceedings of the VLDB Endowment 16:12 (4050-4053). DOI: 10.14778/3611540.3611618. Online publication date: 1-Aug-2023
    • (2023) Design and Development of Big Data Platform for Smart University. Proceedings of the 7th International Conference on Computer Science and Application Engineering (1-5). DOI: 10.1145/3627915.3629592. Online publication date: 17-Oct-2023
    • (2023) A Model and Survey of Distributed Data-Intensive Systems. ACM Computing Surveys 56:1 (1-69). DOI: 10.1145/3604801. Online publication date: 26-Aug-2023
