research-article

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Pages 1773 - 1786

https://doi.org/10.1145/3299869.3314045

Published: 25 June 2019

Abstract

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper, we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

Index Terms

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
1. Information systems
  1. Data management systems
    1. Database management system engines
    2. Information integration
      1. Federated databases
  2. Information systems applications
    1. Decision support systems
      1. Data warehouses

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

June 2019

2106 pages

ISBN:9781450356435

DOI:10.1145/3299869

General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)

Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30 - July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 39

Total Downloads: 1,247

Downloads (Last 12 months): 131

Downloads (Last 6 weeks): 14

Citations

Cited By

Horvath K, Abid M, Merino T, Zimmerman R, Peker Y, Khan S (2024). Cloud-Based Infrastructure and DevOps for Energy Fault Detection in Smart Buildings. Computers 13:1 (23). Online publication date: 16-Jan-2024.
https://doi.org/10.3390/computers13010023

Farhan M, Youssef A, Abdelhamid L (2024). A Model for Enhancing Unstructured Big Data Warehouse Execution Time. Big Data and Cognitive Computing 8:2 (17). Online publication date: 6-Feb-2024.
https://doi.org/10.3390/bdcc8020017

Lamb A, Shen Y, Heres D, Chakraborty J, Kabak M, Hsieh L, Sun C, Barcelo P, Sanchez-Pi N, Meliou A, Sudarshan S (2024). Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine. Companion of the 2024 International Conference on Management of Data (5-17). Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1145/3626246.3653368

Weintraub G, Gudes E, Dolev S, Ullman J (2024). Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan. IEEE Transactions on Cloud Computing 12:1 (84-99). Online publication date: Jan-2024.
https://doi.org/10.1109/TCC.2023.3339208

Haut J, Moreno-Alvarez S, Pastor-Vargas R, Perez-Garcia A, Paoletti M (2024). Cloud-Based Analysis of Large-Scale Hyperspectral Imagery for Oil Spill Detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 17 (2461-2474). Online publication date: 2024.
https://doi.org/10.1109/JSTARS.2023.3344022

Jang D, Yoon H, Jung K, Chung Y (2024). QHB+: Accelerated Configuration Optimization for Automated Performance Tuning of Spark SQL Applications. IEEE Access 12 (60138-60148). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3391333

Kotlarska I, Jackowski A, Lichota K, Welnicki M, Dubnicki C, Iwanicki K, Naor D, Goel A (2023). InftyDedup. Proceedings of the 21st USENIX Conference on File and Storage Technologies (33-48). Online publication date: 21-Feb-2023.
https://dl.acm.org/doi/10.5555/3585938.3585941

Song X, Zhu Y, Wu J, Liu B, Wei H (2023). ADOps: An Anomaly Detection Pipeline in Structured Logs. Proceedings of the VLDB Endowment 16:12 (4050-4053). Online publication date: 1-Aug-2023.
https://dl.acm.org/doi/10.14778/3611540.3611618

Liu F, Tong X, Mao J, Zhao Y (2023). Design and Development of Big Data Platform for Smart University. Proceedings of the 7th International Conference on Computer Science and Application Engineering (1-5). Online publication date: 17-Oct-2023.
https://dl.acm.org/doi/10.1145/3627915.3629592

Margara A, Cugola G, Felicioni N, Cilloni S (2023). A Model and Survey of Distributed Data-Intensive Systems. ACM Computing Surveys 56:1 (1-69). Online publication date: 26-Aug-2023.
https://dl.acm.org/doi/10.1145/3604801
