Spark SQL: Relational Data Processing in Spark

Authors: Michael Armbrust, Reynold S. Xin, Joseph K. Bradley, Michael J. Franklin, Matei Zaharia

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1383 - 1394

https://doi.org/10.1145/2723372.2742797

Published: 27 May 2015

Abstract

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
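The abstract's idea of an optimizer built from small, composable rules can be illustrated with a toy sketch. The Python below is not Catalyst's API (Catalyst is written in Scala and is far richer); every name in it (`Literal`, `Attribute`, `Add`, `fold_constants`, `drop_zero`, `transform_up`, `optimize`) is a hypothetical stand-in chosen for illustration. It shows only the general technique: each rule rewrites one node of an expression tree or declines, and the rule set is applied repeatedly until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    # Add(Literal(a), Literal(b)) becomes Literal(a + b).
    if isinstance(e, Add) and isinstance(e.left, Literal) and isinstance(e.right, Literal):
        return Literal(e.left.value + e.right.value)
    return None


def drop_zero(e: Expr) -> Optional[Expr]:
    # x + 0 and 0 + x both become x.
    if isinstance(e, Add):
        if e.left == Literal(0):
            return e.right
        if e.right == Literal(0):
            return e.left
    return None


def transform_up(e: Expr, rules: list) -> Expr:
    # Rewrite the children first, then try each rule on the current node.
    if isinstance(e, Add):
        e = Add(transform_up(e.left, rules), transform_up(e.right, rules))
    for rule in rules:
        rewritten = rule(e)
        if rewritten is not None:
            e = rewritten
    return e


def optimize(e: Expr, rules: list, max_iters: int = 100) -> Expr:
    # Re-apply the whole rule set until a fixed point (or the iteration cap).
    for _ in range(max_iters):
        rewritten = transform_up(e, rules)
        if rewritten == e:
            return e
        e = rewritten
    return e


rules = [fold_constants, drop_zero]
plan = Add(Add(Attribute("x"), Literal(0)), Add(Literal(1), Literal(2)))
optimized = optimize(plan, rules)
```

Because each rule inspects a single node and knows nothing about the others, adding an optimization in this sketch means appending one more function to `rules`; the traversal and fixed-point loop stay unchanged.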

Index Terms

Information systems → Data management systems → Database management system engines

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015, 2110 pages

ISBN: 9781450327589

DOI: 10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF
LBNL
DARPA

Conference

SIGMOD/PODS'15: International Conference on Management of Data

Sponsor: SIGMOD

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 904
Total Downloads: 13,228
Downloads (Last 12 months): 1,700
Downloads (Last 6 weeks): 147

Reflects downloads up to 30 Aug 2024
