DOI: 10.1145/2723372.2742797

Spark SQL: Relational Data Processing in Spark

Published: 27 May 2015

Abstract

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g. machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g. schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
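
To make the idea of "composable rules" concrete, here is a minimal sketch in Python of a rule-based tree optimizer. It is a toy under stated assumptions, not Catalyst's actual API: the node classes (`Literal`, `Attribute`, `Add`), the rules (`fold_constants`, `drop_zero`), and the helpers (`transform_up`, `optimize`) are all hypothetical names chosen for illustration. Each rule is a small function that rewrites one tree shape and leaves everything else alone, and the optimizer applies the rules repeatedly until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical expression nodes; illustrative only, not Catalyst's classes.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Apply `rule` to every node, children first, then the node itself."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rewrite Add(Literal, Literal) into a single Literal."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Rewrite x + 0 and 0 + x into x."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Run all rules in order until the tree reaches a fixed point."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = transform_up(new_expr, rule)
        if new_expr == expr:
            return expr
        expr = new_expr
    return expr


# x + (1 + 2) is simplified to x + 3.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(tree, [fold_constants, drop_zero])
```

Because each rule only recognizes its own pattern, new rules can be added to the list without touching the existing ones, which is the sense of composability this sketch is meant to illustrate.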

    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN: 9781450327589
    DOI: 10.1145/2723372

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data warehouse
    2. databases
    3. hadoop
    4. machine learning
    5. spark

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • LBNL
    • DARPA

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 paper acceptance rate: 106 of 415 submissions (26%)
    Overall acceptance rate: 785 of 4,003 submissions (20%)

    Article Metrics

    • Downloads (last 12 months): 1,700
    • Downloads (last 6 weeks): 147
    Reflects downloads up to 30 Aug 2024

    Cited By

    • (2024) Machine Learning Mastery. Practical Applications of Data Processing, Algorithms, and Modeling, pages 16-29. DOI: 10.4018/979-8-3693-2909-2.ch002. Online publication date: 14-Jun-2024
    • (2024) Big Data and Cloud Computing. Emerging Trends in Cloud Computing Analytics, Scalability, and Service Models, pages 219-252. DOI: 10.4018/979-8-3693-0900-1.ch012. Online publication date: 22-Mar-2024
    • (2024) DIAERESIS: RDF data partitioning and query processing on SPARK. Semantic Web, pages 1-27. DOI: 10.3233/SW-243554. Online publication date: 6-Mar-2024
    • (2024) A systematic overview of data federation systems. Semantic Web, 15:1, pages 107-165. DOI: 10.3233/SW-223201. Online publication date: 12-Jan-2024
    • (2024) GreatFree as a Generic Distributed Programming Language and the Foundation of the Cloud-Side Operating System. International Journal of Advanced Network, Monitoring and Controls, 8:4, pages 66-81. DOI: 10.2478/ijanmc-2023-0078. Online publication date: 16-Mar-2024
    • (2024) cuallee: A Python package for data quality checks across multiple DataFrame APIs. Journal of Open Source Software, 9:98, article 6684. DOI: 10.21105/joss.06684. Online publication date: Jun-2024
    • (2024) A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning. Proceedings of the VLDB Endowment, 17:11, pages 3565-3579. DOI: 10.14778/3681954.3682021. Online publication date: 1-Jul-2024
    • (2024) Saving Money for Analytical Workloads in the Cloud. Proceedings of the VLDB Endowment, 17:11, pages 3524-3537. DOI: 10.14778/3681954.3682018. Online publication date: 1-Jul-2024
    • (2024) Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data Applications. Proceedings of the VLDB Endowment, 17:11, pages 3151-3164. DOI: 10.14778/3681954.3681990. Online publication date: 1-Jul-2024
    • (2024) Optimizing Video Selection LIMIT Queries with Commonsense Knowledge. Proceedings of the VLDB Endowment, 17:7, pages 1751-1764. DOI: 10.14778/3654621.3654639. Online publication date: 1-Mar-2024
