Spark SQL

Spark SQL provides a relational processing engine for Apache Spark. It allows users to write SQL queries over distributed datasets and take advantage of Spark's optimizations. Spark SQL includes a DataFrame API that represents data as distributed tables, a Catalyst optimizer that applies rules to execution plans, and integration with data sources and machine learning libraries. It aims to support SQL queries on large datasets through its automatic optimization capabilities.

Spark SQL

The 8 fastest-growing tech skills worth over $110,000

No. 1: Spark, up 120%, worth $113,214


Do you know how to write code in Spark?
Can you write SQL?

“SQL is a highly sought-after technical skill due to its ability to work with nearly all databases.”

Ibro Palic, CEO of Resume Templates
History and Evolution of Big Data Technologies

[Figure: the evolution from procedural programming interfaces toward declarative queries with automatic optimization]
So Far…

We have established that we need a platform with automatic optimization.
What do users want?

1. ETL from different sources

2. Advanced analytics
Introducing

Spark SQL: Relational Data Processing in Spark
Background

 Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning
 RDDs are fault-tolerant, in that the system can recover lost data using the lineage graph of the RDDs (by rerunning operations such as a filter to rebuild missing partitions). They can also be explicitly cached in memory or on disk to support iteration
 Shark modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine
Goals for Spark SQL

 Support Relational Processing both within Spark programs and on external data sources
 Provide High Performance using established DBMS techniques
 Easily support New Data Sources
 Enable Extension with advanced analytics algorithms such as graph processing and machine learning
Programming Interface

DataFrame API

 A DataFrame is a distributed collection of rows with a homogeneous schema

Keep track of the hashtags ##

#A Lazy Computation
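DataFrames are lazy: transformations such as filter and select only extend a logical plan, and nothing runs until an action is invoked. A minimal sketch using the SparkSession entry point (the file path and column names are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LazyDataFrames")
      .master("local[*]")
      .getOrCreate()

    val users  = spark.read.json("users.json")    // lazy: no data is read yet
    val adults = users.filter(users("age") > 21)  // lazy: the plan just grows
                      .select("name", "age")
    println(adults.count())                       // action: triggers execution

The later sketches reuse this spark session.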
Data Model and DataFrame Operations

 Spark SQL uses a nested data model based on Hive
 It supports all major SQL data types, including boolean, integer, double, decimal, string, date and timestamp, as well as user-defined data types

Example of DataFrame Operations
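The slide's example is an image in the original; here is a comparable sketch of typical operations, with a hypothetical employees dataset:

    import org.apache.spark.sql.functions._

    // Each call below only extends the logical plan; show() executes it.
    val employees = spark.read.parquet("employees.parquet")
    employees
      .where(col("salary") > 50000)
      .groupBy("deptId")
      .agg(count("*").as("headcount"), avg("salary").as("avgSalary"))
      .show()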


DataFrame Operations Cont.

#Access DF with DSL or SQL
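Both routes compile to the same Catalyst plan. A sketch, assuming a people DataFrame with name and age columns:

    import org.apache.spark.sql.functions.col

    // 1) The DataFrame DSL: errors surface eagerly, when the plan is analyzed
    val viaDsl = people.filter(col("age") >= 18).select("name")

    // 2) SQL over a registered temporary view
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT name FROM people WHERE age >= 18")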


Real World Problems

#Heterogeneous Data Sources
Schema Inference

 Spark SQL can automatically infer the schema of native objects using reflection
 Scala/Java: extracted from the language’s type system
 Python: inferred by sampling the dataset
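A sketch of the Scala case: the schema is derived by reflection from a case class, so no schema is written by hand (the class and rows are illustrative):

    case class Person(name: String, age: Int)

    import spark.implicits._  // brings toDF() into scope for local collections

    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDF()
    people.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)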
In-Memory Caching

#Invoked with .cache()
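A short sketch (path and column are placeholders): .cache() marks the DataFrame for in-memory columnar storage, the first action materializes it, and later queries read the cached data instead of recomputing it:

    import org.apache.spark.sql.functions.col

    val logs = spark.read.parquet("logs.parquet")
    logs.cache()                                // mark for in-memory storage
    logs.count()                                // first action populates the cache
    logs.filter(col("status") === 500).count()  // served from the cached data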


User-Defined Functions

How are Spark SQL’s user-defined functions different from those in traditional database systems?
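Unlike traditional database systems, which require UDFs to be written and registered in a programming environment separate from the query, Spark SQL lets you define them inline as ordinary host-language functions and call them from both the DSL and SQL. A minimal sketch (the df DataFrame and events view are assumptions):

    import org.apache.spark.sql.functions.{col, udf}

    val squared = udf((x: Int) => x * x)         // an ordinary Scala closure
    df.select(squared(col("value"))).show()      // used from the DataFrame DSL

    spark.udf.register("squared", (x: Int) => x * x)
    spark.sql("SELECT squared(value) FROM events")  // or by name from SQL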
Catalyst Optimizer

 Catalyst is based on functional programming constructs in Scala

Purposes

 Ability to add new optimization techniques and features to Spark SQL
 Ability to extend the optimizer
Catalyst Optimization

#Trees

#Rules
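Catalyst represents query plans as trees and expresses optimizations as rules: Scala pattern matches that rewrite subtrees. A self-contained toy version of the constant-folding idea (real Catalyst applies the same pattern over its own Expression trees via transform):

    sealed trait Expr
    case class Literal(value: Int)          extends Expr
    case class Attribute(name: String)      extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // A rule: fold children first, then simplify the current node.
    def fold(e: Expr): Expr = e match {
      case Add(l, r) => (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)  // 1 + 2 => 3
        case (lhs, Literal(0))        => lhs             // x + 0 => x
        case (lhs, rhs)               => Add(lhs, rhs)
      }
      case other => other
    }

    // fold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
    //   == Add(Attribute("x"), Literal(3))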
Catalyst Optimization Cont.

Rule-Based Optimization

Cost-Based Optimization


Query Planning in Spark SQL
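Catalyst takes a query through four phases: analysis, logical optimization, physical planning, and code generation. One way to watch them is explain(extended = true), which prints the parsed, analyzed, optimized and physical plans (reusing the people view registered earlier):

    spark.sql("SELECT name FROM people WHERE age >= 18").explain(true)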
Extension Points

#Open Source Projects


Extension Points Cont.

 Data Sources
Examples:
 CSV
 Avro
 Parquet
 JDBC
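A sketch of reading each of these through the unified reader API (paths and connection settings are placeholders; Avro additionally needs the spark-avro package on the classpath):

    val csv  = spark.read.option("header", "true").csv("data/events.csv")
    val avro = spark.read.format("avro").load("data/events.avro")
    val pq   = spark.read.parquet("data/events.parquet")
    val jdbc = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
      .option("dbtable", "events")
      .load()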
Extension Points Cont.
 User-Defined Types (UDTs)

#Useful for Machine Learning


Advanced Analytics Features

1. Schema Inference for Semi-structured Data

2. Query Federation to External Databases
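A sketch combining both features (paths, URLs and column names are assumptions): the JSON schema is inferred by sampling the records, and a remote JDBC table can be joined with the local data, with simple predicates pushed down to the external database:

    // 1) Schema inference over semi-structured JSON
    val tweets = spark.read.json("tweets.json")
    tweets.printSchema()
    tweets.createOrReplaceTempView("tweets")

    // 2) Query federation: join a remote JDBC table with local data
    spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost/crm")
      .option("dbtable", "users")
      .load()
      .createOrReplaceTempView("users")

    spark.sql("""
      SELECT u.name, t.text
      FROM users u JOIN tweets t ON u.id = t.userId
    """).show()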


Advanced Analytics Features Cont.

3. Integration with Spark’s Machine Learning Library
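DataFrames serve as the exchange format between pipeline stages, so feature extraction and model training compose directly with relational processing. A sketch using spark.ml (column names and the trainingDF input are placeholders):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model    = pipeline.fit(trainingDF)  // trainingDF: "text" and "label" columns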
Evaluation

 SQL Performance
Evaluation Cont.

 DataFrames vs. Native Spark Code


Pipeline Performance
Applications

 Generalized Online Aggregation


 Computational Genomics
 The list is endless, limited only by your imagination…
Conclusion

Our Final Hashtags


#A Platform with
#Automatic optimization
#Complex pipelines that mix relational and complex analytics
#Large-scale data analysis
#Semi-structured data
#Data types for machine learning
#Extensible optimizer called Catalyst
#Easy to add Optimization rules, data sources and data types
