Spark SQL

Spark SQL provides a relational processing engine for Apache Spark. It allows users to write SQL queries over distributed datasets and take advantage of Spark's optimizations. Spark SQL includes a DataFrame API that represents data as distributed tables, a Catalyst optimizer that applies rules to execution plans, and integration with data sources and machine learning libraries. It aims to support SQL queries on large datasets through its automatic optimization capabilities.

Spark SQL

The 8 fastest-growing tech skills worth over $110,000

No. 1: Spark, up 120%, worth $113,214


Do you know how to write code in Spark?
Can you write SQL?

“SQL is a highly sought-after technical skill due to its ability to work with nearly all databases.”

Ibro Palic, CEO of Resume Templates
History and Evolution of Big Data Technologies

[Figure: the evolution from procedural programming interfaces toward declarative queries with automatic optimization]
So Far…

We have established that we need a platform with automatic optimization.
What do users want?

1. ETL from different sources

2. Advanced analytics
Introducing

Spark SQL: Relational Data Processing in Spark
Background

 Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning
 RDDs are fault-tolerant, in that the system can recover lost data using the lineage graph of the RDDs (by rerunning operations such as a filter to rebuild missing partitions). They can also be explicitly cached in memory or on disk to support iteration
 Shark modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine
Goals for Spark SQL

 Support Relational Processing both within Spark programs and on external data sources
 Provide High Performance using established DBMS techniques
 Easily support New Data Sources
 Enable Extension with advanced analytics algorithms such as graph processing and machine learning
Programming Interface

DataFrame API

 A DataFrame is a distributed collection of rows with a homogeneous schema

Keep track of the hashtags ##

#A Lazy Computation
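DataFrames are lazy: transformations such as filter and select only extend a logical plan, and nothing runs until an action is invoked. A minimal sketch using the SparkSession entry point (the file path and column names are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LazyDataFrames")
      .master("local[*]")
      .getOrCreate()

    val users  = spark.read.json("users.json")    // lazy: no data is read yet
    val adults = users.filter(users("age") > 21)  // lazy: the plan just grows
                      .select("name", "age")
    println(adults.count())                       // action: triggers execution

The later sketches reuse this spark session.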
Data Model and DataFrame Operations

 Spark SQL uses a nested data model based on Hive
 It supports all major SQL data types, including boolean, integer, double, decimal, string, date and timestamp, as well as user-defined data types

Example of DataFrame Operations
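The slide's example is an image in the original; here is a comparable sketch of typical operations, with a hypothetical employees dataset:

    import org.apache.spark.sql.functions._

    // Each call below only extends the logical plan; show() executes it.
    val employees = spark.read.parquet("employees.parquet")
    employees
      .where(col("salary") > 50000)
      .groupBy("deptId")
      .agg(count("*").as("headcount"), avg("salary").as("avgSalary"))
      .show()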


DataFrame Operations Cont.

#Access DF with DSL or SQL
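Both routes compile to the same Catalyst plan. A sketch, assuming a people DataFrame with name and age columns:

    import org.apache.spark.sql.functions.col

    // 1) The DataFrame DSL: errors surface eagerly, when the plan is analyzed
    val viaDsl = people.filter(col("age") >= 18).select("name")

    // 2) SQL over a registered temporary view
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT name FROM people WHERE age >= 18")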


Real World Problems

#Heterogeneous Data Sources
Schema Inference

 Spark SQL can automatically infer the schema of native objects using reflection
 Scala/Java: extracted from the language’s type system
 Python: inferred by sampling the dataset
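A sketch of the Scala case: the schema is derived by reflection from a case class, so no schema is written by hand (the class and rows are illustrative):

    case class Person(name: String, age: Int)

    import spark.implicits._  // brings toDF() into scope for local collections

    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDF()
    people.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)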
In-Memory Caching

#Invoked with .cache()
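A short sketch (path and column are placeholders): .cache() marks the DataFrame for in-memory columnar storage, the first action materializes it, and later queries read the cached data instead of recomputing it:

    import org.apache.spark.sql.functions.col

    val logs = spark.read.parquet("logs.parquet")
    logs.cache()                                // mark for in-memory storage
    logs.count()                                // first action populates the cache
    logs.filter(col("status") === 500).count()  // served from the cached data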


User-Defined Functions

How are Spark SQL’s user-defined functions different from those in traditional database systems?
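Unlike traditional database systems, which require UDFs to be written and registered in a programming environment separate from the query, Spark SQL lets you define them inline as ordinary host-language functions and call them from both the DSL and SQL. A minimal sketch (the df DataFrame and events view are assumptions):

    import org.apache.spark.sql.functions.{col, udf}

    val squared = udf((x: Int) => x * x)         // an ordinary Scala closure
    df.select(squared(col("value"))).show()      // used from the DataFrame DSL

    spark.udf.register("squared", (x: Int) => x * x)
    spark.sql("SELECT squared(value) FROM events")  // or by name from SQL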
Catalyst Optimizer

 Catalyst is based on functional programming constructs in Scala

Purposes

 Ability to add new optimization techniques and features to Spark SQL
 Ability to extend the optimizer
Catalyst Optimization

#Trees

#Rules
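Catalyst represents query plans as trees and expresses optimizations as rules: Scala pattern matches that rewrite subtrees. A self-contained toy version of the constant-folding idea (real Catalyst applies the same pattern over its own Expression trees via transform):

    sealed trait Expr
    case class Literal(value: Int)          extends Expr
    case class Attribute(name: String)      extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // A rule: fold children first, then simplify the current node.
    def fold(e: Expr): Expr = e match {
      case Add(l, r) => (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)  // 1 + 2 => 3
        case (lhs, Literal(0))        => lhs             // x + 0 => x
        case (lhs, rhs)               => Add(lhs, rhs)
      }
      case other => other
    }

    // fold(Add(Attribute("x"), Add(Literal(1), Literal(2))))
    //   == Add(Attribute("x"), Literal(3))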
Catalyst Optimization Cont.

Rule-Based Optimization

Cost-Based Optimization


Query Planning in Spark SQL
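Catalyst takes a query through four phases: analysis, logical optimization, physical planning, and code generation. One way to watch them is explain(extended = true), which prints the parsed, analyzed, optimized and physical plans (reusing the people view registered earlier):

    spark.sql("SELECT name FROM people WHERE age >= 18").explain(true)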
Extension Points

#Open Source Projects


Extension Points Cont.

 Data Sources
Examples:
 CSV
 Avro
 Parquet
 JDBC
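A sketch of reading each of these through the unified reader API (paths and connection settings are placeholders; Avro additionally needs the spark-avro package on the classpath):

    val csv  = spark.read.option("header", "true").csv("data/events.csv")
    val avro = spark.read.format("avro").load("data/events.avro")
    val pq   = spark.read.parquet("data/events.parquet")
    val jdbc = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
      .option("dbtable", "events")
      .load()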
Extension Points Cont.
 User-Defined Types (UDTs)

#Useful for Machine Learning


Advanced Analytics Features

1. Schema Inference for Semi-structured Data

2. Query Federation to External Databases
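A sketch combining both features (paths, URLs and column names are assumptions): the JSON schema is inferred by sampling the records, and a remote JDBC table can be joined with the local data, with simple predicates pushed down to the external database:

    // 1) Schema inference over semi-structured JSON
    val tweets = spark.read.json("tweets.json")
    tweets.printSchema()
    tweets.createOrReplaceTempView("tweets")

    // 2) Query federation: join a remote JDBC table with local data
    spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost/crm")
      .option("dbtable", "users")
      .load()
      .createOrReplaceTempView("users")

    spark.sql("""
      SELECT u.name, t.text
      FROM users u JOIN tweets t ON u.id = t.userId
    """).show()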


Advanced Analytics Features Cont.

3. Integration with Spark’s Machine Learning Library
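DataFrames serve as the exchange format between pipeline stages, so feature extraction and model training compose directly with relational processing. A sketch using spark.ml (column names and the trainingDF input are placeholders):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model    = pipeline.fit(trainingDF)  // trainingDF: "text" and "label" columns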
Evaluation

 SQL Performance
Evaluation Cont.

 DataFrames vs. Native Spark Code


Pipeline Performance
Applications

 Generalized Online Aggregation


 Computational Genomics
 The list is endless, limited only by your imagination…
Conclusion

Our Final Hashtags


#A Platform with
#Automatic optimization
#Complex pipelines that mix relational and complex analytics
#Large-scale data analysis
#Semi-structured data
#Data types for machine learning
#Extensible optimizer called Catalyst
#Easy to add Optimization rules, data sources and data types
