Spark SQL Tutorial
About the Tutorial
This is a brief tutorial that explains the basics of Spark SQL programming.
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using the Spark Framework and become Spark Developers. It will also be useful
for Analytics Professionals and ETL developers.
Prerequisite
Before you proceed with this tutorial, we assume that you have prior exposure to Scala
programming, database concepts, and any flavor of the Linux operating system.
All the content and graphics published in this e-book are the property of Tutorials Point
(I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying,
distributing or republishing any contents or a part of the contents of this e-book in any
manner without the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely
as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I)
Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of
our website or its contents, including this tutorial. If you discover any errors on our
website or in this tutorial, please notify us at contact@tutorialspoint.com
1. SPARK SQL – INTRODUCTION
Industries are using Hadoop extensively to analyze their data sets. The reason is that
the Hadoop framework is based on a simple programming model (MapReduce), and it
enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective.
Here, the main concern is to maintain speed in processing large datasets, in terms of
both waiting time between queries and waiting time to run the program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just
one of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since
Spark has its own cluster management, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which include interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing,
which increases the processing speed of an application.
Speed: Spark helps run an application in a Hadoop cluster up to 100 times faster
in memory, and 10 times faster when running on disk. This is possible by reducing
the number of read/write operations to disk; intermediate processing data is stored
in memory.
Advanced Analytics: Spark supports not only ‘Map’ and ‘Reduce’, but also
SQL queries, streaming data, machine learning (ML), and graph algorithms.
Hadoop YARN: YARN deployment simply means that Spark runs on YARN
without any pre-installation or root access required. It helps integrate Spark
into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on
top of the stack.
Components of Spark
The different components of Spark are described below.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using
Pregel abstraction API. It also provides an optimized runtime for this abstraction.
2. SPARK SQL – RDD
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
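For illustration, here is a minimal sketch of both approaches in the spark-shell; the collection and the HDFS path below are hypothetical:
scala> val data = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)                                  // parallelize an existing collection
scala> val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/data.txt")  // reference a dataset in external storage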
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.
Unfortunately, in most current frameworks, the only way to reuse data between
computations (for example, between two MapReduce jobs) is to write it to an external
stable storage system such as HDFS. Although this framework provides numerous
abstractions for accessing a cluster’s computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel
jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk
I/O. Regarding the storage system, most Hadoop applications spend more than
90% of their time doing HDFS read-write operations.
When doing interactive queries on MapReduce with the current framework, each query
reads its input from and writes its result to stable storage (HDFS), so disk I/O dominates
the overall waiting time.
Let us now try to find out how iterative and interactive operations take place in Spark
RDD.
Note: If the distributed memory (RAM) is not sufficient to store intermediate results (the
state of the job), then it will store those results on the disk.
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access, the next time you query it. There
is also support for persisting RDDs on disk, or replicated across multiple nodes.
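A minimal sketch of persisting an RDD in the spark-shell, assuming a text file such as the employee.txt used later in this tutorial:
scala> val lines = sc.textFile("employee.txt")
scala> lines.persist()      // or lines.cache(); marks the RDD to be kept in memory
scala> lines.count()        // the first action computes the RDD and caches it
scala> lines.count()        // later actions reuse the cached data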
3. SPARK SQL – INSTALLATION
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a Linux-based
system. The following steps show how to install Apache Spark.
First, verify the Java installation:
$ java -version
If Java is already installed on your system, the response shows the installed version. If
you do not have Java installed, install it before proceeding to the next step.
Next, verify the Scala installation:
$ scala -version
If Scala is already installed on your system, the response shows the installed version. If
you do not have Scala installed, proceed to the next step for Scala installation.
Use the following commands to move the extracted Scala files to the /usr/local/scala
directory (run as the root user):
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
After installation, verify it with the same command:
$ scala -version
If Scala is installed correctly, the response shows the installed version.
Similarly, use the following commands to move the extracted Spark files to the
/usr/local/spark directory (run as the root user):
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the Spark bin directory to your PATH (for example, export PATH=$PATH:/usr/local/spark/bin
in ~/.bashrc), and then reload the shell configuration:
$ source ~/.bashrc
$ spark-shell
If Spark is installed successfully, you will see output similar to the following.
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
4. SPARK SQL – FEATURES AND ARCHITECTURE
Integrated: Seamlessly mix SQL queries with Spark programs. Spark SQL lets
you query structured data as a distributed dataset (RDD) in Spark, with
integrated APIs in Python, Scala and Java. This tight integration makes it easy to
run SQL queries alongside complex analytic algorithms (see the sketch after this
feature list).
Unified Data Access: Load and query data from a variety of sources. Schema-RDDs
provide a single interface for efficiently working with structured data, including
Apache Hive tables, Parquet files and JSON files.
Scalability: Use the same engine for both interactive and long queries. Spark
SQL takes advantage of the RDD model to support mid-query fault tolerance,
letting it scale to large jobs too. You do not need to worry about using a different
engine for historical data.
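The following is a minimal sketch of this integration in the spark-shell, assuming the employee.json file used later in this tutorial; the filter condition is purely illustrative:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df = sqlContext.read.json("employee.json")
scala> df.registerTempTable("employee")
scala> val adults = sqlContext.sql("SELECT name FROM employee WHERE age >= 25")
scala> adults.map(row => "Name: " + row(0)).collect().foreach(println)   // mix the SQL result with ordinary Spark operations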
The Spark SQL architecture contains three layers, namely Language API, Schema RDD,
and Data Sources.
Language API: Spark is compatible with different languages. Spark SQL is
supported through language APIs in Python, Scala, Java, and HiveQL.
Schema RDD: Spark Core is designed with a special data structure called RDD.
Generally, Spark SQL works on schemas, tables, and records. Therefore, we can
use a Schema RDD as a temporary table. We can also refer to this Schema RDD
as a DataFrame.
Data Sources: Usually the data source for Spark Core is a text file, an Avro file, etc.
However, the data sources for Spark SQL are different: Parquet files,
JSON documents, Hive tables, and the Cassandra database.
5. SPARK SQL – DATAFRAMES
A DataFrame can be constructed from an array of different sources such as Hive tables,
Structured Data files, external databases, or existing RDDs. This API was designed for
modern Big Data and data science applications taking inspiration from DataFrame in R
Programming and Pandas in Python.
Features of DataFrame
Here is a set of few characteristic features of DataFrame:
Ability to process data ranging in size from kilobytes to petabytes, on clusters
ranging from a single node to many nodes.
Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and
storage systems (HDFS, Hive tables, MySQL, etc.).
State-of-the-art optimization and code generation through the Spark SQL Catalyst
optimizer (a tree transformation framework).
Can be easily integrated with all Big Data tools and frameworks via Spark-Core.
SQLContext
SQLContext is a class used for initializing the functionality of Spark SQL. A
SparkContext object (sc) is required to initialize a SQLContext object.
The following command is used for initializing the SparkContext through spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name sc when the spark-shell
starts.
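The SQLContext itself is then created from sc. A typical command (the same form is used later in the Parquet section) is:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)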
Example
Let us consider an example of employee records in a JSON file named employee.json.
Use the following commands to create a DataFrame (dfs) by reading the JSON document
employee.json, which has the content shown below.
employee.json – Place this file in the directory where the current scala> pointer is
located.
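Its contents are the five employee records used throughout this tutorial (the same data is listed again in the Data Sources chapter), with one JSON object per line:
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}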
DataFrame Operations
DataFrame provides a domain-specific language for structured data manipulation. Here,
we include some basic examples of structured data processing using DataFrames.
Use the following command to read the JSON document named employee.json. The
data is shown as a table with the fields – id, name, and age.
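The read command is sketched below, assuming the SQLContext created above is available as sqlContext:
scala> val dfs = sqlContext.read.json("employee.json")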
scala> dfs.show()
Output
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
scala> dfs.select("name").show()
Output
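For the sample data above, the expected result (a sketch, not captured from a live session) contains only the name column:
+-------+
|   name|
+-------+
| satish|
|krishna|
|  amith|
|  javed|
| prudvi|
+-------+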
scala> dfs.groupBy("age").count().show()
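For the sample data above, the expected result groups the records by age (a sketch; row order may vary):
+---+-----+
|age|count|
+---+-----+
| 25|    1|
| 28|    1|
| 39|    1|
| 23|    2|
+---+-----+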
Generally, in the background, Spark SQL supports two different methods for converting
existing RDDs into DataFrames: the first method uses reflection to infer the schema of an
RDD that contains specific types of objects (Scala case classes); the second method lets
you construct a schema programmatically and then apply it to an existing RDD.
Case classes can also be nested or contain complex types such as Sequences or Arrays.
An RDD of case-class objects can be implicitly converted to a DataFrame and then
registered as a table. Tables can be used in subsequent SQL statements.
Example
Let us consider an example of employee records in a text file named employee.txt.
Create an RDD by reading the data from the text file and convert it into a DataFrame
using default SQL functions.
Given Data: Take a look at the following data in a file named employee.txt, placed
in the directory where the spark-shell is running.
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
$ spark-shell
Create SQLContext
Generate a SQLContext using the following command. Here, sc is the SparkContext
object.
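A typical command (a sketch; the import enables the implicit RDD-to-DataFrame conversion used below):
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.implicits._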
Here, two map functions are defined. One is for splitting the text record into fields
(.map(_.split(","))), and the second map function converts the individual fields (id,
name, age) into one case class object (.map(e => Employee(e(0).trim.toInt, e(1),
e(2).trim.toInt))).
Finally, the toDF() method is used for converting the case class objects, with their
schema, into a DataFrame.
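A minimal sketch of these steps; the case class name Employee is an assumption, while the DataFrame name empl matches the commands that follow:
scala> case class Employee(id: Int, name: String, age: Int)
scala> val empl = sc.textFile("employee.txt").map(_.split(",")).map(e => Employee(e(0).trim.toInt, e(1), e(2).trim.toInt)).toDF()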
scala> empl.registerTempTable("employee")
The employee table is ready. Let us now pass some SQL queries on the table using the
SQLContext.sql() method, as sketched below.
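For example (the exact filter used in the original is not shown; the condition below is illustrative and produces the two DataFrames referenced next):
scala> val allrecords = sqlContext.sql("SELECT * FROM employee")
scala> val agefilter = sqlContext.sql("SELECT id, name, age FROM employee WHERE age >= 20 AND age <= 35")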
To see the result data of allrecords DataFrame, use the following command.
scala> allrecords.show()
Output
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 25|
|1202| krishna| 28|
|1203| amith| 39|
|1204| javed| 23|
|1205| prudvi| 23|
+----+--------+---+
To see the result data of agefilter DataFrame, use the following command.
scala> agefilter.show()
The previous two queries were passed against the whole table DataFrame. Now let us try
to fetch data from the result DataFrame by applying transformations on it, as sketched
below.
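A sketch of such a transformation, assuming the agefilter DataFrame defined above (the id column is at index 0):
scala> agefilter.map(t => "ID: " + t(0)).collect().foreach(println)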
This reflection-based approach leads to more concise code and works well when you
already know the schema while writing your Spark application.
When case classes cannot be defined ahead of time, a DataFrame can be created
programmatically: create an RDD of Rows from the original RDD, create the schema
represented by a StructType matching the structure of those Rows, and then apply the
schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
Example
Let us consider an example of employee records in a text file named employee.txt.
Create a schema using a DataFrame directly, by reading the data from the text file.
Given Data: Look at the following data in a file named employee.txt, placed in the
directory where the spark-shell is running.
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
$ spark-shell
Generate Schema
The following commands generate a schema by reading the schemaString variable. This
means you read each field by splitting the whole string with a space as the delimiter,
taking each field's type as String by default, as sketched below.
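A minimal sketch of the full sequence, under the assumption that the field names are id, name, and age and the resulting DataFrame is named employeeDF (as used below):
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType}
scala> val employee = sc.textFile("employee.txt")
scala> val schemaString = "id name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val rowRDD = employee.map(_.split(",")).map(e => Row(e(0).trim, e(1).trim, e(2).trim))
scala> val employeeDF = sqlContext.createDataFrame(rowRDD, schema)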
scala> employeeDF.registerTempTable("employee")
The employee table is now ready. Let us pass some SQL queries on the table using the
SQLContext.sql() method.
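A sketch of one such query, producing the allrecords DataFrame referenced below:
scala> val allrecords = sqlContext.sql("SELECT * FROM employee")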
To see the result data of allrecords DataFrame, use the following command.
scala> allrecords.show()
Output
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 25|
|1202| krishna| 28|
|1203| amith| 39|
|1204| javed| 23|
|1205| prudvi| 23|
+----+--------+---+
The sqlContext.sql method allows you to construct DataFrames when the columns and
their types are not known until runtime. You can now run different SQL queries on the
table.
6. SPARK SQL – DATA SOURCES
In this chapter, we will describe the general methods for loading and saving data using
different Spark DataSources. Thereafter, we will discuss in detail the specific options that
are available for the built-in data sources.
There are different types of data sources available in SparkSQL, some of which are listed
below:
JSON Datasets
Hive Tables
Parquet Files
JSON Datasets
Spark SQL can automatically capture the schema of a JSON dataset and load it as a
DataFrame. This conversion can be done using SQLContext.read.json() on either an
RDD of String or a JSON file.
Spark SQL provides an option for querying JSON data along with auto-capturing of JSON
schemas for both reading and writing data. Spark SQL understands the nested fields in
JSON data and allows users to directly access these fields without any explicit
transformations.
Example
Let us consider an example of employee records in a JSON file named employee.json.
Use the following commands to create a DataFrame (dfs).
Read the JSON document named employee.json with the following content and generate
a table based on the schema in the JSON document.
employee.json – Place this file into the directory where the current scala> pointer is
located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
DataFrame Operations
DataFrame provides a domain-specific language for structured data manipulation. Here
we include some basic examples of structured data processing using DataFrames.
Use the following command to read the JSON document named employee.json
containing the fields – id, name, and age. It creates a DataFrame named dfs.
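A sketch of the read command, assuming a SQLContext named sqlContext (create it first if the shell session is new):
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val dfs = sqlContext.read.json("employee.json")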
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
scala> dfs.show()
Output
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
We can then run different SQL statements on it. Users can migrate data into JSON format
with minimal effort, regardless of the origin of the data source.
Hive Tables
Hive comes bundled with the Spark library as HiveContext, which inherits from
SQLContext. Using HiveContext, you can create and find tables in the Hive MetaStore
and write queries on them using HiveQL. Users who do not have an existing Hive
deployment can still create a HiveContext. When it is not configured by hive-site.xml,
the context automatically creates a metastore called metastore_db and a folder called
warehouse in the current directory.
Consider the following example of employee record using Hive tables. All the recorded
data is in the text file named employee.txt. Here, we will first initialize the HiveContext
object. Using that, we will create a table, load the employee record data into it using
HiveQL language, and apply some queries on it.
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
$ su
password:
#spark-shell
scala>
Use the following command to initialize the HiveContext in the Spark shell.
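A minimal sketch of the whole sequence; the table definition and the queries are illustrative, and result matches the DataFrame name used below:
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")   // note: leading spaces in the sample file may affect how Hive parses the INT columns
scala> val result = sqlContext.sql("FROM employee SELECT id, name, age")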
To display the record data, call the show() method on the result DataFrame.
scala> result.show()
Output
+----+---------+---+
|  id|     name|age|
+----+---------+---+
|1201|   satish| 25|
|1202|  krishna| 28|
|1203|    amith| 39|
|1204|    javed| 23|
|1205|   prudvi| 23|
+----+---------+---+
Parquet Files
Parquet is a columnar format, supported by many data processing systems. The
advantages of columnar storage are as follows:
Columnar storage can fetch the specific columns that you need to access.
Spark SQL provides support for both reading and writing parquet files that automatically
capture the schema of the original data. Like JSON datasets, parquet files follow the
same procedure.
Let us take another look at the same example of employee record data, this time as a
Parquet file named employee.parquet placed in the same directory where the spark-shell
is running.
Given data: You do not need to convert the input data of employee records into Parquet
format yourself. The following commands convert the employee.json data into a Parquet
file. Place the employee.json document, which we have used as the input file in our
previous examples, in the working directory.
$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.write.parquet("employee.parquet")
It is not possible to display the Parquet file directly; it is a directory structure, which you
can find in the current directory. To see the directory and file structure, use the
following commands.
$ cd employee.parquet/
$ ls
_common_metadata
part-r-00001.gz.parquet
_metadata
_SUCCESS
The following commands are used for reading, registering into table, and applying some
queries on it.
$ spark-shell
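A sketch of the read command; the variable name Parqfile matches the registerTempTable call below, and a fresh SQLContext is created in case this is a new shell session:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val Parqfile = sqlContext.read.parquet("employee.parquet")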
scala> Parqfile.registerTempTable("employee")
The employee table is ready. Let us now pass some SQL queries on the table using the
method SQLContext.sql().
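A sketch of one such query, producing the allrecords DataFrame referenced below:
scala> val allrecords = sqlContext.sql("SELECT * FROM employee")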
To see the result data of allrecords DataFrame, use the following command.
scala> allrecords.show()
Output
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 25|
|1202| krishna| 28|
|1203| amith| 39|
|1204|   javed| 23|
|1205|  prudvi| 23|
+----+--------+---+