Spark SQL
Spark - Introduction
Industries use Hadoop extensively to analyze their data sets. The reason is that the Hadoop
framework is based on a simple programming model (MapReduce) and it enables a
computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main
concern, however, is the speed of processing large datasets, both in terms of the waiting time
between queries and the waiting time to run a program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to common belief, Spark is not a modified version of Hadoop, nor does it really
depend on Hadoop, because it has its own cluster management. Hadoop is just one of the
ways to deploy Spark.
Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own
cluster management, it typically uses Hadoop for storage only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It builds on the MapReduce model and extends it to efficiently support more
types of computations, including interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing, which increases the processing speed of
an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
Spark – RDD
Resilient Distributed Datasets
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in an RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain any
type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. An RDD is
a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format.
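A minimal sketch of both approaches in the Scala shell (the collection contents and the file name below are illustrative only):
scala> val data = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)       // RDD from an existing collection in the driver
scala> val lines = sc.textFile("employee.txt")   // RDD from a file in external storage (local, HDFS, etc.)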
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce-style
operations, largely by avoiding the repeated reads and writes to stable storage that classic
MapReduce jobs require between stages.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access the next time you query it. There is
also support for persisting RDDs on disk, or replicating them across multiple nodes.
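A minimal sketch of persisting an RDD in the Scala shell (the file name and storage level are illustrative only):
scala> import org.apache.spark.storage.StorageLevel
scala> val lines = sc.textFile("employee.txt")
scala> lines.persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in memory, spilling to disk if needed
scala> lines.count()   // the first action computes the RDD and caches it
scala> lines.count()   // later actions reuse the cached partitions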
Spark - Installation
Spark is usually deployed alongside Hadoop, so it is best installed on a Linux-based
system. The following steps show how to install Apache Spark.
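A minimal sketch of a typical installation, assuming a pre-built binary release downloaded from the Apache Spark downloads page (the release name and install path are illustrative only):
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz                 # extract the downloaded release
$ sudo mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark    # move it to an install location of your choice
$ export PATH=$PATH:/usr/local/spark/bin                # make spark-shell available on the PATH
$ spark-shell                                           # verify the installation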
Features of DataFrame
Here is a set of a few characteristic features of DataFrame −
Ability to process data ranging in size from kilobytes to petabytes, on a single-node cluster or a large cluster.
Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework).
Can be easily integrated with all Big Data tools and frameworks via Spark Core.
Provides APIs for Python, Java, Scala, and R programming.
SQLContext
SQLContext is a class used for initializing the functionality of Spark SQL. A SparkContext
class object (sc) is required for initializing a SQLContext class object.
The following command is used for initializing the SparkContext through spark-shell.
$ spark-shell
By default, the SparkContext object is initialized with the name sc when the spark-shell
starts.
Use the following command to create SQLContext.
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
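In the spark-shell, a SQLContext is also pre-created under the name sqlContext. For implicit conversions such as turning RDDs into DataFrames with toDF() (used later), its implicits can be imported; depending on the Spark version, the shell may already do this automatically:
scala> import sqlContext.implicits._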
Example
Let us consider an example of employee records in a JSON file named employee.json. Use
the commands shown in the following sections to create a DataFrame (dfs) from a JSON
document named employee.json with the following content.
employee.json − Place this file in the directory where the current scala> pointer is located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
DataFrame Operations
DataFrame provides a domain-specific language for structured data manipulation. Here, we
include some basic examples of structured data processing using DataFrames.
Follow the steps given below to perform DataFrame operations −
Read the JSON Document
First, we have to read the JSON document and generate a DataFrame named dfs.
Use the following command to read the JSON document named employee.json. The data is
shown as a table with the fields id, name, and age.
scala> val dfs = sqlContext.read.json("employee.json")
Output − The field names are taken automatically from employee.json.
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
Show the Data
If you want to see the data in the DataFrame, then use the following command.
scala> dfs.show()
Output − You can see the employee data in a tabular format.
<console>:22, took 0.052610 s
+----+------+--------+
|age | id | name |
+----+------+--------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
| 23 | 1204 | javed |
| 23 | 1205 | prudvi |
+----+------+--------+
Use printSchema Method
If you want to see the structure (schema) of the DataFrame, use the following
command.
scala> dfs.printSchema()
Output
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
Use Select Method
Use the following command to fetch the name column from the
DataFrame.
scala> dfs.select("name").show()
Output − You can see the values of the name column.
<console>:22, took 0.044023 s
+--------+
| name |
+--------+
| satish |
| krishna|
| amith |
| javed |
| prudvi |
+--------+
Use Age Filter
Use the following command to find the employees whose age is greater than 23 (age > 23).
scala> dfs.filter(dfs("age") > 23).show()
Output
<console>:22, took 0.078670 s
+----+------+--------+
|age | id | name |
+----+------+--------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith |
+----+------+--------+
Use groupBy Method
Use the following command to count the number of employees of each age.
scala> dfs.groupBy("age").count().show()
Output − Two employees are aged 23.
<console>:22, took 5.196091 s
+----+-----+
|age |count|
+----+-----+
| 23 | 2 |
| 25 | 1 |
| 28 | 1 |
| 39 | 1 |
+----+-----+
Running SQL Queries Programmatically
A SQLContext enables applications to run SQL queries programmatically and returns the
result as a DataFrame.
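A minimal sketch, assuming the dfs DataFrame created from employee.json above (the temporary table name employee is illustrative only):
scala> dfs.registerTempTable("employee")
scala> val youngsters = sqlContext.sql("SELECT name, age FROM employee WHERE age <= 25")
scala> youngsters.show()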
In general, Spark SQL supports two different methods for converting existing RDDs into
DataFrames − inferring the schema using reflection, and programmatically specifying the schema.
Example
Let us consider an example of employee records in a text file named employee.txt. Create an
RDD by reading the data from the text file and convert it into a DataFrame using default SQL
functions.
Given Data − Take a look at the following data in a file named employee.txt, placed in the
directory where the spark shell is running.
1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23
The following example explains how to generate a schema for this data using reflection.
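A minimal sketch in the Scala shell, using the employee.txt data shown above (the case class name and temporary table name are illustrative only):
scala> import sqlContext.implicits._   // needed for toDF(); the shell may import this automatically
scala> case class Employee(id: Int, name: String, age: Int)
scala> val empl = sc.textFile("employee.txt").map(_.split(",")).map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt)).toDF()
scala> empl.registerTempTable("employee")
scala> sqlContext.sql("SELECT * FROM employee WHERE age >= 23").show()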
Spark SQL supports operating on a variety of data sources through the DataFrame interface −
1. JSON Datasets − Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame.
2. Hive Tables − Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.
3. Parquet Files − Parquet is a columnar format, supported by many data processing systems.
Example
Let us consider an example of employee records in a JSON file named employee.json. Use the
commands shown below to create a DataFrame (dfs).
Read a JSON document named employee.json with the following content and generate a
table based on the schema in the JSON document.
employee.json − Place this file in the directory where the current scala> pointer is located.
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
Let us perform some DataFrame operations on the given data.
DataFrame Operations
DataFrame provides a domain-specific language for structured data manipulation. Here we
include some basic examples of structured data processing using DataFrames.
Follow the steps given below to perform DataFrame operations −
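A minimal sketch of those steps, mirroring the earlier example (the variable name dfs is illustrative only):
scala> val dfs = sqlContext.read.json("employee.json")
scala> dfs.show()
scala> dfs.printSchema()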
When a DataFrame is saved in Parquet format, Spark writes a directory rather than a single
file. Listing such a directory shows files like the following −
$ ls
_common_metadata
part-r-00001.gz.parquet
_metadata
_SUCCESS
The following commands are used for reading the Parquet data, registering it as a table, and
applying some queries on it.
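A minimal sketch of those commands, assuming the dfs DataFrame from the JSON example above and a Parquet directory named employee.parquet (both names are illustrative only):
scala> dfs.write.parquet("employee.parquet")                       // save the DataFrame as a Parquet directory
scala> val parqfile = sqlContext.read.parquet("employee.parquet")  // load it back as a DataFrame
scala> parqfile.registerTempTable("employee")                      // register a temporary table
scala> sqlContext.sql("SELECT * FROM employee").show()             // query it with SQL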