Spark SQL
Spark SQL integrates relational processing with Spark’s functional programming. It provides support for various data
sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
With Spark SQL, Apache Spark becomes accessible to more users and gains better optimization for existing ones. Spark SQL
provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in
distributed collections. It introduces an extensible optimizer called Catalyst, which helps support a wide range of data
sources and algorithms in Big Data.
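One quick way to see the Catalyst optimizer at work is the explain() method; a minimal PySpark sketch (the query itself is arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# explain(True) prints the parsed, analyzed and optimized logical plans plus the physical plan chosen by Catalyst
df = spark.range(100).filter("id > 50").select("id")
df.explain(True)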
SQL Service:
SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as
well as the execution of SQL queries.
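For illustration, a minimal PySpark sketch of this entry point (the app name and the sample data are arbitrary); in Spark 2.x the SparkSession plays this role, exposing both DataFrame creation and SQL execution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-service-demo").getOrCreate()

# Create a DataFrame from local data, register it as a temporary view, and query it with SQL
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.createOrReplaceTempView("items")
spark.sql("SELECT name FROM items WHERE id = 2").show()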
Features Of Spark SQL
The following are the features of Spark SQL:
Integrated
Spark SQL queries are integrated with Spark programs. Spark SQL allows us to query structured data inside Spark programs,
using either SQL or a DataFrame API, which can be used in Java, Scala, Python and R. To run a streaming computation, developers
simply write a batch computation against the DataFrame / Dataset API, and Spark automatically incrementalizes the
computation to run it in a streaming fashion. This powerful design means that developers don’t have to manually manage
state, failures, or keep the application in sync with batch jobs. Instead, the streaming job always gives the same answer
as a batch job on the same data.
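As a hedged sketch of that idea in PySpark (assuming an active SparkSession named spark and a text stream on localhost:9999, both placeholders):
# Read a stream of lines from a socket source
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# The same batch-style aggregation, run incrementally by Spark
counts = lines.groupBy("value").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()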
Hive Compatibility
Spark SQL runs unmodified Hive queries on existing data. It reuses the Hive frontend and metastore, giving full
compatibility with existing Hive data, queries, and UDFs.
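As a minimal sketch (the table name sales is hypothetical; an existing Hive metastore is assumed):
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore
spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

# An unmodified HiveQL query against an existing Hive table (hypothetical name)
spark.sql("SELECT COUNT(*) FROM sales").show()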
Standard Connectivity
Connection is through JDBC or ODBC. JDBC and ODBC are the industry norms for connectivity for business intelligence
tools.
Performance And Scalability
Spark SQL incorporates a cost-based optimizer, code generation and columnar storage to make queries fast while scaling
to thousands of nodes using the Spark engine, which provides full mid-query fault tolerance. The interfaces
provided by Spark SQL give Spark more information about the structure of both the data and the computation
being performed; internally, Spark SQL uses this extra information to perform additional optimizations. Spark SQL can directly
read from multiple sources (files on HDFS, JSON/Parquet files, existing RDDs, Hive tables, etc.) and ensures fast execution of existing
Hive queries.
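For example (paths and the Hive table name are placeholders; spark is a SparkSession as in the examples below):
# JSON, Parquet and CSV files can be read directly into DataFrames
json_df = spark.read.json("examples/src/main/resources/employee.json")
parquet_df = spark.read.parquet("path/to/data.parquet")   # placeholder path
csv_df = spark.read.csv("path/to/data.csv", header=True)  # placeholder path

# Existing Hive tables can be queried directly as well (requires Hive support)
hive_df = spark.sql("SELECT * FROM some_hive_table")       # hypothetical table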
User Defined Functions
Spark SQL has language-integrated User-Defined Functions (UDFs). A UDF is a feature of Spark SQL to define new Column-
based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets. UDFs are black boxes in their
execution: Catalyst cannot look inside them, so it cannot optimize the code they contain.
The example below defines a UDF to convert a given text to upper case.
Code explanation:
1. Creating a dataset “hello world”
2. Defining a function ‘upper’ which converts a string into upper case.
3. Importing the ‘udf’ function from Spark SQL.
4. Defining our UDF, ‘upperUDF’, by wrapping our function ‘upper’.
5. Displaying the results of our User Defined Function in a new column ‘upper’.
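A minimal PySpark sketch of these five steps (the column name text is chosen for illustration; the Scala examples below follow the same pattern):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# 1. Create a dataset containing "hello world"
df = spark.createDataFrame([("hello world",)], ["text"])

# 2. Define a function 'upper' which converts a string to upper case
def upper(s):
    return s.upper()

# 3-4. Wrap 'upper' with the udf helper to define our UDF 'upperUDF'
upperUDF = udf(upper, StringType())

# 5. Display the result of the UDF in a new column 'upper'
df.withColumn("upper", upperUDF(df["text"])).show()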
Example:
// Register the function as a UDF named "myUpper" so it can be called from SQL:
spark.udf.register("myUpper", (input: String) => input.toUpperCase)
// Check that the UDF appears in the session catalog:
spark.catalog.listFunctions.filter('name like "%upper%").show(false)
Example:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
import spark.implicits._
val df = spark.read.json("examples/src/main/resources/employee.json")
df.show()
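To query this DataFrame with SQL, it can first be registered as a temporary view; a short sketch, assuming employee.json contains name and age fields (the calls are the same in Scala and PySpark):
df.createOrReplaceTempView("employee")
spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30").show()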
Querying using SparkSQL:
In Apache Spark, a DataFrame is a distributed collection of rows under named columns. In simple terms, it is the same as a
table in a relational database or an Excel sheet with column headers. It also shares some common characteristics with
RDDs:
Immutable in nature: we can create a DataFrame / RDD once but cannot change it; applying a transformation returns a new DataFrame / RDD.
Lazy evaluation: a task is not executed until an action is performed (see the sketch below).
Distributed: both RDDs and DataFrames are distributed in nature.
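A small sketch of these properties in PySpark (assuming an active SparkSession named spark, e.g. in the PySpark shell):
df = spark.range(10)                          # defines a DataFrame; nothing executes yet
doubled = df.selectExpr("id * 2 AS doubled")  # a transformation returns a new, immutable DataFrame
doubled.count()                               # an action triggers the distributed computation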
Create a list of tuples. Each tuple contains the name of a person and an age.
Create an RDD from the list above.
Convert each tuple to a Row.
Create a DataFrame by applying createDataFrame on the RDD with the help of sqlContext, as sketched below.
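A sketch of these four steps in PySpark (the names and ages are sample values; sc and sqlContext are assumed to be an existing SparkContext and SQLContext):
from pyspark.sql import Row

# 1. A list of tuples, each holding a name and an age
people_list = [("Alice", 25), ("Bob", 32), ("Carol", 19)]

# 2. Create an RDD from the list
people_rdd = sc.parallelize(people_list)

# 3. Convert each tuple to a Row
people_rows = people_rdd.map(lambda p: Row(name=p[0], age=int(p[1])))

# 4. Create a DataFrame from the RDD of Rows with sqlContext
schemaPeople = sqlContext.createDataFrame(people_rows)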
type(schemaPeople)
Output:
pyspark.sql.dataframe.DataFrame
# header=True picks up the column names; inferSchema=True gives numeric types for columns such as Purchase
train = spark.read.csv("Downloads/train.csv", header=True, inferSchema=True)
test = spark.read.csv("Downloads/test.csv", header=True, inferSchema=True)
DataFrame Manipulations:
To see the types of columns in a DataFrame, we can use printSchema or dtypes. Let’s apply printSchema() on train, which will
print the schema in a tree format.
train.printSchema()
We can use the head operation to see the first n observations (say, 5 observations). The head operation in PySpark is similar to the head
operation in Pandas.
train.head(5)
The above results are returned as a list of Row objects. To see the result in a more interactive manner (rows under the columns), we
can use the show operation. Let’s apply show on train and take the first 2 rows of it. We can pass the argument
truncate=True to truncate long values in the result.
train.show(2)
The count operation gives the number of rows in a DataFrame:
train.count()
test.count()
How to get the summary statistics (mean, standard deviation, min, max, count) of numerical columns in a DataFrame?
The describe operation is used to calculate the summary statistics of numerical column(s) in a DataFrame. If we don’t specify the
names of columns, it will calculate summary statistics for all numerical columns present in the DataFrame.
train.describe().show()
Let’s check what happens when we specify the name of a categorical / String column in the describe operation.
train.describe("age").show()
To subset the columns, we need to use the select operation on the DataFrame, passing the column names separated
by commas inside the select operation.
train.select("age", "name").show()
How to find the number of distinct rows:
The distinct operation can be used here to calculate the number of distinct rows in a DataFrame.
train.select("age").distinct().count()
To calculate the pairwise frequency of two categorical columns, we can use the crosstab operation.
train.crosstab('Age', 'Gender').show()
Output:
+----------+-----+------+
|Age_Gender| F| M|
+----------+-----+------+
| 0-17| 5083| 10019|
| 46-50|13199| 32502|
| 18-25|24628| 75032|
| 36-45|27170| 82843|
| 55+| 5083| 16421|
| 51-55| 9894| 28607|
| 26-35|50752|168835|
+----------+-----+------+
What if I want to get a DataFrame without the duplicate rows of a given DataFrame?
We can use the dropDuplicates operation to drop the duplicate rows of a DataFrame and get a DataFrame that has no
duplicate rows.
train.select('Age','Gender').dropDuplicates().show()
What if I want to drop all rows that contain a null value?
The dropna operation can be used here. To drop rows from the DataFrame, it considers three options:
how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
thresh – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overrides the how
parameter.
subset – optional list of column names to consider.
train.na.drop().show()
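These parameters can also be passed explicitly; for example (using the Age and Gender columns from above):
# Drop rows that have a null in either of the listed columns
train.na.drop(how='any', subset=['Age', 'Gender']).show()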
What if I want to fill the null values in the DataFrame with a constant value?
The fillna operation (also available as na.fill) can be used here to replace null values with a specified constant.
train.na.fill(-1).show(2)
What if I want to filter the rows in train which have a Purchase of more than 15000?
We can apply the filter operation on the Purchase column in the train DataFrame to filter out the rows with values more than
15000. We need to pass a condition. Let’s apply a filter on the Purchase column in the train DataFrame and print the number of
rows which have a purchase of more than 15000.
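A sketch of that filter (assuming the Purchase column was read as a numeric type):
# Rows of train with Purchase greater than 15000, and how many there are
train.filter(train.Purchase > 15000).show(5)
train.filter(train.Purchase > 15000).count()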
The groupby operation can be used here to find the mean of Purchase for each age group in train. Let’s see how we can
get the mean Purchase for each value of the ‘Age’ column in train.
train.groupBy('Age').agg({'Purchase': 'mean'}).show()
We can also apply sum, min, max and count with groupby when we want different summary insights for each group.
train.groupBy('Age').count().show()
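For example, the maximum Purchase in each age group (a sketch using the same columns):
train.groupBy('Age').agg({'Purchase': 'max'}).show()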
Interoperating with RDDs
To convert existing RDDs into DataFrames, Spark SQL supports two methods:
Reflection-based method: infers the schema of an RDD that contains specific types of objects. It works well when the schema is
already known while writing the Spark application.
Programmatic method: enables you to construct a schema and apply it to an existing RDD. It allows building DataFrames
when you do not know the columns and their types until runtime.
In the next example, you will create an RDD of Person objects and register it as a table.
// sc is an existing SparkContext and sqlContext its associated SQLContext:
import sqlContext.implicits._
// Define the schema with a case class, then map each line of the input file
// (shown here as the standard Spark example file people.txt) to a Person:
case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext:
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// By field name:
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
SQL statements are run using the sql method provided by the SQLContext; the rows of the result can be accessed by field
name, and getValuesMap retrieves multiple columns at once into a Map.
Using the Programmatic Approach
This method is used when you cannot define case classes ahead of time; for example, when the structure of records is
encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users.
To create a DataFrame using the programmatic approach, the following steps can be used:
Create an RDD of Rows from the original RDD.
Create the schema, represented by a StructType, which matches the structure of the Rows.
Apply the schema to the RDD of Rows using the createDataFrame method.
In the next example, sc is an existing SparkContext. You will create an RDD, encode the schema in a string, and then
generate the schema based on that string.
// sc is an existing SparkContext:
// Create an RDD (shown here from the standard Spark example file people.txt):
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string:
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the string of schema:
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows:
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD:
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
The records of the RDD people are converted to Rows, and the schema is then applied to the RDD to produce the DataFrame.