Booking Hotel, Flight, Train, Event & Rental Car
Apache Spark
Created By Josi Aranda @ Tiket.com
Apache Spark
• Apache Spark is a powerful open-source distributed querying and processing engine.
• It provides the flexibility and extensibility of MapReduce, but at significantly higher
speeds: up to 100 times faster than Apache Hadoop when the data is stored in memory,
and up to 10 times faster when accessing disk.
Spark’s Features
Apache Spark achieves high performance
for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query
optimizer, and a physical execution engine.
Speed
Logistic regression in Hadoop and Spark
Spark’s Features
Write applications quickly in Java, Scala,
Python, R, and SQL. Spark offers over 80
high-level operators that make it easy to
build parallel apps. And you can use it
interactively from the Scala, Python, R, and
SQL shells.
Ease of Use
Spark's Python DataFrame API
Read JSON files with automatic schema
inference
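The code screenshot from this slide is not reproduced here; below is a minimal PySpark sketch of the same idea, reading JSON with automatic schema inference (the file path is a placeholder).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-schema-inference").getOrCreate()

# Spark infers the column names and types directly from the JSON records.
df = spark.read.json("orders.json")   # placeholder path
df.printSchema()
df.show(5)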
Spark’s Features
Combine SQL, streaming, and complex
analytics. Spark powers a stack of libraries
including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark
Streaming. You can combine these libraries
seamlessly in the same application.
Generality
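As a hedged illustration of this generality (MLlib and GraphX are left out), the sketch below mixes Spark SQL with Structured Streaming in a single application; the built-in rate source is only a synthetic test source and the window size is arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generality-demo").getOrCreate()

# Structured Streaming reuses the same DataFrame/SQL abstractions as batch jobs.
# The built-in "rate" source just emits (timestamp, value) rows for testing.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# SQL and the DataFrame API mix freely, even on a stream.
events.createOrReplaceTempView("events")
per_window = spark.sql(
    "SELECT window(timestamp, '10 seconds') AS w, COUNT(*) AS n "
    "FROM events GROUP BY window(timestamp, '10 seconds')"
)

query = per_window.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)   # wait up to ~30 seconds
query.stop()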
Spark’s Features
Spark runs on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud. It
can access diverse data sources.
Runs Everywhere
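A minimal sketch of what "runs everywhere" means in practice: the application code stays the same and only the master URL changes (the values shown are illustrative; in production the master is usually supplied via spark-submit --master).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("runs-everywhere-demo")
    # local mode here; on a cluster this would be e.g. "yarn",
    # "mesos://host:5050", or "k8s://https://host:443" instead.
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.master)   # confirms which cluster manager is in use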
Spark Execution Process
• Any Spark application spins off a single driver process (which can contain multiple jobs)
on the master node, which then directs executor processes (each containing multiple
tasks) distributed across a number of worker nodes.
• The driver process determines the number and composition of the task processes
directed to the executor nodes, based on the graph generated for the given job. Note
that any worker node can execute tasks from a number of different jobs.
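A rough sketch of how the driver side can shape those executor processes; the property names are standard Spark settings, but the values are placeholders (and spark.executor.instances only applies under resource managers such as YARN or Kubernetes).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("execution-demo")
    .config("spark.executor.instances", "6")   # number of executor processes (placeholder)
    .config("spark.executor.cores", "2")       # tasks each executor can run in parallel
    .config("spark.executor.memory", "4g")     # memory per executor (placeholder)
    .getOrCreate()
)
# Every action submitted through this session becomes a job that the driver
# splits into stages and tasks and hands to the executors.
print(spark.sparkContext.defaultParallelism)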
Resilient Distributed Dataset (RDD)
• Resilient Distributed Datasets (RDDs) are distributed collections of immutable JVM
objects that allow you to perform calculations very quickly, and they are the backbone
of Apache Spark.
• RDDs support two sets of parallel operations: transformations (which return pointers to
new RDDs) and actions (which return values to the driver after running a computation).
• RDD transformation operations are lazy in the sense that they do not compute their
results immediately. Transformations are only computed when an action is executed
and the results need to be returned to the driver.*
* An RDD is like a teenager doing chores: they won't do it until their mom starts to check
(and then they do it fast and effectively).
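A small sketch of that laziness, using made-up data: the transformations below only build a lineage graph, and nothing runs until the action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))      # no computation yet
evens = numbers.filter(lambda n: n % 2 == 0)    # transformation: still lazy
squared = evens.map(lambda n: n * n)            # transformation: still lazy

# The action finally triggers the whole chain.
print(squared.take(5))   # [0, 4, 16, 36, 64]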
RDD (cont.)
Mostly Used RDD Operations
Transformations
• .map()
• .filter()
• .flatMap()
• .distinct()
• .sample()
• .leftOuterJoin()
• .repartition()
Actions
• .take()
• .collect()
• .reduce()
• .count()
• .saveAsTextFile()
• .foreach()
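A quick, hedged tour of a few of the operations listed above, on a toy in-memory RDD (the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-demo").getOrCreate()
sc = spark.sparkContext

# .flatMap() and .distinct() are transformations (lazy);
# .count() and .collect() are actions that trigger execution.
words = sc.parallelize(["hotel flight", "flight train", "event"])
tokens = words.flatMap(lambda s: s.split(" "))
unique = tokens.distinct()
print(unique.count())             # 4
print(sorted(unique.collect()))   # ['event', 'flight', 'hotel', 'train']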
RDD (cont.)
SparkContext().textFile('order__cart.csv')
[Diagram: each line of order__cart.csv, e.g.
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
becomes one element of the resulting RDD, distributed across n partitions.
Legend: CSV line, partition, RDD.]
RDD (cont.)
[Diagram: the order__cart.csv RDD flowing through a chain of transformations]
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
.filter(lambda line: line[1] == 'paid')
56312, paid, native_apps
56313, paid, web
56315, paid, web
.map(lambda line: (line[2], 1))
(native_apps, 1)
(web, 1)
(web, 1)
.reduceByKey(lambda x, y: x + y)
(native_apps, 1)
(web, 2)
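A runnable PySpark version of this flow, assuming the same hypothetical order__cart.csv; unlike the slide, it first splits each line into fields, since textFile() yields whole lines as strings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("order-cart-pipeline").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("order__cart.csv")
rows = lines.map(lambda line: [f.strip() for f in line.split(",")])

paid_per_channel = (
    rows.filter(lambda row: row[1] == "paid")   # keep paid orders only
        .map(lambda row: (row[2], 1))           # (channel, 1) pairs
        .reduceByKey(lambda x, y: x + y)        # sum the counts per channel
)
print(paid_per_channel.collect())   # e.g. [('native_apps', 1), ('web', 2)]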
Spark DataFrame
• A DataFrame is an immutable distributed collection of data that is organized into
named columns, analogous to a table in a relational database. Introduced as an
experimental feature within Apache Spark 1.0 as SchemaRDD, they were renamed to
DataFrames as part of the Apache Spark 1.3 release.
• Imposing a structure onto a distributed collection of data allows Spark users to query
structured data with Spark SQL or with expression methods (instead of lambdas).
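A small sketch of querying with expression methods instead of lambdas; the order data and column names are made up to mirror the RDD example.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

orders = spark.createDataFrame(
    [(56312, "paid", "native_apps"), (56313, "paid", "web"),
     (56314, "shopping_cart", "web"), (56315, "paid", "web")],
    ["order_id", "status", "channel"],
)

# Expression methods instead of lambdas: the query optimizer sees the whole plan.
(orders.filter(F.col("status") == "paid")
       .groupBy("channel")
       .agg(F.count("*").alias("paid_orders"))
       .show())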
Ways to Create DataFrame
Spark SQL can build DataFrames from a variety of sources:
• General files (without a header or schema, columns default to _c0, _c1, _c2, _c3, …)
• Parquet
• ORC files
• JSON
• Hive tables
• JDBC
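Hedged examples of a few of these sources; all paths, table names, and the JDBC URL are placeholders, not real endpoints.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sources").enableHiveSupport().getOrCreate()

csv_df  = spark.read.csv("orders.csv", inferSchema=True)   # headerless files get _c0, _c1, ... column names
json_df = spark.read.json("orders.json")
pq_df   = spark.read.parquet("orders.parquet")
orc_df  = spark.read.orc("orders.orc")
hive_df = spark.table("analytics.orders")                   # requires Hive support to be enabled
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://db-host:3306/shop")
           .option("dbtable", "orders")
           .option("user", "report")
           .option("password", "change-me")
           .load())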
Ways to Create DataFrame (cont.)
a) Traditional DataFrame creation. b) DataFrame creation with direct SQL. Both return
the same result.
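A sketch of the two styles described in the caption, assuming the same hypothetical order__cart.csv (column names are assigned manually because the file has no header):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("two-ways-demo").getOrCreate()
df = (spark.read.csv("order__cart.csv", inferSchema=True)
          .toDF("order_id", "status", "channel"))

# a) "Traditional" creation: DataFrame expression methods
a = (df.filter(F.trim(F.col("status")) == "paid")
       .groupBy("channel")
       .agg(F.count("*").alias("paid_orders")))

# b) Direct SQL over a temporary view
df.createOrReplaceTempView("order_cart")
b = spark.sql(
    "SELECT channel, COUNT(*) AS paid_orders "
    "FROM order_cart WHERE trim(status) = 'paid' GROUP BY channel"
)

a.show()   # both queries produce the same result
b.show()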
Spark Dataset
• Introduced in Apache Spark 1.6, the goal of Spark Datasets was to provide an API
that allows users to easily express transformations on domain objects, while also
providing the performance and benefits of the robust Spark SQL execution engine.
As part of the Spark 2.0 release, the DataFrame API was merged into the Dataset API,
unifying data processing capabilities across all libraries.
• Conceptually, a Spark DataFrame is an alias for a collection of generic objects,
Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast,
is a collection of strongly typed JVM objects, dictated by a case class in Scala or a
class in Java.
Performance Benchmark (kind of)
[Chart: calculating quarterly gross revenue, execution time in seconds (lower is better),
comparing MySQL*, MapReduce, Spark RDD, Spark DataFrame, and Spark DataFrame (direct SQL).]
Year  Quarter  B2C Gross rev.
2016  1          419,563,291,996
2016  2          574,505,787,224
2016  3          537,110,941,199
2016  4          639,459,753,264
2017  1          456,482,358,961
2017  2          587,207,246,225
2017  3          660,881,531,765
2017  4          742,243,815,992
2018  1        1,124,567,178,623
• Spark cluster: 6 worker nodes, 2 vCPUs (12 YARN cores), 13 GB memory (62.4 GB YARN memory)
• * MySQL: single node, 32 vCPUs, 120 GB memory
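For reference, a hedged sketch of how such a quarterly gross-revenue query might look with the DataFrame API; the bookings table, its path, and its column names (order_date, gross_revenue) are assumptions, not the schema actually used in this benchmark.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quarterly-gross-rev").getOrCreate()

# Hypothetical bookings data; the real schema behind the benchmark is not shown in the deck.
bookings = spark.read.parquet("bookings.parquet")

quarterly = (
    bookings
    .withColumn("year", F.year("order_date"))
    .withColumn("quarter", F.quarter("order_date"))
    .groupBy("year", "quarter")
    .agg(F.sum("gross_revenue").alias("b2c_gross_rev"))
    .orderBy("year", "quarter")
)
quarterly.show()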

  • 16. Performance Benchmark (kind of) 0 200 400 600 800 MySQL* MapReduce Spark RDD Spark DataFrame Spark DataFrame(direct sql) Calculate Quarterly Gross rev. Execution time in seconds (lower is better) Year Quarter B2C Gross rev. 2016 1 419,563,291,996 2016 2 574,505,787,224 2016 3 537,110,941,199 2016 4 639,459,753,264 2017 1 456,482,358,961 2017 2 587,207,246,225 2017 3 660,881,531,765 2017 4 742,243,815,992 2018 1 1,124,567,178,623 • 6 worker nodes, 2vcpus(12 YARN cores), 13GB memory (62.4GB YARN memory) • *single node, 32vcpus, 120GB memory