Booking Hotel, Flight, Train, Event & Rental Car
Apache Spark
Created By Josi Aranda @ Tiket.com
Apache Spark
• Apache Spark is a powerful open-source distributed querying and processing engine.
• It provides the flexibility and extensibility of MapReduce, but at significantly higher
speeds: up to 100 times faster than Apache Hadoop when the data is stored in memory,
and up to 10 times faster when accessing disk.
Spark’s Features
Apache Spark achieves high performance
for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query
optimizer, and a physical execution engine.
Speed
Logistic regression in Hadoop and Spark
Spark’s Features
Write applications quickly in Java, Scala,
Python, R, and SQL. Spark offers over 80
high-level operators that make it easy to
build parallel apps. And you can use it
interactively from the Scala, Python, R, and
SQL shells.
Ease of Use
Spark's Python DataFrame API
Read JSON files with automatic schema
inference
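The code screenshot from this slide is not reproduced here; below is a minimal PySpark sketch of the same idea, reading JSON with automatic schema inference (the file path is a placeholder).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-schema-inference").getOrCreate()

# Spark infers the column names and types directly from the JSON records.
df = spark.read.json("orders.json")   # placeholder path
df.printSchema()
df.show(5)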
Spark’s Features
Combine SQL, streaming, and complex
analytics. Spark powers a stack of libraries
including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark
Streaming. You can combine these libraries
seamlessly in the same application.
Generality
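As a hedged illustration of this generality (MLlib and GraphX are left out), the sketch below mixes Spark SQL with Structured Streaming in a single application; the built-in rate source is only a synthetic test source and the window size is arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generality-demo").getOrCreate()

# Structured Streaming reuses the same DataFrame/SQL abstractions as batch jobs.
# The built-in "rate" source just emits (timestamp, value) rows for testing.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# SQL and the DataFrame API mix freely, even on a stream.
events.createOrReplaceTempView("events")
per_window = spark.sql(
    "SELECT window(timestamp, '10 seconds') AS w, COUNT(*) AS n "
    "FROM events GROUP BY window(timestamp, '10 seconds')"
)

query = per_window.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)   # wait up to ~30 seconds
query.stop()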
Spark’s Features
Spark runs on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud. It
can access diverse data sources.
Runs Everywhere
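A minimal sketch of what "runs everywhere" means in practice: the application code stays the same and only the master URL changes (the values shown are illustrative; in production the master is usually supplied via spark-submit --master).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("runs-everywhere-demo")
    # local mode here; on a cluster this would be e.g. "yarn",
    # "mesos://host:5050", or "k8s://https://host:443" instead.
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.master)   # confirms which cluster manager is in use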
Spark Execution Process
• Any Spark application spins off a single driver process (which can contain multiple jobs)
on the master node, which then directs executor processes (each containing multiple
tasks) distributed across a number of worker nodes.
• The driver process determines the number and composition of the task processes
directed to the executor nodes, based on the graph generated for the given job. Note
that any worker node can execute tasks from a number of different jobs.
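A rough sketch of how the driver side can shape those executor processes; the property names are standard Spark settings, but the values are placeholders (and spark.executor.instances only applies under resource managers such as YARN or Kubernetes).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("execution-demo")
    .config("spark.executor.instances", "6")   # number of executor processes (placeholder)
    .config("spark.executor.cores", "2")       # tasks each executor can run in parallel
    .config("spark.executor.memory", "4g")     # memory per executor (placeholder)
    .getOrCreate()
)
# Every action submitted through this session becomes a job that the driver
# splits into stages and tasks and hands to the executors.
print(spark.sparkContext.defaultParallelism)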
Resilient Distributed Dataset (RDD)
• Resilient Distributed Datasets (RDDs) are distributed collections of immutable JVM
objects that allow you to perform calculations very quickly, and they are the backbone
of Apache Spark.
• RDDs support two sets of parallel operations: transformations (which return pointers to
new RDDs) and actions (which return values to the driver after running a computation).
• RDD transformation operations are lazy in the sense that they do not compute their
results immediately. Transformations are only computed when an action is executed
and the results need to be returned to the driver.*
* An RDD is like a teenager doing chores: they won't do it until their mom starts to check
(and then they do it fast and effectively).
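A small sketch of that laziness, using made-up data: the transformations below only build a lineage graph, and nothing runs until the action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))      # no computation yet
evens = numbers.filter(lambda n: n % 2 == 0)    # transformation: still lazy
squared = evens.map(lambda n: n * n)            # transformation: still lazy

# The action finally triggers the whole chain.
print(squared.take(5))   # [0, 4, 16, 36, 64]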
RDD (cont.)
Mostly Used RDD Operations
Transformations
• .map()
• .filter()
• .flatMap()
• .distinct()
• .sample()
• .leftOuterJoin()
• .repartition()
Actions
• .take()
• .collect()
• .reduce()
• .count()
• .saveAsTextFile()
• .foreach()
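A quick, hedged tour of a few of the operations listed above, on a toy in-memory RDD (the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-demo").getOrCreate()
sc = spark.sparkContext

# .flatMap() and .distinct() are transformations (lazy);
# .count() and .collect() are actions that trigger execution.
words = sc.parallelize(["hotel flight", "flight train", "event"])
tokens = words.flatMap(lambda s: s.split(" "))
unique = tokens.distinct()
print(unique.count())             # 4
print(sorted(unique.collect()))   # ['event', 'flight', 'hotel', 'train']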
RDD (cont.)
SparkContext().textFile('order__cart.csv')
[Diagram: each line of order__cart.csv, e.g.
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
becomes one element of the resulting RDD, distributed across n partitions.
Legend: CSV line, partition, RDD.]
RDD (cont.)
[Diagram: the order__cart.csv RDD flowing through a chain of transformations]
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web
.filter(lambda line: line[1] == 'paid')
56312, paid, native_apps
56313, paid, web
56315, paid, web
.map(lambda line: (line[2], 1))
(native_apps, 1)
(web, 1)
(web, 1)
.reduceByKey(lambda x, y: x + y)
(native_apps, 1)
(web, 2)
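A runnable PySpark version of this flow, assuming the same hypothetical order__cart.csv; unlike the slide, it first splits each line into fields, since textFile() yields whole lines as strings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("order-cart-pipeline").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("order__cart.csv")
rows = lines.map(lambda line: [f.strip() for f in line.split(",")])

paid_per_channel = (
    rows.filter(lambda row: row[1] == "paid")   # keep paid orders only
        .map(lambda row: (row[2], 1))           # (channel, 1) pairs
        .reduceByKey(lambda x, y: x + y)        # sum the counts per channel
)
print(paid_per_channel.collect())   # e.g. [('native_apps', 1), ('web', 2)]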
Spark DataFrame
• A DataFrame is an immutable distributed collection of data that is organized into
named columns, analogous to a table in a relational database. Introduced as an
experimental feature within Apache Spark 1.0 as SchemaRDD, they were renamed to
DataFrames as part of the Apache Spark 1.3 release.
• Imposing a structure onto a distributed collection of data allows Spark users to query
structured data with Spark SQL or with expression methods (instead of lambdas).
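A small sketch of querying with expression methods instead of lambdas; the order data and column names are made up to mirror the RDD example.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

orders = spark.createDataFrame(
    [(56312, "paid", "native_apps"), (56313, "paid", "web"),
     (56314, "shopping_cart", "web"), (56315, "paid", "web")],
    ["order_id", "status", "channel"],
)

# Expression methods instead of lambdas: the query optimizer sees the whole plan.
(orders.filter(F.col("status") == "paid")
       .groupBy("channel")
       .agg(F.count("*").alias("paid_orders"))
       .show())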
Ways to Create DataFrame
Spark SQL can build DataFrames from a variety of sources:
• General files (without a header or schema, columns default to _c0, _c1, _c2, _c3, …)
• Parquet
• ORC files
• JSON
• Hive tables
• JDBC
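Hedged examples of a few of these sources; all paths, table names, and the JDBC URL are placeholders, not real endpoints.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sources").enableHiveSupport().getOrCreate()

csv_df  = spark.read.csv("orders.csv", inferSchema=True)   # headerless files get _c0, _c1, ... column names
json_df = spark.read.json("orders.json")
pq_df   = spark.read.parquet("orders.parquet")
orc_df  = spark.read.orc("orders.orc")
hive_df = spark.table("analytics.orders")                   # requires Hive support to be enabled
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://db-host:3306/shop")
           .option("dbtable", "orders")
           .option("user", "report")
           .option("password", "change-me")
           .load())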
Ways to Create DataFrame (cont.)
a) Traditional DataFrame creation. b) DataFrame creation with direct SQL. Both return
the same result.
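A sketch of the two styles described in the caption, assuming the same hypothetical order__cart.csv (column names are assigned manually because the file has no header):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("two-ways-demo").getOrCreate()
df = (spark.read.csv("order__cart.csv", inferSchema=True)
          .toDF("order_id", "status", "channel"))

# a) "Traditional" creation: DataFrame expression methods
a = (df.filter(F.trim(F.col("status")) == "paid")
       .groupBy("channel")
       .agg(F.count("*").alias("paid_orders")))

# b) Direct SQL over a temporary view
df.createOrReplaceTempView("order_cart")
b = spark.sql(
    "SELECT channel, COUNT(*) AS paid_orders "
    "FROM order_cart WHERE trim(status) = 'paid' GROUP BY channel"
)

a.show()   # both queries produce the same result
b.show()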
Spark Dataset
• Introduced in Apache Spark 1.6, the goal of Spark Datasets was to provide an API
that allows users to easily express transformations on domain objects, while also
providing the performance and benefits of the robust Spark SQL execution engine.
As part of the Spark 2.0 release, the DataFrame API was merged into the Dataset API,
unifying data processing capabilities across all libraries.
• Conceptually, a Spark DataFrame is an alias for a collection of generic objects,
Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast,
is a collection of strongly typed JVM objects, dictated by a case class in Scala or a
class in Java.
Performance Benchmark (kind of)
[Chart: calculating quarterly gross revenue, execution time in seconds (lower is better),
comparing MySQL*, MapReduce, Spark RDD, Spark DataFrame, and Spark DataFrame (direct SQL).]
Year  Quarter  B2C Gross rev.
2016  1          419,563,291,996
2016  2          574,505,787,224
2016  3          537,110,941,199
2016  4          639,459,753,264
2017  1          456,482,358,961
2017  2          587,207,246,225
2017  3          660,881,531,765
2017  4          742,243,815,992
2018  1        1,124,567,178,623
• Spark cluster: 6 worker nodes, 2 vCPUs (12 YARN cores), 13 GB memory (62.4 GB YARN memory)
• * MySQL: single node, 32 vCPUs, 120 GB memory
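For reference, a hedged sketch of how such a quarterly gross-revenue query might look with the DataFrame API; the bookings table, its path, and its column names (order_date, gross_revenue) are assumptions, not the schema actually used in this benchmark.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quarterly-gross-rev").getOrCreate()

# Hypothetical bookings data; the real schema behind the benchmark is not shown in the deck.
bookings = spark.read.parquet("bookings.parquet")

quarterly = (
    bookings
    .withColumn("year", F.year("order_date"))
    .withColumn("quarter", F.quarter("order_date"))
    .groupBy("year", "quarter")
    .agg(F.sum("gross_revenue").alias("b2c_gross_rev"))
    .orderBy("year", "quarter")
)
quarterly.show()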

  • 16. Performance Benchmark (kind of) 0 200 400 600 800 MySQL* MapReduce Spark RDD Spark DataFrame Spark DataFrame(direct sql) Calculate Quarterly Gross rev. Execution time in seconds (lower is better) Year Quarter B2C Gross rev. 2016 1 419,563,291,996 2016 2 574,505,787,224 2016 3 537,110,941,199 2016 4 639,459,753,264 2017 1 456,482,358,961 2017 2 587,207,246,225 2017 3 660,881,531,765 2017 4 742,243,815,992 2018 1 1,124,567,178,623 • 6 worker nodes, 2vcpus(12 YARN cores), 13GB memory (62.4GB YARN memory) • *single node, 32vcpus, 120GB memory