Learning Apache Spark With Python

Learning Apache Spark with Python
22 Chapter 3. Configure Running Platform

unzip sparkling-water-2.4.5.zip
cd ~/sparkling-water-2.4.5/bin
./pysparkling
If you have a correct setup for PySpark, then you will get the following results:
Using Spark defined in the SPARK_HOME=/Users/dt216661/spark environmental

˓→property
Python 3.7.1 (default, Dec 14 2018, 13:28:58)

[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2019-02-15 14:08:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop
˓→library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.
˓→properties
Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
˓→setLogLevel(newLevel).
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port
˓→4040. Attempting port 4041.
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port
˓→4041. Attempting port 4042.
17/08/30 13:30:12 WARN NativeCodeLoader: Unable to load native-hadoop
(continues on next page)
3.5. PySparkling Water: Spark + H2O 23

(continued from previous page)

library for your platform... using builtin-java classes where applicable
17/08/30 13:30:17 WARN ObjectStore: Failed to get database global_temp,
returning NoSuchObjectException
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)

SparkSession available as 'spark'.
3. Setup pysparkling with Jupyter notebook

Add the following alias to your bashrc (Linux systems) or bash_profile (Mac system)
alias sparkling="PYSPARK_DRIVER_PYTHON="ipython" PYSPARK_DRIVER_PYTHON_OPTS=

˓→ "notebook" /~/~/sparkling-water-2.4.5/bin/pysparkling"
4. Open pysparkling in terminal
sparkling
3.6 Set up Spark on Cloud
Following the setup steps in Configure Spark on Mac and Ubuntu, you can set up your own cluster on the
cloud, for example AWS, Google Cloud. Actually, for those clouds, they have their own Big Data tool. Yon
can run them directly whitout any setting just like Databricks Community Cloud. If you want more details,
please feel free to contact with me.
3.7 Demo Code in this Section
The code for this section is available for download test_pyspark, and the Jupyter notebook can be download
from test_pyspark_ipynb.
• Python Source code
## set up SparkSession
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

(continued from previous page)

df = spark.read.format('com.databricks.spark.csv').\
options(header='true', \
inferschema='true').\
load("/home/feng/Spark/Code/data/Advertising.csv
˓→",header=True)
df.show(5)
df.printSchema()
3.7. Demo Code in this Section 25


CHAPTER
FOUR
AN INTRODUCTION TO APACHE SPARK
Chinese proverb
Know yourself and know your enemy, and you will never be defeated – idiom, from Sunzi’s Art of War
4.1 Core Concepts
Most of the following content comes from [Kirillov2016]. So the copyright belongs to Anton Kirillov. I
will refer you to get more details from Apache Spark core concepts, architecture and internals.
Before diving deep into how Apache Spark works, lets understand the jargon of Apache Spark
• Job: A piece of code which reads some input from HDFS or local, performs some computation on the
data and writes some output data.
• Stages: Jobs are divided into stages. Stages are classified as a Map or reduce stages (Its easier to
understand if you have worked on Hadoop and want to correlate). Stages are divided based on com-
putational boundaries, all computations (operators) cannot be Updated in a single Stage. It happens
over many stages.
• Tasks: Each stage has some tasks, one task per partition. One task is executed on one partition of data
on one executor (machine).
• DAG: DAG stands for Directed Acyclic Graph, in the present context its a DAG of operators.
• Executor: The process responsible for executing a task.
• Master: The machine on which the Driver program runs
• Slave: The machine on which the Executor program runs
4.2 Spark Components
1. Spark Driver
27
28 Chapter 4. An Introduction to Apache Spark

• separate process to execute user applications

• creates SparkContext to schedule jobs execution and negotiate with cluster manager
2. Executors
• run tasks scheduled by driver
• store computation results in memory, on disk or off-heap
• interact with storage systems
3. Cluster Manager
• Mesos
• YARN
• Spark Standalone
Spark Driver contains more components responsible for translation of user code into actual jobs executed
on cluster:
• SparkContext
– represents the connection to a Spark cluster, and can be used to create RDDs, accu-
mulators and broadcast variables on that cluster
• DAGScheduler
– computes a DAG of stages for each job and submits them to TaskScheduler deter-
mines preferred locations for tasks (based on cache status or shuffle files locations)
and finds minimum schedule to run the jobs
• TaskScheduler
– responsible for sending tasks to the cluster, running them, retrying if there are failures,
and mitigating stragglers
• SchedulerBackend
4.2. Spark Components 29

– backend interface for scheduling systems that allows plugging in different implemen-
tations(Mesos, YARN, Standalone, local)
• BlockManager
– provides interfaces for putting and retrieving blocks both locally and remotely into
various stores (memory, disk, and off-heap)
4.3 Architecture
4.4 How Spark Works?
Spark has a small code base and the system is divided in various layers. Each layer has some responsibilities.
The layers are independent of each other.
The first layer is the interpreter, Spark uses a Scala interpreter, with some modifications. As you enter
your code in spark console (creating RDD’s and applying operators), Spark creates a operator graph. When
the user runs an action (like collect), the Graph is submitted to a DAG Scheduler. The DAG scheduler
divides operator graph into (map and reduce) stages. A stage is comprised of tasks based on partitions of
the input data. The DAG scheduler pipelines operators together to optimize the graph. For e.g. Many map
operators can be scheduled in a single stage. This optimization is key to Sparks performance. The final
result of a DAG scheduler is a set of stages. The stages are passed on to the Task Scheduler. The task
scheduler launches tasks via cluster manager. (Spark Standalone/Yarn/Mesos). The task scheduler doesn’t
know about dependencies among stages.
30 Chapter 4. An Introduction to Apache Spark

CHAPTER
FIVE
PROGRAMMING WITH RDDS
Chinese proverb
If you only know yourself, but not your opponent, you may win or may lose. If you know neither
yourself nor your enemy, you will always endanger yourself – idiom, from Sunzi’s Art of War
RDD represents Resilient Distributed Dataset. An RDD in Spark is simply an immutable distributed
collection of objects sets. Each RDD is split into multiple partitions (similar pattern with smaller sets),
which may be computed on different nodes of the cluster.
5.1 Create RDD
Usually, there are two popular way to create the RDDs: loading an external dataset, or distributing a set
of collection of objects. The following examples show some simplest ways to create RDDs by using
parallelize() fucntion which takes an already existing collection in your program and pass the same
to the Spark Context.
1. By using parallelize( ) fucntion
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),

(4, 5, 6, 'd e f'),
(7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3','col4'])
Then you will get the RDD data:
df.show()
+----+----+----+-----+
31

Learning Apache Spark With Python

Uploaded by

Copyright:

Available Formats

Learning Apache Spark With Python

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Learning Apache Spark With Python

Uploaded by

Copyright:

Available Formats

What are RDDs and how are they created in Spark?

What are RDDs and how are they created in Spark?

What are the different stages in Spark job execution?

What are the different stages in Spark job execution?

Learning Apache Spark with Python

22 Chapter 3. Configure Running Platform

Using Spark defined in the SPARK_HOME=/Users/dt216661/spark environmental

Python 3.7.1 (default, Dec 14 2018, 13:28:58)

Setting default log level to "WARN".

3.5. PySparkling Water: Spark + H2O 23

(continued from previous page)

Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)

3. Setup pysparkling with Jupyter notebook

alias sparkling="PYSPARK_DRIVER_PYTHON="ipython" PYSPARK_DRIVER_PYTHON_OPTS=

4. Open pysparkling in terminal

3.6 Set up Spark on Cloud

3.7 Demo Code in this Section

(continues on next page)

24 Chapter 3. Configure Running Platform

(continued from previous page)

3.7. Demo Code in this Section 25

26 Chapter 3. Configure Running Platform

AN INTRODUCTION TO APACHE SPARK

4.1 Core Concepts

4.2 Spark Components

28 Chapter 4. An Introduction to Apache Spark

• separate process to execute user applications

4.2. Spark Components 29

4.4 How Spark Works?

30 Chapter 4. An Introduction to Apache Spark

PROGRAMMING WITH RDDS

5.1 Create RDD

from pyspark.sql import SparkSession

df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),

Then you will get the RDD data:

You might also like