Spark-Tutorial - IV - Python
1. Prerequisite ............................................................................................ 3
2. Exercise 1: Spark Installation: Windows: .............................................. 4
3. Install Spark in centos Linux – 60 Minutes(D) ...................................... 9
4. Analysis Using Spark- Overview......................................................... 12
5. Exercise : Spark Installation - Redhat Linux. .................................... 15
6. Exploring DataFrames using pyspark – 35 Minutes ............................ 19
7. Working with DataFrames and Schemas – 30 Minutes ....................... 23
8. Analyzing Data with DataFrame Queries – 40 Minutes ...................... 25
9. Interactive Analysis with pySpark – 30 Minutes ................................. 31
10. Transformation and Action – RDD – 45 Minutes ............................... 33
11. Work with Pair RDD – 30 Minutes ................................................... 38
12. Create & Explore PairRDD – 60 Minutes ...................................... 42
13. Spark SQL – 30 Minutes...................................................................... 63
14. Applications ( Transformation ) – Python – 30 Minutes ..................... 76
15. Zeppelin Installation ............................................................................ 78
16. Using the Python Interpreter ................................................................ 91
18. PySpark-Example – Zeppelin. ............................................................. 93
19. Installing Jupyter for Spark – 35 Minutes.(D) ..................................... 95
20. Spark Standalone Cluster(VM) – 60 Minutes. ................................... 108
21. Spark Two Nodes Cluster Using Docker – 90 Minutes ..................... 120
22. Launching on a Cluster: Hadoop YARN – 150 Minutes ................... 138
23. Jobs Monitoring : Using Web UI. – 45 Minutes ................................ 170
24. Spark Streaming With Socket – 45 Minutes ..................................... 182
25. Spark Streaming With Kafka – 60 Minutes ...................................... 187
26. Streaming with State Operations- Python – 60 Minutes .................... 198
27. Spark – Hive Integration – 45 Minutes .............................................. 202
28. Spark integration with Hive Cluster (Local Mode) ........................... 209
29. Annexure & Errata: ............................................................................ 210
Unable to start spark shell with the following error ................................ 210
issue: Cannot assign requested address ................................................... 211
Unable to start spark shell: ...................................................................... 212
Caused by: java.io.IOException: Error accessing /opt/jdk/jre/lib/ext/._cld*.jar .............. 212
Resolve this by removing the hidden ._* files (list them with #ls -alt and delete them) ...... 212
Yum repo config ...................................................................................... 212
Using Docker:
Create the first container:
#docker run -it --name spark0 --hostname spark0 --privileged --network spark-net \
  -v /Volumes/Samsung_T5/software/:/Software -v /Volumes/Samsung_T5/software/install/:/opt \
  -v /Volumes/Samsung_T5/software/data/:/data \
  -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 -p 8888:8888 \
  centos:7 /usr/sbin/init
Apache Spark – Version 3.0.1 – Prebuilt for Apache Hadoop 3.2 and later.
File : spark-3.0.1-bin-hadoop3.2.tgz / spark-3.2.1-bin-hadoop3.2.tgz
Url : https://spark.apache.org/downloads.html
You need Java installed and on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
sbt-0.13.8.msi
Untar scala-2.11.6.
Set the SCALA_HOME environment variable and add Scala's bin directory to the PATH variable.
#vi ~/.bashrc
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:/opt/spark/bin
# pyspark
Enter the following command in the pyspark console to count the occurrences of each character.
# In the first two lines we import the Python libraries.
# Here we use the object sc; sc is the SparkContext object that pyspark creates before
# showing the console.
# The parallelize() function is used to create an RDD from a String. RDD stands for Resilient
# Distributed Dataset, a data set distributed across the Spark cluster; RDD processing is done
# on the distributed Spark cluster.
# With the following example we count the number of characters and print the result on the console.
from operator import add   # reduceByKey uses the add function

# data is assumed to be the RDD created above with sc.parallelize() from a string.
counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()
http://ht:4040/jobs/
I would like to access the columns by name, so let us initialize the RDD with schema columns.
%pyspark
from pyspark.sql import Row   # Row lets us attach field names

# rdd is assumed to be the RDD of raw access-log lines loaded earlier with sc.textFile().
parts = rdd.map(lambda l: l.split(" "))
log = parts.map(lambda p: Row(ip=p[0], ts=p[3], atype=p[5], url=p[6], code=p[8], byte=p[9]))
Have I split the record fields correctly? Let us review the transformation
%pyspark
parts.collect()
[['64.242.88.10', '-', '-', '[07/Mar/2004:16:05:49', '-0800]', '"GET',
  '/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables',
  'HTTP/1.1"', '401', '12846'],
 ['64.242.88.10', '-', '-', '[07/Mar/2004:16:06:51', '-0800]', '"GET',
  '/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2', 'HTTP/1.1"', '200', '4523'],
 ['94.242.88.10', '-', '-', '[07/Mar/2004:16:30:29', '-0800]', '"GET',
  '/twiki/bin/attach/Main/OfficeLocations', 'HTTP/1.1"', '401', '12851']]
I would like to use the Dataframe API. Let us convert from RDD to DF
%pyspark
# Infer the schema, and register the DataFrame as a table.
mylog = spark.createDataFrame(log)
Now that we have the data in Spark, let us answer the queries we defined in the first paragraphs.
Create a folder /spark and untar all the related software inside this folder only, so that the entire installation lives in a single folder for better manageability.
#mkdir /spark
Untar scala-2.11.6
Set the path and variable as follows: We are including Scala home and bin folder in the
path variable.
vi ~/.bashrc
export SCALA_HOME=/spark/scala-2.11.6
export PATH=$PATH:$SCALA_HOME/bin
Type bash
Install the JDK as follows and set the PATH and environment variables as shown (choose 32-bit or 64-bit depending on your server):
export JAVA_HOME=/spark/jdk1.8.0_45
export PATH=$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin
Rename the installation folder as shown below. You need to change to the installation directory before issuing this command.
Enter the following command in the Scala console to create a data set of the integers 1…10000:
val data = 1 to 10000
http://ht:4040/jobs/
6. Exploring DataFrames using pyspark – 35 Minutes
Following features of Spark will be demonstrated here:
#pyspark
Create a text file users.json in the data folder, containing the sample data listed below:
{"name":"Alice", "pcode":"94304"}
{"name":"Brayden", "age":30, "pcode":"94304"}
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46}
{"name":"Etienne", "pcode":"94104"}
Initiate pyspark from the folder in which you created the above file.
Three fields will be displayed, according to the JSON fields specified in the text file.
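A minimal sketch of the load step, assuming users.json is in the directory from which pyspark was started:
# Spark infers the schema from the JSON fields.
usersDF = spark.read.json("users.json")
usersDF.show()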
Out of the three fields, we are interested only in the name and age fields. So let us create a DataFrame with just these two fields and apply a filter expression so that only persons older than 20 years remain in the DataFrame.
nameAgeDF = usersDF.select("name","age")
nameAgeOver20DF = nameAgeDF.where("age > 20")
nameAgeOver20DF.show()
usersDF.select("name","age").where("age > 20").show()
You can also chain the functions as shown above; you will get the same result.
Zeppelin Output.
----------------------------------- Lab Ends Here---------------------------------------------------
7. Working with DataFrames and Schemas – 30 Minutes
You will understand the following:
pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48
usersDF.printSchema()
As shown above, age is an integer here; a column is mapped to String by default when it is not defined in a structure schema.
nameAgeDF = usersDF.select("firstname","age")
nameAgeDF.show()
You can save the dataframe consisting of First Name and age to a file.
# nameAgeDF.write.json("age.json")
Open a terminal and verify the file. Spark creates a folder named age.json, and the output files are inside it.
/software/people.csv
pcode,lastName,firstName,age
02134,Hopper,Grace,52
94020,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48
Load the people data and fetch the age column in the following ways:
peopleDF = spark.read.option("header","true").csv("people.csv")
peopleDF["age"]
peopleDF.age
peopleDF.select(peopleDF["age"]).show()
Manipulate the age column, i.e. multiply age by 10.
peopleDF.select("lastName",(peopleDF.age * 10).alias("age_10")).show()
Perform aggregation:
peopleDF.groupBy("pcode").count().show()
Next let us join two dataframes:
/software/people-no-pcode.csv
pcode,lastName,firstName,age
02134,Hopper,Grace,52
,Turing,Alan,32
94020,Lovelace,Ada,28
87501,Babbage,Charles,49
02134,Wirth,Niklaus,48
/software/pcodes.csv
pcode,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL
Load the people and code files in DF.
peopleDF = spark.read.option("header","true").csv("people-no-pcode.csv")
pcodesDF = spark.read.option("header","true").csv("pcodes.csv")
peopleDF.join(pcodesDF, "pcode").show()
Perform the Left outer join
You can see null value in the pcode of the second row.
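A minimal sketch of the left outer join, using the DataFrames loaded above:
peopleDF.join(pcodesDF, "pcode", "left_outer").show()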
Joining on Columns with Different Names
/software/zcodes.csv
zip,city,state
02134,Boston,MA
94020,Palo Alto,CA
87501,Santa Fe,NM
60645,Chicago,IL
Join on the pcode column of the first file and the zip column of the second file.
zcodesDF = spark.read.option("header","true").csv("zcodes.csv")
peopleDF.join(zcodesDF, peopleDF.pcode == zcodesDF.zip).show()
myData = ["Alice","Carlos","Frank","Barbara"]
myRDD = sc.parallelize(myData)
Create a new RDD by combining a second RDD with the previous one (a sketch of the second RDD follows).
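The second RDD is assumed to be built from another small list of names (hypothetical sample values):
myData1 = ["Henry", "Grace", "Niklaus"]   # hypothetical sample values
myRDD1 = sc.parallelize(myData1)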
myNames = myRDD.union(myRDD1)
textFile = sc.textFile("README.md")
Let’s say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)
One common data flow pattern is MapReduce, as popularized by
Hadoop. Spark can implement MapReduce flows easily:
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Here, we combined
the flatMap, map and reduceByKey transformations to compute the
per-word counts in the file as an RDD of (String, Int) pairs. To
collect the word counts in our shell, we can use the collect action:
wordCounts.collect()
let’s mark our linesWithSpark dataset to be cached:
linesWithSpark.cache()
linesWithSpark.count()
In this activity, you will load SFPD data from a CSV file. You will create pair RDD and apply pair RDD
operations to explore the data.
Scenario
Our dataset is a .csv file that consists of SFPD incident data from SF OpenData
(https://data.sfgov.org/). For each incident, we have the following information:
The dataset has been modified to decrease the size of the files and also to make it easier to use. We use
this same dataset for all the labs in this course.
Objectives
1. To launch the Interactive Shell, at the command line, run the following command:
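The command is presumably pyspark, started in local mode as in the note later in this exercise:
#pyspark --master local[*]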
To load the data we are going to use the SparkContext method textFile. The SparkContext is
available in the interactive shell as the variable sc. We also want to split the file by the separator “,”.
1. We define the mapping for our input variables. While this isn’t a necessary step, it makes it easier to refer to
the different fields by names.
IncidntNum = 0
Category = 1
Descript = 2
DayOfWeek = 3
Date = 4
Time = 5
PdDistrict = 6
Resolution = 7
Address = 8
X = 9
Y = 10
PdId = 11
sfpdRDD = sc.textFile("sfpd.csv").map(lambda line: line.split(","))
sfpdRDD.first()
sfpdRDD.take(5)
3. What is the total number of incidents?
totincs = sfpdRDD.count()
print(totincs)
_____________________________________________________________________________
In the previous activity we explored the data in the sfpdRDD. We used RDD operations. In
this activity, we will create pairRDD to find answers to questions about the data.
Objectives
Start the Spark shell if you have not already done so; otherwise skip the next command.
#pyspark
Answer:
Add a filter to validate proper records, i.e. each record should have exactly 14 fields (a sketch follows).
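A minimal sketch of such a filter, assuming sfpdRDD is rebuilt from the CSV file:
sfpdRDD = sc.textFile("sfpd.csv").map(lambda line: line.split(",")).filter(lambda rec: len(rec) == 14)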
2. Which five addresses have the highest incidents?
   a. Create a pair RDD (map)
   b. Get the count for each key (reduceByKey)
   c. Create a pair RDD with key and count switched (map)
   d. Sort in descending order (sortByKey)
Answer:
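A sketch following the steps above (Address is field index 8, as defined earlier):
addCounts = sfpdRDD.map(lambda incident: (incident[Address], 1)).reduceByKey(lambda x, y: x + y)
top5Adds = addCounts.map(lambda pair: (pair[1], pair[0])).sortByKey(ascending=False).take(5)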
top5Adds
Answer:
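Judging from the variable name, this answer computes the three categories with the most incidents; a sketch (Category is field index 1):
top3Cat = sfpdRDD.map(lambda incident: (incident[Category], 1)).reduceByKey(lambda x, y: x + y) \
    .map(lambda pair: (pair[1], pair[0])).sortByKey(ascending=False).take(3)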
top3Cat
Answer:
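Judging from the variable name, this answer computes the number of incidents per district; a sketch (PdDistrict is field index 6):
num_inc_dist = sfpdRDD.map(lambda incident: (incident[PdDistrict], 1)).reduceByKey(lambda x, y: x + y)
num_inc_dist.collect()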
num_inc_dist
This activity illustrates how joins work in Spark (python). There are two small datasets
provided for this activity - J_AddCat.csv and J_AddDist.csv.
1. Given these two datasets, you want to find the type of incident and district for each address. What
is one way of doing this? (HINT: An operation on pairs or pairRDDs)
[A join is the same as an inner join and only keys that are present in both RDDs are output. If you
compare the addresses in both the datasets, you find that there are five addresses in common and
they are unique. Thus the resulting dataset will contain five elements. If there are multiple values for
the same key, the resulting RDD will have an entry for every possible pair of values with that key from
the two RDDs.]
3. If you did a right outer join on the two datasets with Address/Category being the source RDD, what would be the size of the resulting RDD?
[A right outer join results in a pair RDD that has entries for each key in the other pairRDD. If the
source RDD contains data from J_AddCat.csv and the “other” RDD is represented by J_AddDist.csv,
then since “other” RDD has 9 distinct addresses, the size of the result of a right outer join is 9.]
4. If you did a left outer join on the two datasets with Address/Category being the source RDD, what would be the size of the resulting RDD?
[A left outer join results in a pair RDD that has entries for each key in the source pairRDD. If the source RDD contains data from J_AddCat.csv and the “other” RDD is represented by J_AddDist.csv, then since the “source” RDD has 13 distinct addresses, the size of the result of a left outer join is 13.]
5. Load each dataset into separate pairRDDs with “address” being the key.
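A sketch of the load, assuming each CSV has the category or district in the first column and the address in the second (adjust the indices to the actual files):
catAdd = sc.textFile("J_AddCat.csv").map(lambda line: line.split(",")).map(lambda rec: (rec[1], rec[0]))
distAdd = sc.textFile("J_AddDist.csv").map(lambda line: line.split(",")).map(lambda rec: (rec[1], rec[0]))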
6. List the incident category and district for those addresses that have both category and district
information. Verify that the size estimated earlier is correct.
catJdist = catAdd.join(distAdd)
catJdist.collect()
catJdist.count()
catJdist.take(5)
7. List the incident category and district for all addresses irrespective of whether each address has
category and district information.
catJdist1 = catAdd.leftOuterJoin(distAdd)
catJdist1.collect()
catJdist1.count()
8. List the incident district and category for all addresses irrespective of whether each address has
category and district information. Verify that the size estimated earlier is correct.
catJdist2 = catAdd.rightOuterJoin(distAdd)
catJdist2.collect()
catJdist2.count()
Explore Partitioning
In this activity we see how to determine the number of partitions, the type of partitioner and how
to specify partitions in a transformation.
Objective
Note: Ensure that you have started the spark shell with --master local[*] ( Refer the First Lab)
The above result will depend on the number of cores; in my case it is 2 cores, so 2.
If there is no partitioner, the partitioning is not based on a characteristic of the data; the distribution is random and uniform across nodes.
3. Create a pair RDD keeping only records whose length is 14, i.e. records with exactly 14 fields (a sketch follows).
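A sketch consistent with the variable name used below, counting incidents by district (PdDistrict is field index 6):
incByDists = sfpdRDD.filter(lambda incident: len(incident) == 14) \
    .map(lambda incident: (incident[PdDistrict], 1)).reduceByKey(lambda x, y: x + y)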
incByDists.getNumPartitions()
4. Add a map
inc_map = incByDists.map(lambda pair: (pair[1], pair[0]))   # swap key and value (Python 3 lambdas cannot unpack tuples)
How many partitions does inc_map have? inc_map.getNumPartitions()
5. Add groupByKey
inc_group = sfpdRDD.map(lambda incident: (incident[6],1)).groupByKey()
What type of partitioner does inc_group have? inc_group.partitioner
7. You can specify the number of partitions when you use the join operation.
catJdist = catAdd.join(distAdd,8)
How many partitions does the joined RDD have? catJdist.getNumPartitions()
Finally, let us perform the word count application using the Pair data model.
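A minimal sketch of the word count using the pair data model, assuming a plain text file such as README.md:
words = sc.textFile("README.md").flatMap(lambda line: line.split())
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
wordCounts.take(10)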
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as
kwargs to the Row class.
Henry,42
Rajnita,40
Henderson,14
Tiraj,5
Open a terminal in the /software folder and enter the following command.
Using Spark SQL – RDD data structure. (Inferring the Schema Using Reflection)
#pyspark
peopleL = sc.textFile("person.txt")
peopleL.take(4)
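A minimal sketch of the reflection step, building Row objects from the name,age lines and registering the DataFrame as the people table used below:
from pyspark.sql import Row
people = peopleL.map(lambda line: line.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
peopleDF = spark.createDataFrame(people)
peopleDF.createOrReplaceTempView("people")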
# SQL statements can be run using the sql method; SQL can be run over DataFrames that have
# been registered as a table.
kids = spark.sql("SELECT name FROM people WHERE age >= 1 AND age <= 9")
# The results of SQL queries are DataFrame objects and support all the normal operations.
# .rdd returns the content as a pyspark.RDD of Row objects.
kidNames = kids.rdd.map(lambda p: "Name: " + p.name).collect()
for name in kidNames:
    print(name)
spark.catalog.listTables('default')
or
/software/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
peopleDF = spark.read.format("json").load("people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
peopleDF.select("name","age").show
sqlDF.show()
In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']).
"""SimpleApp.py"""
from pyspark import SparkContext
sc.stop()
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME/logFile with the location where Spark is installed on your machine.
# Use spark-submit to run your application, change directory to /spark before executing the following command
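A minimal sketch of the script body and the submit command, consistent with the description above (the log-file path is a placeholder):
"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"   # placeholder: point this at a text file on your machine
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
sc.stop()
Submit it with, for example:
#spark-submit SimpleApp.py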
You can view the job status at the following URL:
http://192.168.150.128:4040/jobs/
In this way, you can deploy any Python application on the Spark cluster.
You do not require a Spark installation if you are using zeppelin-0.6.2-bin-all.tgz.
Create a zeppelin user and switch to it; if the zeppelin user has already been created, simply log in as zeppelin.
The password should be life213.
Use root credentials
groupadd hadoop
useradd -g hadoop zeppelin
passwd zeppelin
su - zeppelin
whoami
Start Zeppelin
cd /spark/zeppelin-0.5.6-incubating-bin-all
bin/zeppelin-daemon.sh start
jps
Note: You are not required to start the Spark cluster before starting Zeppelin.
http://master:8081/#/
Stop Zeppelin
bin/zeppelin-daemon.sh stop
Let us connect Zeppelin with an existing Spark cluster. You can start Spark as a single standalone cluster or use any existing Spark cluster.
Unzip the Derby database and start it as shown below. You can use the zeppelin user account to start Derby.
After the Derby server is running, you need to point Spark at this database server. To do so, create a file hive-site.xml in the SPARK_HOME/conf folder and paste in the following content:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://10.10.20.27:1527/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
</property>
</configuration>
Remove the Derby jar from the Spark installation's jars folder and replace it with the Derby client jar.
At the end of the above step you should be able to access the Spark UI
http://10.10.20.27:8081/
Now we have configured the Spark standalone cluster. We need at least one worker node, as shown below. Add hp.tos.com to the slaves file and start the worker with the following command:
# sh start-slave.sh spark://hp.tos.com:7077
http://10.10.20.27:8080/
Stop Zeppelin if you have not already done so. Use the zeppelin user ID to execute the following command:
#bin/zeppelin-daemon.sh stop
Open the zeppelin-env.sh file in the $ZEPPELIN_HOME/conf directory and provide the configurations specified below.
export MASTER=spark://hp.tos.com:7077
export SPARK_HOME=/apps/spark-2.2.0
#cd /apps/zeppelin-0.7.3/conf
Now open your Zeppelin dashboard, go to the list of interpreters and search for the Spark interpreter. Ensure that you modify the parameters below; you need to log on using the admin/life213 credentials.
Create a New Notebook [Notebook → Create New Note → Enter Hello Spark as the name].
Enter sc and then click Run, which is in the right corner. If everything is OK, you should see output like the above screenshot; otherwise review the log file in the Zeppelin installation folder.
You should set your PYSPARK_DRIVER_PYTHON environment variable so that Spark uses Anaconda. You can
get more information here:
https://spark.apache.org/docs/1.6.2/programming-guide.html
Export SPARK_HOME
In conf/zeppelin-env.sh, export SPARK_HOME environment variable with your Spark installation path.
For example,
export SPARK_HOME=/spark/spark-2.1.0
Install Anaconda2 and add it to the PATH variable of the root login (vi ~/.bashrc):
export PATH="/spark/anaconda2/bin:$PATH"
%pyspark
# words is assumed to be an RDD of text lines loaded earlier (e.g. with sc.textFile).
words.flatMap(lambda x: x.lower().split(' ')) \
    .filter(lambda x: x.isalpha()).map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b)
%pyspark
words.first()
Using pip
# yum install python3-pip -y
#pip3 install --upgrade setuptools
#pip3 install --upgrade pip
#pip3 install jupyterlab
Update the Jupyter configuration file (typically ~/.jupyter/jupyter_notebook_config.py, generated with jupyter notebook --generate-config) with the following settings:
c.NotebookApp.ip = 'spark0'
c.NotebookApp.open_browser = False
c.NotebookApp.allow_remote_access = True
c.NotebookApp.allow_root = True
#jupyter notebook
Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
$ pyspark
Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on
‘New’ > ‘Notebooks Python [default]’.
Copy and paste Pi calculation script and run it by pressing Shift + Enter.
# Code Begin
import random

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Count how many random points fall inside the unit circle.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = 4 * count / num_samples
print(pi)
sc.stop()
# Code Ends
Anaconda 4.2.0
For Linux
Anaconda is BSD licensed which gives you permission to use Anaconda commercially and for redistribution.
Changelog
Download the installer
Optional: Verify data integrity with MD5 or SHA-256.
In your terminal window type one of the below and follow the instructions:
Python 3.5 version
bash Anaconda3-4.2.0-Linux-x86_64.sh
Python 2.7 version
bash Anaconda2-4.2.0-Linux-x86_64.sh
NOTE: Include the "bash" command even if you are not using the bash shell.
#bash
Anaconda 4.2.0
For Windows
Anaconda is BSD licensed which gives you permission to use Anaconda commercially and for redistribution.
Changelog
Download the installer
Optional: Verify data integrity with MD5 or SHA-256.
Double-click the .exe file to install Anaconda and follow the instructions on the screen
Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific
computing and data science.
Congratulations, you have installed Jupyter Notebook.
Reference:
https://www.sicara.ai/blog/2017-05-02-get-started-pyspark-jupyter-notebook-3-minutes
https://community.hortonworks.com/articles/75551/installing-and-exploring-spark-20-with-jupyter-not.html
https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b
Create another instance of the VM by copying the VM folder or by making a clone; you need to shut down the VM first.
If you use a clone, specify it as shown below. Hint: select the parent virtual machine and choose VM > Manage > Clone.
Add another node, i.e. a slave node, to the cluster. Change the hostname of the slave node to slave.
Log on to the slave VM. Your slave node should look as shown below.
Ensure that you provide entries in /etc/hosts of both VMs so that they can communicate with each other using hostnames.
192.168.188.178 master
192.168.188.174 slave
Logon to the first VM i.e master to configure password less connection between the nodes.
SSH access
The root user on the master must be able to connect
• to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and
• to the root user account on the slave via a password-less SSH login.
You have to add the root@master‘s public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user’s$HOME/.ssh/authorized_keys).
The following steps will ensure that root user can ssh to its own account without password in all nodes.
Let us generate the keys for user root on the master node, to make sure that root user on master can ssh to slave
nodes without password.
$ su - root
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
Accept it, yes and supply the root password for the last time.
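The copy command referred to below is presumably ssh-copy-id, for example:
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave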
This command will prompt you for the login password for user root on slave, then copy the public SSH key for
you, creating the correct directory and fixing the permissions as necessary.
The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. This step is also needed to save slave‘s host key fingerprint to the root@master‘s known_hosts file.
Go to SPARK_HOME/conf/ and create a new file named spark-env.sh on the master node.
There is a spark-env.sh.template in the same folder, which explains how to declare the various environment variables.
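A minimal sketch of spark-env.sh for this two-node setup (the values are examples; adjust them to your environment):
export SPARK_MASTER_HOST=master
export SPARK_WORKER_MEMORY=1g   # example value; size it to your VM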
#sh sbin/start-all.sh
http://master:8080/
You can access the Spark Master with the above URL.
Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers
to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which
is http://localhost:8080 by default.
Once you have started a worker, look at the master’s web UI (http://localhost:8080 by default). You should see the
new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
In our case its having 2 nodes.
Then you can verify the Node using the master UI:
You can run this from any client/server node, e.g. a Windows desktop client.
To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.
To run an interactive Spark shell against the cluster (specify the master IP), run the following command from the bin folder.
We are executing it from the slave node.
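Assuming the master runs on the host named master with the default port, the command would be:
#./pyspark --master spark://master:7077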
Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line : "Spark" in line).count()   # How many lines contain "Spark"?
Let’s say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)
- Install Docker
- Pull Centos Image.
#docker run -it --name spark0 --hostname spark0 --privileged \
  -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init

# docker run -it --name spark0 --hostname spark0 --privileged --network spark-net \
  -v /Volumes/Samsung_T5/software/:/Software -v /Volumes/Samsung_T5/software/install/:/opt \
  -v /Volumes/Samsung_T5/software/data/:/data \
  -p 8080:8080 -p 7077:7077 -p 4040:4040 -p 8081:8081 -p 8090:8090 centos:7 /usr/sbin/init
To connect a running container to an existing user-defined bridge, use the docker network connect command. If spark0 was already started without the network attached, execute the following; otherwise skip this step.
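A sketch of that command for this setup:
#docker network connect spark-net spark0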
Start the second node and perform the installation as specified in the first lab.
#docker run -dit --name spark1 -p 8082:8081 -p 4041:4040 --network spark-net --entrypoint /bin/bash centos:7
or
# docker run -it --name spark1 --hostname spark1 --privileged --network spark-net \
  -v /Users/henrypotsangbam/Documents/Docker:/opt -p 4041:4040 -p 8082:8081 centos:7 /usr/sbin/init
Inspect the network and find the IP addresses of the two containers
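For example:
#docker network inspect spark-net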
Attach to the spark-master container and test its communication with the spark-worker container, first using its IP address and then using its container name.
# ping spark0
Architecture
Change the master port to 8090; there is an issue when it runs on 7077 in the Docker environment.
#export PATH=$PATH:/opt/spark/bin
#export SPARK_MASTER_PORT=8090
#spark-class org.apache.spark.deploy.master.Master
Attach a worker node to the cluster by executing the following in the spark0 container.
#export PATH=$PATH:/opt/spark/bin
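The worker is started with the Worker class, pointing at the master started above on port 8090:
#spark-class org.apache.spark.deploy.worker.Worker spark://spark0:8090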
If you are unable to connect to localhost, replace it with the container IP or the container alias, i.e. spark0.
Refresh the web UI and ensure that you can see a worker as shown below.
http://127.0.0.1:8080
At the end of this step, you should have 2 worker nodes as shown below:
You can run this from any client/server node, e.g. a Windows desktop client.
To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.
To run an interactive Spark shell against the cluster (specify the master IP), run the following command from the bin folder.
We are executing it from the slave node.
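With the master started on spark0 at port 8090 as above, the command would be roughly:
#pyspark --master spark://spark0:8090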
Let’s make a new RDD from the text of the README file in the Spark source directory:
textFile = sc.textFile("README.md")
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset
of the items in the file. Use collect to display the output.
linesWithSpark = textFile.filter(lambda line : "Spark" in line)
linesWithSpark.collect()
We can chain together transformations and actions:
textFile.filter(lambda line : "Spark" in line).count()   # How many lines contain "Spark"?
Let’s say we want to find the line with the most words:
textFile.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a if (a > b) else b)
You need to copy the data files to both nodes. However, for our lab they are already present in the spark folder.
https://towardsdatascience.com/diy-apache-spark-docker-bb4f11c10d24
For me, the first VM hostname is hp.com and the second one is ht.com. Ensure that you follow the same nomenclature to avoid confusion; it is very important.
Host/VM   Container   Remarks
hp.com    hadoop0     YARN services
ht.com    spark0      Spark client or Spark
Using Docker:
#docker run -it --name hadoop0 --privileged -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 8032:8032 \
  -p 8188:8188 -p 8020:8020 --hostname hadoop0 centos:7 /usr/sbin/init
Verify the hostname and /etc/hosts. You need to enter each IP and hostname as shown above; update the details accordingly on your machine.
You should see a window like the following for ht.com; change the IP to match your system.
All of the commands below should be executed on hp.com (the Hadoop node) unless specified otherwise. We are now configuring the YARN cluster.
Untar as follows:
tar -xvf hadoop-X.tar.gz -C /opt
Unpack the downloaded Hadoop distribution. In the distribution, edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh to define some
parameters as follows:
export JAVA_HOME=/opt/jdk
# cd /opt/hadoop
$ bin/hadoop
You need to modify some settings as follows; replace the hostname with yours accordingly.
etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop0:8020</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
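The key-setup commands are the standard ones from the Hadoop single-node guide:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
The exports below are additionally needed so that the HDFS and YARN start scripts can run as the root user.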
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
$ sbin/start-dfs.sh
1. Browse the web interface for the NameNode; by default it is available at:
o NameNode - http://localhost:9870/
2. Make the HDFS directories required to execute MapReduce jobs:
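These are the usual commands from the Hadoop setup guide, using root since the labs run as root:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/root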
etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HAD
OOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
$ sbin/start-yarn.sh
3. Browse the web interface for the ResourceManager; by default it is available at:
o ResourceManager - http://localhost:8088/
4. When you’re done, you can stop the daemons with: Optional.
$ sbin/stop-yarn.sh
Hadoop Node - Let us set some environment and path variable as follows: ( vi ~/.bashrc)
export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
Let us create a temporary folder, that will be used for working space for the YARN cluster.
hadoop fs -mkdir /tmp
hadoop fs -chmod -R 1777 /tmp
hadoop fs -mkdir /tmp/in
hadoop fs -ls /tmp
You can get the README.md file from the Software folder, which is provided along with the training.
hadoop fs -copyFromLocal /opt/hadoop/README.txt /tmp/in
hadoop fs -ls /tmp/in
If you are using docker, join the hadoop0 and spark0 containers in a single network:
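A sketch of that step, assuming the spark-net network created earlier:
#docker network connect spark-net hadoop0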
Verify it:
#docker network inspect spark-net
Start the Spark VM / spark container (i.e. ht.com) if it is not yet started, and issue the following command.
Log on to the machine using telnet. You can use the VM shared-folder option to copy the file from your workstation to the VM, or copy it with the mouse from your workstation to the specified folder.
Let us configure the Hadoop client settings on the Spark VM, i.e. the ht.com or spark0 node.
Copy the compressed Hadoop folder to the spark0 or spark node and uncompress it in the /opt folder.
[Docker:
copy from hadoop to host : docker cp hadoop0:/opt/hadoop.tar .
copy from host to spark : docker cp hadoop.tar spark0:/opt/]
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
</property>
Create the following folder on HDFS cluster and copy the file in the folder.
# hdfs dfs -mkdir -p /opt/spark
# hdfs dfs -copyFromLocal /opt/spark/README.md /opt/spark/
You can verify the file with the following command. This command will display the content of the file in the
HDFS cluster.
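The verification command is presumably along the lines of:
# hdfs dfs -cat /opt/spark/README.md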
Export the Hadoop and YARN environment variables pointing to the Hadoop conf directory on the Spark node.
"""SimpleApp.py"""
from pyspark.sql import SparkSession
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
spark.stop()
#export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
#export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
#./spark-submit --master yarn --deploy-mode cluster --executor-memory 512m --num-executors 2 SimpleApp.py
or run it in client deploy mode (--deploy-mode client).
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster.
Note: the application .py/.jar should be on the local file system, the input folder is in HDFS, and the output folder must not already exist in HDFS; it will be created automatically.
After some time, once the program exits in the console, click on Finished (in the YARN UI).
Since we have printed the output, you can verify it from the log file only.
Go to the user-log folders of the Hadoop installation (the job ID and container ID will be different on your machine; update them accordingly).
#cd /opt/hadoop/logs/userlogs
#more application_1625129460519_0004/container_1625129460519_0004_01_000001/stdout
Task: let us write the output to a file instead of the console.
Create a file SimpleAppP.py and update it with the following code. It writes only the DataFrame rows whose line contains an ‘a’.
# -------------------------- Code Begin ---------------------
"""SimpleAppP.py"""
from pyspark.sql import SparkSession

# Assumes the README.md copied to /opt/spark in HDFS in the earlier step.
spark = SparkSession.builder.appName("SimpleAppP").getOrCreate()
logData = spark.read.text("/opt/spark/README.md").cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
myDF = logData.filter(logData.value.contains('a'))
myDF.write.save("mydatap")
spark.stop()
At the end of the execution, you should have the following output in the console.
You can verify the output file in the hdfs as shown below:
#hdfs dfs -ls -R /user/root/mydatap
# hdfs dfs -cat /user/root/mydatap/part-00000-554ff738-ee36-449c-b654-9a7fa92f17b0-c000.snappy.parquet
http://localhost:9870/explorer.html#/
Click Utilities --> Browse the file system → /user/root/output (enter it in the Directory field) → Go
Congrats!
---------------------------------------------- End of Lab-----------------------------------
Errata:
1) 15/06/14 21:50:46 INFO storage.BlockManagerMasterActor: Registering block manager ht.com:50475 with
267.3 MB RAM, BlockManagerId(<driver>, ht.com, 50475)
15/06/14 21:50:46 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-events/application_1434297324587_0001.inprogress, expected: hdfs://hp.com
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1262)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:1262)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:128)
Check that spark.eventLog.dir begins with hdfs:// and that /tmp/spark-events has already been created in the HDFS system.
You can verify this on the YARN cluster only.
Yarn-site.xml
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value> <!-- 4 GB -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value> <!-- 512 MB -->
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>256</value> <!-- 256 MB -->
</property>
yarn-site.xml – Spark Node. (Specify the hostname of the Hadoop RM Node as shown below)
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>
</property>
3) If unhealthy nodes are displayed in the cluster URL with "1/1 local-dirs are bad: /tmp/hadoop-yarn/nm-local-dir":
Solution: delete the folder using the following command and ensure that the nodes are healthy before proceeding; you can restart if required.
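A sketch of that cleanup, run on the node reporting the bad directory:
#rm -rf /tmp/hadoop-yarn/nm-local-dir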
You can run jps on the YARN cluster while a Spark job is executing; it will show Coarse* (CoarseGrainedExecutorBackend) processes.
Issue:
Verify the IP of the ResourceManager and set the home directory appropriately.
Try updating the following setting in yarn-site.xml on both the Hadoop and Spark nodes.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop0:8032</value>
<description>The hostname of the RM.</description>
</property>
-------------------------------- ----------------------------------------------------------------------------
{"firstName":"Grace","lastName":"Hopper"}
{"firstName":"Alan","lastName":"Turing"}
{"firstName":"Ada","lastName":"Lovelace"}
{"firstName":"Charles","lastName":"Babbage"}
http://127.0.0.1:8080
You can verify the job details using the Spark UI: http://master:4040/jobs/
The web UI comes with the following tabs (not all of which may be visible at once, as they are lazily created on demand, e.g. the Streaming tab):
§ Jobs
§ Stages
§ Storage with RDD size and memory use
§ Environment
§ Executors
§ SQL
The Jobs Tab shows status of all Spark jobs in a Spark application
This displays the timeline of executors being added to the job, when the job completed, and so on.
In the above, you can verify that 2 executors have been added.
When you hover over a job in the Event Timeline, you not only see the job legend but the job is also highlighted in the Summary section.
The Event Timeline section shows not only jobs but also executors.
You can verify the completed and failed jobs too:
You can answer some of the following queries from this console:
How many stages and tasks were created per job? (2: 1 stage and only one task)
Click on job 1 (the write action or job) -> Description -> stages to view the DAG. When you click a job on the All Jobs page, you see the Details for Job page.
As seen above, there is 1 stage in the job; it scans the file and creates a MapPartitionsRDD.
Questions:
How long does the scheduler take to schedule the task? (Check the Light Blue Block)
How long does the task deserialization take? (Check the Orange Block)
What is the actual computation time? (Verify the Green Color block)
Scroll down to view the task metrics: which tasks take the majority of the time, how many bytes are consumed or output by each task, and related statistics.
You can verify from the Locality_Level column above that the data is processed locally, i.e. data locality.
Executors Tab
Statistics for each executor: how much time it spent on GC, how much data was shuffled, and so on. The Executors tab in the web UI shows this summary for every executor.
You can view the statistics of the SQL query: how many rows were read, how many bytes, etc.
You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by
using
$ nc -lk 9999
Then, any lines typed in the terminal running the netcat server will be counted and printed on screen
every second. It will look something like the following.
First, we import StreamingContext, which is the main entry point for all streaming functionality. We
create a local StreamingContext with two execution threads, and batch interval of 1 second.
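The corresponding snippet from the Spark Streaming guide, which this walkthrough follows:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)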
This lines DStream represents the stream of data that will be received from the data server. Each record
in this DStream is a line of text. Next, we want to split the lines by space into words.
flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new
records from each record in the source DStream. In this case, each line will be split into multiple words
and the stream of words is represented as the words DStream. Next, we want to count these words.
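Again following the guide:
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)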
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs,
which is then reduced to get the frequency of words in each batch of data.
Finally, wordCounts.pprint() will print a few of the counts generated every second.
Note that when these lines are executed, Spark Streaming only sets up the computation it will perform
when it is started, and no real processing has started yet. To start the processing after all the
transformations have been setup, we finally call
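That is:
ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate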
Steps to be performed:
§ Install Kafka and start it along with producer.
§ Write python program to connect to Kafka and get data from Kafka using streaming
§ Execute it to fetch the data.
Use the Spark VM
Install kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic
Let's create a topic named "test" with a single partition and only one replica. Open one more terminal:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Start a consumer
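The standard console consumer, assuming the broker on localhost:9092 and the test topic created above:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning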
You have successfully configured Kafka. Now let us fetch information using Spark.
import sys
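A sketch of the streaming read, assuming the local broker and the test topic created above (the spark-sql-kafka package must be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaConsole").getOrCreate()

# Subscribe to the "test" topic on the local Kafka broker
lines = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()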
# Expression that reads in the raw data from the dataframe as a string and names the column "words"
lines = lines.selectExpr("CAST(value AS STRING) as words")
query = lines.writeStream.outputMode("append").format("console").start()
# Terminates the stream on abort
query.awaitTermination()
--------------
Whatever you push to Kafka should be written to the console, as shown below.
To run this on your local machine, you need to first run a Netcat server.
$ nc -lk 9999
We will be sending message to the above netcat server and then apply transformation to it using Spark Streaming.
#pyspark
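The import needed here is presumably:
from pyspark.streaming import StreamingContext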
The above commands import the packages required to run the Spark Streaming application.
# Initialize the StreamingContext using the pre-created Spark context, sc. The batch interval is 5 seconds, i.e. micro-batches of 5 seconds.
ssc = StreamingContext(sc, 5)
# Specify the checkpoint directory, since we are going to maintain state across the batch.
ssc.checkpoint("checkpoint")
# Define the function that will update the state of the count.
def updateFunc(new_values, last_sum):
    return sum(new_values) + (last_sum or 0)
# Apply the transformation logic to count the words. flatMap is used here since it can return a list of objects.
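A sketch of that transformation, reading from the netcat server on port 9999 and maintaining state with updateFunc:
lines = ssc.socketTextStream("localhost", 9999)
running_counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(updateFunc)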
running_counts.pprint()
# start the streaming context and wait for termination from external.
ssc.start()
ssc.awaitTermination()
Snippets:
On the netcat terminal, type sentences like the following, with some interval between the sentences.
As you can observe, the word “Henry” has occurred twice across the data records; thus the state has been maintained.
from os.path import abspath
from pyspark.sql import SparkSession, Row
# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('/opt/spark/spark-warehouse')
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config("spark.sql.catalogImplementation", "hive") \
.enableHiveSupport() \
.getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
# The results of SQL queries are themselves DataFrames and support all normal functions.
sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
# The items in DataFrames are of type Row, which allows you to access each column by ordinal.
stringsDS = sqlDF.rdd.map(lambda row: "Key: %d, Value: %s" % (row.key, row.value))
for record in stringsDS.collect():
    print(record)
# Key: 0, Value: val_0
# Key: 0, Value: val_0
# You can also use DataFrames to create temporary views within a SparkSession.
Record = Row("key", "value")
recordsDF = spark.createDataFrame([Record(i, "val_" + str(i)) for i in range(1, 101)])
recordsDF.createOrReplaceTempView("records")
# Queries can then join DataFrame data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
Execution Output:
export SPARK_HOME=/opt/spark
export HIVE_HOME=/opt/hive_local
You could also add a file "/etc/sysconfig/network-scripts/ifcfg-eth1" and then bring up eth1.
Changing hostname:
§ /etc/hosts
§ vi /etc/sysconfig/network
§ sysctl kernel.hostname=slave
§ bash
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
... 85 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Solution:
§ include host to ip entry in /etc/hosts --> 192.168.188.178 master
§ update in conf/spark-env.sh
o SPARK_LOCAL_IP=127.0.0.1
o SPARK_MASTER_HOST=192.168.188.178
Create an entry in /etc/fstab so that the system always mounts the DVD image after a reboot.
/mnt/hgfs/MyExperiment/rhel-server-6.4-x86_64-dvd.iso /redhatimg iso9660 loop,ro 0 0
/etc/yum.repos.d/rheldiso.repo
[rhel6dvdiso]
name=RedHatOS DVD ISO
mediaid=1359576196.686790
baseurl=file:///redhatimg
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release