1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Cases
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
SparkR - Play Spark Using R (20160909 HadoopCon)
1. SparkR
- Play Spark Using R
Gil Chen
@HadoopCon 2016
Demo: http://goo.gl/VF77ad
2. about me
• R, Python & Matlab User
• Taiwan R User Group
• Taiwan Spark User Group
• Co-founder
• Data Scientist @
7. Spark Origin
• Apache Spark is an open source cluster computing
framework
• Originally developed at the University of California,
Berkeley's AMPLab
• The first 2 contributors of SparkR:
Shivaram Venkataraman & Zongheng Yang
https://amplab.cs.berkeley.edu/
13. RDD (Resilient Distributed Dataset)
https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD
Internally, each RDD is characterized by five main properties:
1. A list of partitions
2. A function for computing each split
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
https://docs.cloud.databricks.com/docs/latest/courses
14. RDD dependencies
• Narrow dependency: Each partition of the parent RDD is used by at most one partition of the child RDD. The task can therefore be executed locally, with no shuffle (e.g. map, flatMap, filter, sample).
• Wide dependency: Multiple child partitions may depend on one partition of the parent RDD, so data must be shuffled unless the parents are already hash-partitioned (e.g. sortByKey, reduceByKey, groupByKey, join). See the sketch below.
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
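A minimal sketch of the two dependency types, using SparkR's private RDD API (the ":::" functions are internal and unsupported, so treat this as illustrative only):
# Narrow vs. wide dependencies, sketched with SparkR's internal RDD API
pairs <- SparkR:::parallelize(sc, list(list("a", 1), list("b", 2), list("a", 3)), 2L)
# Narrow: lapply (map) runs on each partition locally -- no shuffle needed
doubled <- SparkR:::lapply(pairs, function(p) list(p[[1]], p[[2]] * 2))
# Wide: reduceByKey must co-locate equal keys -- this triggers a shuffle
sums <- SparkR:::reduceByKey(doubled, "+", 2L)
SparkR:::collect(sums)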
21. How does SparkR work?
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
22. Upgrading From SparkR 1.6 to 2.0
                                  Before 1.6.2              Since 2.0.0
data type naming                  DataFrame                 SparkDataFrame
read CSV                          package from Databricks   built-in
functions like approxQuantile()   X                         O
ML functions                      glm                       more (or use sparklyr)
SQLContext / HiveContext          sparkRSQL.init(sc)        merged into sparkR.session()
execution messages                very detailed             simple
launch on EC2                     API                       X
https://spark.apache.org/docs/latest/sparkr.html
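A quick illustration of the "read CSV" row (the file name is a placeholder): in 1.6.x the Databricks spark-csv package supplied the source, while 2.0.0 ships a built-in csv reader.
# SparkR 1.6.x: CSV support comes from the Databricks spark-csv package
df <- read.df(sqlContext, "data.csv",
              source = "com.databricks.spark.csv", header = "true")
# SparkR 2.0.0+: csv is a built-in source, and no sqlContext argument is needed
df <- read.df("data.csv", source = "csv", header = "true")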
25. Documents
• If you have to use the RDD API, refer to the AMPLab GitHub docs:
http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/
and call the internal functions with ":::",
e.g. SparkR:::textFile, SparkR:::lapply
• Otherwise, refer to the official SparkR documentation:
https://spark.apache.org/docs/2.0.0/api/R/index.html
26. Starting to Use SparkR (v1.6.2)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize the SparkContext
sc <- sparkR.init(appName = "Demo_SparkR")
# Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
27. Starting to Use SparkR (v2.0.0)
# Set the Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/")
# Load the SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize the SparkSession (replaces both SparkContext and SQLContext)
sparkR.session(appName = "Demo_SparkR")
# A separate SQLContext is no longer needed since 2.0.0
# sqlContext <- sparkRSQL.init(sc)
# your SparkR script
# ...
# ...
sparkR.session.stop()
28. DataFrames
# Load the flights CSV file using read.df
sdf <- read.df(sqlContext,"data_flights.csv",
"com.databricks.spark.csv", header = "true")
# Filter flights from JFK
jfk_flights <- filter(sdf, sdf$origin == "JFK")
# Group and aggregate flights to each destination
dest_flights <- summarize(
groupBy(jfk_flights, jfk_flights$dest),
count = n(jfk_flights$dest))
# Running SQL Queries
registerTempTable(sdf, "tempTable")
training <- sql(sqlContext,
"SELECT dest, count(dest) as cnt FROM tempTable
WHERE dest = 'JFK' GROUP BY dest")
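To pull the aggregated results back into plain R, collect the distributed DataFrame; a small usage sketch:
# Bring the aggregation back to the driver as a local R data.frame
local_dest <- collect(dest_flights)
head(local_dest)
# Or print the SQL query result without collecting it
showDF(training)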
29. Word Count
# read data into RDD
rdd <- SparkR:::textFile(sc, "data_word_count.txt")
# split word
words <- SparkR:::flatMap(rdd, function(line) {
strsplit(line, " ")[[1]]
})
# map: give 1 for each word
wordCount <- SparkR:::lapply(words, function(word) {
list(word, 1)
})
# reduce: sum the counts by key (word)
counts <- SparkR:::reduceByKey(wordCount, "+", 2)
# convert RDD to list
op <- SparkR:::collect(counts)
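Once collected, op is an ordinary R list of (word, count) pairs; a small sketch (base R only, names are illustrative) flattens it for inspection:
# Flatten the list of list(word, count) pairs into a local data.frame
word_counts <- do.call(rbind, lapply(op, function(p)
  data.frame(word = p[[1]], count = p[[2]], stringsAsFactors = FALSE)))
head(word_counts)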
42. Some Tricks
• Customize the Spark config at launch (see the sketch below)
• cache() SparkDataFrames you query repeatedly
• Some code won't run in RStudio; try running it from the terminal instead
• Third-party packages, such as the Databricks package for reading CSV files
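A minimal sketch of the first two tricks with the 2.0.0 API (the memory settings and file name are illustrative assumptions, not tuning advice):
# Launch the session with a custom Spark config
sparkR.session(appName = "Demo_SparkR",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.executor.memory = "2g"))
# Keep a frequently queried SparkDataFrame in memory
sdf <- read.df("data_flights.csv", source = "csv", header = "true")
cache(sdf)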
43. The Future of SparkR
• More MLlib APIs
• Advanced user-defined functions (UDFs)
• The sparklyr package from RStudio (see the sketch below)
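For comparison, a minimal sparklyr sketch (assumes the sparklyr, dplyr, and nycflights13 packages are installed; it mirrors the JFK example above):
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
# Copy a local data.frame into Spark, then query it with dplyr verbs
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>%
  filter(origin == "JFK") %>%
  count(dest)
spark_disconnect(sc)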
44. Reference
• SparkR: Scaling R Programs with Spark. Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016, June 2016.
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
• SparkR: Interactive R Programs at Scale. Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco.
https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
• Apache Spark Official Research
http://spark.apache.org/research.html
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Apache Spark Official Documentation
http://spark.apache.org/docs/latest/api/scala/
• AMPLab UC Berkeley - SparkR Project
https://github.com/amplab-extras/SparkR-pkg
• Databricks Official Blog
https://databricks.com/blog/category/engineering/spark
• R-bloggers: Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio
https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
46. Join Us
• Fansboard
• Web Designer (PHP & JavaScript)
• Editor (Facebook & Instagram)
• Vpon - Data Scientist
• Taiwan Spark User Group
• Taiwan R User Group
47. Thanks for your attention
& Taiwan Spark User Group
& Vpon Data Team