Intro to Spark
Kyle Burke - IgnitionOne
Data Science Engineer
March 24, 2016
https://www.linkedin.com/in/kyleburke
Today’s Topics
• Why Spark?
• Spark Basics
• Spark Under the Hood.
• Quick Tour of Spark Core, SQL, and Streaming.
• Tips from the Trenches.
• Setting up Spark Locally.
• Ways to Learn More.
Who am I?
• Background mostly in Data Warehousing with
some app and web development work.
• Currently a data engineer/data scientist with
IgnitionOne.
• Began using Spark last year.
• Currently using Spark to read data from Kafka
stream to load to Redshift/Cassandra.
Why Spark?
• You find yourself writing code to parallelize data and then
have to resync.
• Your database is overloaded and you want to offload some of
the workload.
• You’re being asked to perform both batch and streaming
operations with your data.
• You’ve got a bunch of data sitting in files that you’d like to
analyze.
• You’d like to make yourself more marketable.
Spark Basics Overview
• Spark Conf – Contains config information about your app.
• Spark Context – Represents the connection to the cluster. Lives in the driver, which defines jobs and constructs the DAG that outlines work on the cluster.
• Resilient Distributed Dataset (RDD) – Can be thought of as a distributed collection.
• SQL Context – Entry point into Spark SQL functionality. You only need a Spark Context to create one (see the sketch after this list).
• DataFrame – Can be thought of as a distributed collection of rows with named columns, similar to a database table.
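A minimal sketch of how these pieces fit together, assuming the Spark 1.x APIs used elsewhere in this deck (the app name and JSON path are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("BasicsOverview") // app-level config
val sc = new SparkContext(conf) // driver-side connection to the cluster
val rdd = sc.parallelize(Seq(1, 2, 3)) // an RDD: a distributed collection
val sqlContext = new SQLContext(sc) // built from the SparkContext alone
val df = sqlContext.read.json("/path/to/sample.json") // a DataFrame: rows with named columns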
Spark Core
• First you’ll need to create a SparkConf and SparkContext.
val conf = new SparkConf().setAppName("HelloWorld")
val sc = new SparkContext(conf)
• Using the SparkContext, you can read in data from Hadoop-compatible and local file systems.
val clicks_raw = sc.textFile(path_to_clicks)
val ga_clicks = clicks_raw.filter(s => s.contains("Georgia")) // transformation
val ga_clicks_cnt = ga_clicks.count // This is an action
• The map function allows operations to be performed on each row of an RDD.
• Lazy evaluation means that no data processing occurs until an action happens (sketched below).
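A rough illustration of map plus lazy evaluation, reusing clicks_raw from above (the tab-delimited click format is an assumption, not from the deck):
// Lazy transformation: parse each line into its fields; no work happens yet.
val click_fields = clicks_raw.map(line => line.split("\t"))
// Another lazy transformation: keep only rows with at least 3 fields.
val valid_clicks = click_fields.filter(fields => fields.length >= 3)
// count is an action: only now does Spark read the files and run the pipeline.
val valid_cnt = valid_clicks.count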
Spark SQL
• Allows DataFrames to be registered as temporary tables.
rawbids = sqlContext.read.parquet(parquet_directory)
rawbids.registerTempTable("bids")
• Tables can be queried using SQL or HiveQL.
sqlContext.sql("SELECT url, insert_date, insert_hr from bids")
• Supports user-defined functions (PySpark example; a Scala equivalent is sketched after this list).
import urllib
from pyspark.sql.types import StringType
sqlContext.registerFunction("urlDecode", lambda s: urllib.unquote(s), StringType())
bids_urls = sqlContext.sql("SELECT urlDecode(url) from bids")
• First-class support for complex data types (e.g., those typically found in JSON structures).
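For reference, a rough Scala equivalent of the UDF registration above, sketched with the Spark 1.x sqlContext.udf.register API and java.net.URLDecoder standing in for Python's urllib:
import java.net.URLDecoder

// Register a Scala function as a SQL UDF under the name "urlDecode".
sqlContext.udf.register("urlDecode", (s: String) => URLDecoder.decode(s, "UTF-8"))
val bids_urls = sqlContext.sql("SELECT urlDecode(url) from bids")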
Spark Core Advanced Topics
Spark Streaming
• Streaming Context – Context used to create and manage streams on the cluster.
• DStream – A sequence of RDDs. Formally, a discretized stream.
//File Stream Example
val ssc = new StreamingContext(conf, Minutes(1))
val ImpressionStream = ssc.textFileStream(path_to_directory)
ImpressionStream.foreachRDD((rdd, time) => {
//normal rdd processing goes here
})
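A slightly fuller sketch of the same file-stream example, assuming the conf and path_to_directory values from above; nothing runs until ssc.start() is called:
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute micro-batches
val ImpressionStream = ssc.textFileStream(path_to_directory)
ImpressionStream.foreachRDD((rdd, time) => {
  // Normal RDD processing goes here, e.g. count each batch.
  println(s"Batch at $time contained ${rdd.count} impressions")
})
ssc.start() // begin receiving and processing data
ssc.awaitTermination() // block until the stream is stopped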
Tips
• Use mapPartitions if you’ve got expensive objects to instantiate.
def partitionLines(lines: Iterator[String]) = {
  // e.g. an opencsv CSVParser: constructed once per partition, not once per line
  val parser = new CSVParser('\t')
  lines.map(parser.parseLine(_).size)
}
rdd.mapPartitions(partitionLines)
• Cache RDDs you’re going to reuse.
rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
• Partition files to improve read performance
all_bids.write
.mode("append")
.partitionBy("insert_date","insert_hr")
.json(stage_path)
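To show the read-side benefit of the partitioned layout above, a sketch of reading it back with a filter on the partition columns (stage_path and the date/hour values are placeholders, and it assumes the JSON source discovers insert_date/insert_hr as partition columns):
val bids = sqlContext.read.json(stage_path)
// Filtering on partition columns prunes directories instead of scanning everything.
val recent = bids.filter("insert_date = '2016-03-24' AND insert_hr = 10")
recent.count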
Tips (Cont’d)
• Save DataFrames to JSON/Parquet (a sketch follows this list).
• CSV is more cumbersome to deal with, but the spark-csv package helps.
• Avro data conversions seem buggy.
• Parquet is the format where the most performance-optimization effort is going.
• The Spark History Server is helpful for troubleshooting.
– Start it by running "$SPARK_HOME/sbin/start-history-server.sh"
– By default you can access it on port 18080.
• Hive external tables
• Check out spark-packages.org
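A sketch of the save and CSV-load patterns mentioned above, assuming the spark-csv package (com.databricks:spark-csv) is available; output_path and csv_path are placeholders:
// Save a DataFrame as Parquet (or use .json(...) for JSON output).
all_bids.write.mode("overwrite").parquet(output_path)

// Read a CSV file through the spark-csv data source.
val csv_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csv_path)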
Spark Local Setup
• Download and place the tgz in a Spark folder.
>> mkdir Spark
>> mv spark-1.6.1.tgz Spark/
• Untar the Spark tgz file.
>> tar -xvf spark-1.6.1.tgz
• cd into the extracted folder.
>> cd spark-1.6.1
• Give Maven extra memory.
>> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
• Build Spark.
>> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
Ways To Learn More
• Edx Course: Intro to Spark
• Spark Summit – Previous conferences are
available to view for free.
• Big Data University – IBM’s training.
Editor's Notes
  1. DAG – Directed Acyclic Graph. When the user runs an action (like collect), the graph is submitted to a DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages; a stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages, which are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos) and doesn't know about dependencies among stages. The worker executes the tasks. A new JVM is started per job. The worker knows only about the code that is passed to it.