Intro to Spark
Kyle Burke - IgnitionOne
Data Science Engineer
March 24, 2016
https://www.linkedin.com/in/kyleburke
Today’s Topics
• Why Spark?
• Spark Basics
• Spark Under the Hood.
• Quick Tour of Spark Core, SQL, and Streaming.
• Tips from the Trenches.
• Setting up Spark Locally.
• Ways to Learn More.
Who am I?
• Background mostly in Data Warehousing with
some app and web development work.
• Currently a data engineer/data scientist with
IgnitionOne.
• Began using Spark last year.
• Currently using Spark to read data from Kafka
stream to load to Redshift/Cassandra.
Why Spark?
• You find yourself writing code to parallelize data and then
have to resync.
• Your database is overloaded and you want to offload some of
the workload.
• You’re being asked to perform both batch and streaming
operations with your data.
• You’ve got a bunch of data sitting in files that you’d like to
analyze.
• You’d like to make yourself more marketable.
Spark Basics Overview
• Spark Conf – Contains config information about your app.
• Spark Context – Represents the connection to the cluster. Lives in the driver, which defines jobs and constructs the DAG that outlines work on the cluster.
• Resilient Distributed Dataset (RDD) – Can be thought of as a distributed collection.
• SQL Context – Entry point into Spark SQL functionality. You only need a Spark Context to create one (see the sketch after this list).
• DataFrame – Can be thought of as a distributed collection of rows with named columns, similar to a database table.
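A minimal sketch of how these pieces fit together, assuming the Spark 1.x APIs used elsewhere in this deck (the app name and JSON path are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("BasicsOverview") // app-level config
val sc = new SparkContext(conf) // driver-side connection to the cluster
val rdd = sc.parallelize(Seq(1, 2, 3)) // an RDD: a distributed collection
val sqlContext = new SQLContext(sc) // built from the SparkContext alone
val df = sqlContext.read.json("/path/to/sample.json") // a DataFrame: rows with named columns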
Spark Core
• First you’ll need to create a SparkConf and SparkContext.
val conf = new SparkConf().setAppName("HelloWorld")
val sc = new SparkContext(conf)
• Using the SparkContext, you can read in data from Hadoop-compatible and local file systems.
val clicks_raw = sc.textFile(path_to_clicks)
val ga_clicks = clicks_raw.filter(s => s.contains("Georgia")) // transformation
val ga_clicks_cnt = ga_clicks.count // This is an action
• The map function allows operations to be performed on each row of an RDD.
• Lazy evaluation means that no data processing occurs until an action happens (sketched below).
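A rough illustration of map plus lazy evaluation, reusing clicks_raw from above (the tab-delimited click format is an assumption, not from the deck):
// Lazy transformation: parse each line into its fields; no work happens yet.
val click_fields = clicks_raw.map(line => line.split("\t"))
// Another lazy transformation: keep only rows with at least 3 fields.
val valid_clicks = click_fields.filter(fields => fields.length >= 3)
// count is an action: only now does Spark read the files and run the pipeline.
val valid_cnt = valid_clicks.count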
Spark SQL
• Allows DataFrames to be registered as temporary tables.
rawbids = sqlContext.read.parquet(parquet_directory)
rawbids.registerTempTable("bids")
• Tables can be queried using SQL or HiveQL.
sqlContext.sql("SELECT url, insert_date, insert_hr from bids")
• Supports user-defined functions (PySpark example; a Scala equivalent is sketched after this list).
import urllib
from pyspark.sql.types import StringType
sqlContext.registerFunction("urlDecode", lambda s: urllib.unquote(s), StringType())
bids_urls = sqlContext.sql("SELECT urlDecode(url) from bids")
• First-class support for complex data types (e.g., those typically found in JSON structures).
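For reference, a rough Scala equivalent of the UDF registration above, sketched with the Spark 1.x sqlContext.udf.register API and java.net.URLDecoder standing in for Python's urllib:
import java.net.URLDecoder

// Register a Scala function as a SQL UDF under the name "urlDecode".
sqlContext.udf.register("urlDecode", (s: String) => URLDecoder.decode(s, "UTF-8"))
val bids_urls = sqlContext.sql("SELECT urlDecode(url) from bids")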
Spark Core Advanced Topics
Spark Streaming
• Streaming Context – Context used to create and manage streams on the cluster.
• DStream – A sequence of RDDs. Formally, a discretized stream.
//File Stream Example
val ssc = new StreamingContext(conf, Minutes(1))
val ImpressionStream = ssc.textFileStream(path_to_directory)
ImpressionStream.foreachRDD((rdd, time) => {
//normal rdd processing goes here
})
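A slightly fuller sketch of the same file-stream example, assuming the conf and path_to_directory values from above; nothing runs until ssc.start() is called:
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute micro-batches
val ImpressionStream = ssc.textFileStream(path_to_directory)
ImpressionStream.foreachRDD((rdd, time) => {
  // Normal RDD processing goes here, e.g. count each batch.
  println(s"Batch at $time contained ${rdd.count} impressions")
})
ssc.start() // begin receiving and processing data
ssc.awaitTermination() // block until the stream is stopped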
Tips
• Use mapPartitions if you’ve got expensive objects to instantiate.
def partitionLines(lines: Iterator[String]) = {
  // e.g. an opencsv CSVParser: constructed once per partition, not once per line
  val parser = new CSVParser('\t')
  lines.map(parser.parseLine(_).size)
}
rdd.mapPartitions(partitionLines)
• Cache RDDs you’re going to reuse.
rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
• Partition files to improve read performance
all_bids.write
.mode("append")
.partitionBy("insert_date","insert_hr")
.json(stage_path)
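To show the read-side benefit of the partitioned layout above, a sketch of reading it back with a filter on the partition columns (stage_path and the date/hour values are placeholders, and it assumes the JSON source discovers insert_date/insert_hr as partition columns):
val bids = sqlContext.read.json(stage_path)
// Filtering on partition columns prunes directories instead of scanning everything.
val recent = bids.filter("insert_date = '2016-03-24' AND insert_hr = 10")
recent.count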
Tips (Cont’d)
• Save DataFrames to JSON/Parquet (a sketch follows this list).
• CSV is more cumbersome to deal with, but the spark-csv package helps.
• Avro data conversions seem buggy.
• Parquet is the format where the most performance-optimization effort is going.
• The Spark History Server is helpful for troubleshooting.
– Start it by running "$SPARK_HOME/sbin/start-history-server.sh"
– By default you can access it on port 18080.
• Hive external tables
• Check out spark-packages.org
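A sketch of the save and CSV-load patterns mentioned above, assuming the spark-csv package (com.databricks:spark-csv) is available; output_path and csv_path are placeholders:
// Save a DataFrame as Parquet (or use .json(...) for JSON output).
all_bids.write.mode("overwrite").parquet(output_path)

// Read a CSV file through the spark-csv data source.
val csv_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csv_path)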
Spark Local Setup
• Download and place the tgz in a Spark folder.
>> mkdir Spark
>> mv spark-1.6.1.tgz Spark/
• Untar the Spark tgz file.
>> tar -xvf spark-1.6.1.tgz
• cd into the extracted folder.
>> cd spark-1.6.1
• Give Maven extra memory.
>> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
• Build Spark.
>> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
Ways To Learn More
• Edx Course: Intro to Spark
• Spark Summit – Previous conferences are
available to view for free.
• Big Data University – IBM’s training.
Editor's Notes
  1. DAG – Directed Acyclic Graph. When the user runs an action (like collect), the graph is submitted to a DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages; a stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages, which are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos) and doesn't know about dependencies among stages. The worker executes the tasks. A new JVM is started per job. The worker knows only about the code that is passed to it.