FEELIN' THE 
FLOW 
GETTING YOUR DATA MOVING 
WITH SPARK AND CASSANDRA 
Presented by Rich Beaudoin / @RichGBeaudoin 
October 14th, 2014
ABOUT ME... 
Sr. Software Engineer at Pearson 
Organizer of Distributed Computing Denver 
Lover of Music 
All around solid dude
OVERVIEW 
What is Spark 
The problem it solves 
The core concepts 
Spark integration with Cassandra 
Tables as RDDs 
Writing RDDs to Cassandra 
Questions and Summary
WHAT IS SPARK? 
Apache Spark™ is a fast and general engine 
for large-scale data processing. 
Created by AMPLab at UC Berkeley 
Became Apache Top-Level Project in 2014 
Supports Scala, Java, and Python APIs
THE PROBLEM, PART 
ONE... 
Approaches like MapReduce read from, and store to, HDFS 
...so each cycle of processing incurs latency from HDFS reads
THE PROBLEM, PART 
TWO... 
Any robust, distributed data processing framework needs fault 
tolerance 
But existing solutions allow "fine-grained" (cell-level) 
updates, which complicate the handling of faults where 
data needs to be rebuilt or recalculated
SPARK ATTEMPTS TO 
ADDRESS THESE TWO 
PROBLEMS 
Solution 1: store intermediate results in memory 
Solution 2: introduce a new expressive data abstraction
RDD 
A Resilient Distributed Dataset (RDD) is an 
immutable, partitioned collection of records that 
supports basic operations (e.g. map, filter, join). It 
maintains a graph of transformations (its lineage) 
in order to enable recovery of a lost partition 
*See the RDD white paper for more details
TRANSFORMATIONS 
AND ACTIONS 
"transformation" creates another RDD, is evaluated lazily 
"action" returns a value, evaluated immediately
RDDS ARE EXPRESSIVE 
It turns out that coarse-grained operations cover many existing 
parallel computing cases 
Consequently, the RDD abstraction can implement existing 
systems like MapReduce, Pregel, Dryad, etc.
SPARK CLUSTER 
OVERVIEW 
Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager
JOB SCHEDULING AND 
STAGES
SPARK AND CASSANDRA 
If we can turn Cassandra data into RDDs, and RDDs into 
Cassandra data, then the data can start flowing between the 
two systems and give us some insight into our data. 
The Spark Cassandra Connector allows us to perform the 
transformation from Cassandra table to RDD and then back 
again!
THE SETUP
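A minimal sketch of what the setup might look like in code, assuming a Cassandra node and a standalone Spark master at placeholder addresses (in spark-shell, sc already exists; this is the standalone-app variant): 

import org.apache.spark.{SparkConf, SparkContext} 
import com.datastax.spark.connector._ 

val conf = new SparkConf(true) 
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder: address of a Cassandra node 
val sc = new SparkContext("spark://127.0.0.1:7077", "feelin-the-flow", conf)   // placeholder master URL and app name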
FROM CASSANDRA 
TABLE TO RDD 
import org.apache.spark._ 
import com.datastax.spark.connector._ 
val rdd = sc.cassandraTable("music", "albums_by_artist") 
Run these commands in spark-shell; this requires specifying the spark-cassandra-connector jar on the command line
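For example (the jar name and version here are hypothetical): 

spark-shell --jars spark-cassandra-connector-assembly-1.1.0.jar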
SIMPLE MAPREDUCE FOR 
RDD COLUMN COUNT 
val count = rdd.map(x => (x.get[String]("label"),1)).reduceByKey(_ + _)
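Note that reduceByKey is itself a transformation, so nothing executes until an action runs; for example, to inspect the label counts on the driver: 

count.collect().foreach(println)   // action: pulls the (label, count) pairs back and prints them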
SAVE THE RDD TO 
CASSANDRA 
count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))
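saveToCassandra writes into an existing table, so music.label_count must already be defined; a plausible CQL definition (an assumption, not shown on the slide): 

CREATE TABLE music.label_count ( 
  label text PRIMARY KEY, 
  count int 
);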
CASSANDRA WITH 
SPARKSQL 
import org.apache.spark.sql.cassandra.CassandraSQLContext 
val cc = new CassandraSQLContext(sc) 
val rdd = cc.sql("SELECT * from music.label_count")
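cc.sql returns a SchemaRDD (Spark 1.1), which can be consumed like any other RDD; an illustrative follow-up query, with an arbitrary threshold: 

val popular = cc.sql("SELECT label, count FROM music.label_count WHERE count > 5") 
popular.collect().foreach(println)   // action: pull the filtered rows to the driver and print them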
JOINS!!! 
import sqlContext.createSchemaRDD 
import org.apache.spark.sql._ 
case class LabelCount(label: String, count: Int) 
case class AlbumArtist(artist: String, album: String, label: String, year: Int) 
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int) 
val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artist").cache 
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache 
val albumsByLabelId = albumArtists.keyBy(x => x.label) 
val countsByLabelId = labelCounts.keyBy(x => x.label) 
val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache 
val albumArtistCountObjects = joinedAlbums.map(x => new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
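From here the joined objects can flow back out the same way; for example, saving into a hypothetical music.album_artist_count table (not part of the original deck): 

albumArtistCountObjects.saveToCassandra("music", "album_artist_count", 
  SomeColumns("artist", "album", "label", "year", "count"))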
OTHER THINGS TO 
CHECK OUT 
Spark Streaming 
Spark SQL
QUESTIONS?
THE END 
References 
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (the RDD white paper) 
Spark Programming Guide 
Apache Spark Website 
DataStax Spark Cassandra Connector Documentation
