FEELIN' THE 
FLOW 
GETTING YOUR DATA MOVING 
WITH SPARK AND CASSANDRA 
Presented by Rich Beaudoin / @RichGBeaudoin 
October 14th, 2014
ABOUT ME... 
Sr. Software Engineer at Pearson 
Organizer of Distributed Computing Denver 
Lover of Music 
All around solid dude
OVERVIEW 
What is Spark 
The problem it solves 
The core concepts 
Spark integration with Cassandra 
Tables as RDDs 
Writing RDDs to Cassandra 
Questions and Summary
WHAT IS SPARK? 
Apache Spark™ is a fast and general engine 
for large-scale data processing. 
Created by AMPLab at UC Berkeley 
Became Apache Top-Level Project in 2014 
Supports Scala, Java, and Python APIs
THE PROBLEM, PART 
ONE... 
Approaches like MapReduce read from, and store to, HDFS 
...so each cycle of processing incurs latency from HDFS reads
THE PROBLEM, PART 
TWO... 
Any robust, distributed data processing framework needs fault 
tolerance 
But existing solutions allow "fine-grained" (cell-level) 
updates, which complicate the handling of faults where 
data needs to be rebuilt or recalculated
SPARK ATTEMPTS TO 
ADDRESS THESE TWO 
PROBLEMS 
Solution 1: store intermediate results in memory 
Solution 2: introduce a new expressive data abstraction
RDD 
A Resilient Distributed Dataset (RDD) is an 
immutable, partitioned collection of records that 
supports basic operations (e.g. map, filter, join). It 
maintains a graph of transformations (its lineage) 
in order to enable recovery of a lost partition 
*See the RDD white paper for more details
TRANSFORMATIONS 
AND ACTIONS 
"transformation" creates another RDD, is evaluated lazily 
"action" returns a value, evaluated immediately
RDDS ARE EXPRESSIVE 
It turns out that coarse-grained operations cover many existing 
parallel computing cases 
Consequently, the RDD abstraction can implement existing 
systems like MapReduce, Pregel, Dryad, etc.
SPARK CLUSTER 
OVERVIEW 
Spark can be run with Apache Mesos, Hadoop YARN, or its own standalone cluster manager
JOB SCHEDULING AND 
STAGES
SPARK AND CASSANDRA 
If we can turn Cassandra data into RDDs, and RDDs into 
Cassandra data, then the data can start flowing between the 
two systems and give us some insight into our data. 
The Spark Cassandra Connector allows us to perform the 
transformation from Cassandra table to RDD and then back 
again!
THE SETUP
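A minimal sketch of what the setup might look like in code, assuming a Cassandra node and a standalone Spark master at placeholder addresses (in spark-shell, sc already exists; this is the standalone-app variant): 

import org.apache.spark.{SparkConf, SparkContext} 
import com.datastax.spark.connector._ 

val conf = new SparkConf(true) 
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder: address of a Cassandra node 
val sc = new SparkContext("spark://127.0.0.1:7077", "feelin-the-flow", conf)   // placeholder master URL and app name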
FROM CASSANDRA 
TABLE TO RDD 
import org.apache.spark._ 
import com.datastax.spark.connector._ 
val rdd = sc.cassandraTable("music", "albums_by_artist") 
Run these commands in spark-shell; this requires specifying the spark-cassandra-connector jar on the command line
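For example (the jar name and version here are hypothetical): 

spark-shell --jars spark-cassandra-connector-assembly-1.1.0.jar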
SIMPLE MAPREDUCE FOR 
RDD COLUMN COUNT 
val count = rdd.map(x => (x.get[String]("label"),1)).reduceByKey(_ + _)
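Note that reduceByKey is itself a transformation, so nothing executes until an action runs; for example, to inspect the label counts on the driver: 

count.collect().foreach(println)   // action: pulls the (label, count) pairs back and prints them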
SAVE THE RDD TO 
CASSANDRA 
count.saveToCassandra("music", "label_count", SomeColumns("label", "count"))
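saveToCassandra writes into an existing table, so music.label_count must already be defined; a plausible CQL definition (an assumption, not shown on the slide): 

CREATE TABLE music.label_count ( 
  label text PRIMARY KEY, 
  count int 
);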
CASSANDRA WITH 
SPARKSQL 
import org.apache.spark.sql.cassandra.CassandraSQLContext 
val cc = new CassandraSQLContext(sc) 
val rdd = cc.sql("SELECT * from music.label_count")
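cc.sql returns a SchemaRDD (Spark 1.1), which can be consumed like any other RDD; an illustrative follow-up query, with an arbitrary threshold: 

val popular = cc.sql("SELECT label, count FROM music.label_count WHERE count > 5") 
popular.collect().foreach(println)   // action: pull the filtered rows to the driver and print them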
JOINS!!! 
import sqlContext.createSchemaRDD 
import org.apache.spark.sql._ 
case class LabelCount(label: String, count: Int) 
case class AlbumArtist(artist: String, album: String, label: String, year: Int) 
case class AlbumArtistCount(artist: String, album: String, label: String, year: Int, count: Int) 
val albumArtists = sc.cassandraTable[AlbumArtist]("music", "albums_by_artist").cache 
val labelCounts = sc.cassandraTable[LabelCount]("music", "label_count").cache 
val albumsByLabelId = albumArtists.keyBy(x => x.label) 
val countsByLabelId = labelCounts.keyBy(x => x.label) 
val joinedAlbums = albumsByLabelId.join(countsByLabelId).cache 
val albumArtistCountObjects = joinedAlbums.map(x => new AlbumArtistCount(x._2._1.artist, x._2._1.album, x._2._1.label, x._2._1.year, x._2._2.count))
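From here the joined objects can flow back out the same way; for example, saving into a hypothetical music.album_artist_count table (not part of the original deck): 

albumArtistCountObjects.saveToCassandra("music", "album_artist_count", 
  SomeColumns("artist", "album", "label", "year", "count"))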
OTHER THINGS TO 
CHECK OUT 
Spark Streaming 
Spark SQL
QUESTIONS?
THE END 
References 
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (the RDD white paper) 
Spark Programming Guide 
Apache Spark Website 
DataStax Spark Cassandra Connector Documentation
