Big Data Analytics
with Scala
Sam BESSALAH
@samklr
What is Big Data Analytics?

It’s about doing aggregations and running
complex models on large datasets, offline, in
real time or both.
Lambda Architecture
Blueprint for a Big Data analytics
architecture
Map Reduce redux
map : (Km, Vm) → List(Km, Vm)
in Scala : T => List[(K, V)]
reduce : (Km, List(Vm)) → List(Kr, Vr)
in Scala : (K, List[V]) => List[(K, V)]
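A minimal sketch of those two shapes on plain Scala collections (no cluster involved, just to make the signatures concrete; the input lines are made up):

// map phase : one line in, a list of (word, 1) pairs out   -- T => List[(K, V)]
def mapper(line: String): List[(String, Int)] =
  line.split("\\s+").toList.map(word => (word, 1))

// reduce phase : one key with all its values in, aggregated pairs out   -- (K, List[V]) => List[(K, V)]
def reducer(word: String, counts: List[Int]): List[(String, Int)] =
  List((word, counts.sum))

val lines = List("hello world", "hello scala")
val counted = lines
  .flatMap(mapper)                                        // map phase
  .groupBy(_._1)                                          // the framework's shuffle / group-by-key
  .flatMap { case (w, kvs) => reducer(w, kvs.map(_._2)) } // reduce phase
// counted == Map(hello -> 2, world -> 1, scala -> 1)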
Big data "Hello World" : Word count
Enters Cascading
Word Count Redux
(Flat)Map -Reduce
SCALDING
class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}
SCALDING : Clustering with Mahout
// Uses Mahout's streaming k-means classes (StreamingKMeans, FastProjectionSearch,
// Centroid, DenseVector) plus scala.collection.JavaConverters._ for asScala.
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

var count = 0   // var: incremented once per input vector to number the centroids

val sloppyClusters =
  TextLine(args("input"))
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl }
    .flatMap(c => c.iterator.asScala.toIterable)
SCALDING : Clustering with Mahout
val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala
  }
  .values
Scalding

- Two APIs : the Field-based API and the Typed API
- Field API : project, map, discard, groupBy…
- Typed API : TypedPipe[T], works like scala.collection.Iterator[T] (see the sketch after this list)
- Matrix library
- ALGEBIRD : abstract algebra library … we’ll talk about it later
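A rough sketch of word count in the Typed API; the exact types and write sinks vary a bit between Scalding versions, so treat this as an approximation rather than the canonical form:

class TypedWordCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one element per line
    .flatMap(_.split("\\s+"))
    .groupBy(identity)                      // group occurrences of each word
    .size                                   // count per word
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}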
STORM

- Distributed, fault-tolerant, real-time stream computation engine.
- Four concepts :
  - Streams : infinite sequences of tuples
  - Spouts : sources of streams
  - Bolts : process and produce streams; can do filtering, aggregations, joins, …
  - Topologies : define a flow or network of spouts and bolts.
Streaming Word Count
Trident
TridentTopology topology = new TridentTopology();

TridentState wordCounts =
  topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);
ScalaStorm by Evan Chan

class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(line: String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}
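To run, the bolt still has to be wired into a topology. A hedged sketch using Storm's TopologyBuilder from Scala; RandomSentenceSpout and WordCountBolt are assumed stand-ins for whatever spout and counting bolt the real job uses:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

val builder = new TopologyBuilder
builder.setSpout("sentences", new RandomSentenceSpout, 2)   // assumed spout emitting a "sentence" field
builder.setBolt("split", new SplitSentence, 4).shuffleGrouping("sentences")
builder.setBolt("count", new WordCountBolt, 4)              // assumed counting bolt
  .fieldsGrouping("split", new Fields("word"))

new LocalCluster().submitTopology("wordcount", new Config, builder.createTopology())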
SummingBird

Write your job once and run it on Storm and Hadoop

def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { line => line.split("\\s+").map(_ -> 1L) }
    .sumByKey(store)
SummingBird
trait Platform[P <: Platform[P]] {
  type Source[+T]
  type Store[-K, V]
  type Sink[-T]
  type Service[-K, +V]
  type Plan[T]
}
On Storm

- Source[+T] : Spout[(Long, T)]
- Store[-K, V] : StormStore [K, V]
- Sink[-T] : (T => Future[Unit])
- Service[-K, +V] : StormService[K,V]
- Plan[T] : StormTopology
Type Safety
SummingBird dependencies

• StoreHaus
• Chill
• Scalding
• Algebird
• Tormenta
But

- Can only aggregate values whose combination is associative : Monoids!

trait Monoid[V] {
  def zero: V
  def aggregate(left: V, right: V): V
}
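As a concrete sketch (using only the trait above, nothing else assumed), here is a monoid over word-count maps; because the operation is associative, partial sums can be computed per batch or per node and merged in any order:

object MapSumMonoid extends Monoid[Map[String, Long]] {
  def zero: Map[String, Long] = Map.empty
  def aggregate(left: Map[String, Long], right: Map[String, Long]): Map[String, Long] =
    right.foldLeft(left) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }
}

// two partial aggregates, e.g. from two batches or two workers
val partial1 = Map("scala" -> 2L, "storm" -> 1L)
val partial2 = Map("scala" -> 1L, "spark" -> 3L)
val total = MapSumMonoid.aggregate(partial1, partial2)
// total == Map(scala -> 3, storm -> 1, spark -> 3)

Algebird ships this kind of monoid (there the combining method is called plus), and sketches like the CountMinSketchMonoid used later are monoids too, which is what lets SummingBird merge partial results across batches and platforms.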
Clustering with Mahout redux
// Same streaming k-means as the Scalding version, written against the SummingBird
// Platform abstraction; the signature below mirrors wordCount above.
def streamClustering[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[_, _]) = {
  lazy val clust = new StreamingKMeans(
    new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
    args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

  var count = 0

  val sloppyClusters =
    source
      .map { str =>
        val vec = str.split("\t").map(_.toDouble)
        val cent = new Centroid(count, new DenseVector(vec))
        count += 1
        cent
      }
      .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl }
      .flatMap(c => c.iterator.asScala.toIterable)
SCALDING : Clustering with Mahout
  val finalClusters = sloppyClusters.groupAll
    .mapValueStream { centList =>
      lazy val bclusterer = new BallKMeans(
        new BruteSearch(new EuclideanDistanceMeasure),
        args("numclusters").toInt, 100)
      bclusterer.cluster(centList.toList.asJava)
      bclusterer.iterator.asScala
    }
    .values
    .saveTo(store)
}
APACHE SPARK
What is Spark?

• Fast and expressive cluster computing system, compatible with Apache Hadoop but an order of magnitude faster
• Improves efficiency through:
  - General execution graphs
  - In-memory storage
• Improves usability through:
  - Rich APIs in Java, Scala, Python
  - Interactive shell
Key idea

• Write programs in terms of transformations on distributed datasets
• Concept: resilient distributed datasets (RDDs)
  - Collections of objects spread across a cluster
  - Built through parallel transformations (map, filter, etc.)
  - Automatically rebuilt on failure
  - Controllable persistence (e.g. caching in RAM)
Example: Word Count
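The original slide shows this example as a picture; a minimal sketch of the classic Spark word count (sc is the SparkContext, predefined in the spark shell; the paths are placeholders):

val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")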
Other RDD Operators

• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
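A quick sketch of a couple of the pair-RDD operators above (the data is made up; in a standalone program the pair operations need import org.apache.spark.SparkContext._):

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val visits = sc.parallelize(Seq((1, "login"), (1, "search"), (3, "logout")))

val joined   = users.join(visits)           // (id, (name, event)), only ids present in both
val withOpt  = users.leftOuterJoin(visits)  // (id, (name, Option[event])), keeps users with no visits
val filtered = users.map(_._2).filter(_.startsWith("a"))   // plain map / filter on the values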
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                 // base RDD
errors = lines.filter(s => s.startsWith("ERROR"))    // transformed RDD
messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()      // action: the driver ships tasks to the workers
messages.filter(s => s.contains("bar")).count()      // second query is served from the in-memory cache

(The slide shows this as a diagram: the driver sends tasks to the workers; each worker reads its block from HDFS and keeps its partition of messages in Cache 1/2/3.)

Result: full-text search of Wikipedia data in 0.5 sec (vs 20 s for on-disk data)
Result: scaled to 1 TB in 5 sec (vs 180 sec for on-disk data)
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.
Ex:

msgs = textFile.filter(_.startsWith("ERROR"))
               .map(_.split("\t"))

(Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD)
Spark Streaming

- Extends Spark capabilities to large-scale stream processing.
- Scales to 100s of nodes and achieves second-scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms
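As context for the examples below, a minimal sketch of setting up a streaming job with the Spark 0.8-era API the following slides assume (the master URL, app name and socket source are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations such as reduceByKey

// 1-second batch interval
val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)     // DStream[String]
val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)
counts.print()
ssc.start()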
Discretized Stream Processing

- Chop up the live data stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches

(Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
Discretized Stream Processing

- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and streaming processing in the same system

(Same pipeline as above: live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data
(Diagram: the Twitter Streaming API feeds the tweets DStream; each batch @ t, t+1, t+2 is stored in memory as an RDD, immutable and distributed)
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream
(Diagram: each batch of the tweets DStream is flatMapped into a batch of the new hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch)
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data: write to a database, update an analytics UI, …
(Diagram: each batch of tweets is flatMapped into hashTags, then foreach runs on every resulting RDD)
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage
(Diagram: every batch of the hashTags DStream is saved to HDFS)
Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(5)
(Diagram: a window of the given length slides over the DStream by the sliding interval)
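Windowed counts can also be maintained incrementally. A hedged sketch using reduceByKeyAndWindow with an inverse function, so each slide only adds the batch entering the window and subtracts the one leaving it (this variant needs checkpointing enabled):

ssc.checkpoint("hdfs://...")   // required for the invertible window reduce
val tagCounts = hashTags.map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Minutes(1), Seconds(5))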
Compute TopK IP addresses
val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY, ..)
val addresses = stream.map(ipAddress => ipAddress.getText)

val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero               // running sketch, merged across batches
val mm = new MapMonoid[Long, Int]()    // init

val topAddresses = addresses.mapPartitions(ids => ids.map(id => cms.create(id)))
  .reduce(_ ++ _)

topAddresses.foreach(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    val partialTopK = partial.heavyHitters.map(id => (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalCMS ++= partial
    val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    println(globalTopK.mkString("[", ",", "]"))
  }
})
Multi purpose analytics stack

(Stack diagram: SPARK at the core; Spark + Shark + Spark Streaming cover stream processing, batch processing and ad-hoc queries, with MLBASE, GraphX, BLINK DB and TACHYON built around them)
SPARK + SPARK STREAMING

- Almost similar API for batch and streaming
- Single platform with fewer moving parts
- Order of magnitude faster
References

Sam Ritchie : SummingBird
https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter

Chris Severs, Vitaly Gordon : Scalable Machine Learning with Scala
http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin

Apache Spark : http://spark.incubator.apache.org

Matei Zaharia : Parallel Programming with Spark