This document discusses Spark Streaming and its use for near real-time ETL. It gives an overview of Spark Streaming, explains how it works internally, with receivers and workers processing streaming data, and walks through an example use case: building a recommender system that finds matches using both batch and streaming data. Key points include the streaming execution model, data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Paris Data Geek - Spark Streaming
1. Spark Streaming as Near Real-Time ETL
Paris Data Geek
18/09/2014
Djamel Zouaoui
@DjamelOnLine
2. Who am I?
Djamel Zouaoui
Director Of Engineering
@DjamelOnLine
#Data #Scala #RecSys #Tech #MachineLearning #NoSql #BigData #Spark #Dev #R #Architecture
3. What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
• Efficient
– General execution graphs
– In-memory storage
• Usable
– Rich APIs in Java, Scala, Python
– Interactive shell
4. RDD in Spark
• Resilient Distributed Dataset
• Storage abstraction for datasets in Spark
• Immutable
• Fault recovery
– Each RDD remembers how it was created, and can recover if any part of the data is lost
• 3 kinds of operations (see the sketch below)
– Transformations: lazy in nature; create a new dataset from an existing one
– Actions: return a value or export data after performing a computation
– Persistence: caching a dataset (on disk / RAM / mixed) for future operations
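To make the three kinds of operations concrete, here is a minimal Scala sketch (not from the slides; the file name and filter rule are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[2]"))

    val lines  = sc.textFile("events.log")          // base RDD (placeholder file)
    val errors = lines.filter(_.contains("ERROR"))  // transformation: lazy, returns a new RDD
    errors.persist(StorageLevel.MEMORY_AND_DISK)    // persistence: mixed RAM/disk caching

    println(errors.count())                         // action: triggers the actual computation
    println(errors.first())                         // second action reuses the cached data

    sc.stop()
  }
}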
10. What is Spark Streaming?
A project started in early 2012 that extends Spark for big data stream processing, and which:
• Scales to hundreds of nodes
• Achieves second-scale latencies
• Efficiently recovers from failures
• Integrates with batch and interactive processing
13. How does it work?
• Input source definition → input D-Stream
• D-Stream computations
– Window-level operations
– Stateful option
– …
• Classic RDD manipulation
– Transformations
– Actions
14. Code
TOPOLOGY FREE
//StreamingContext & Input source creation
//Standard transformations
//Window usage
//Start the streaming and put it in the background
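The code behind those four comments is not in the transcript; below is a minimal Scala sketch of the same four steps, assuming a socket source and illustrative batch and window sizes:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // StreamingContext & input source creation (10 s batch interval)
    val conf  = new SparkConf().setAppName("near-realtime-etl").setMaster("local[2]")
    val ssc   = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Standard transformations on the D-Stream
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Window usage: totals over the last 60 s, recomputed every 10 s
    val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
    counts.print()

    // Start the streaming (runs in the background) and wait
    ssc.start()
    ssc.awaitTermination()
  }
}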
15. Internals
• Two main processes
– Receivers, in charge of the D-Stream creation
– Workers, in charge of the data processing
• These processes are autonomous & independent
– No cores & resources shared (see the note below)
– No information shared
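A practical consequence of the cores not being shared (this sizing advice is standard Spark Streaming guidance, not from the talk): each receiver permanently occupies one executor core, so the application must be given strictly more cores than it has receivers, or the workers are left with no cores to process on. A minimal sketch, with an illustrative core count:

import org.apache.spark.SparkConf

// With one receiver, "local[2]" is the minimum; "local[4]" leaves
// three cores free for the processing side.
val conf = new SparkConf().setAppName("etl").setMaster("local[4]")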
16. Execution Model – Receiving Data
[Diagram: when StreamingContext.start() is called, the Network Input Tracker in the Spark driver launches a Receiver on a Spark worker. Data received is pushed as blocks into that worker's Block Manager, the blocks are replicated to a second worker's Block Manager, and block locations are reported back to the Block Manager Master in the driver.]
17. Execution Model – Job Scheduling
[Diagram: the Network Input Tracker passes the received block IDs to the DStream Graph in the Spark driver, which turns them into RDDs. The Job Scheduler places the resulting jobs in the Job Queue; the Job Manager submits them through Spark's schedulers, and the jobs are executed on the worker nodes against the blocks held in each Block Manager.]
18. Use Case: Find True Love!
Build a recommender system, based on implicit and explicit data, to find the best matches for you
• Based on Machine Learning models
• Processed offline (batch)
• On big (bunches of) data
• Main goals of the streaming platform:
– Store a lot of data
– Clean it
– Transform it
19. Overview
[Diagram: Kafka topics → Data Receiver → Data Cleaning job → Data Modeling job, with HDFS storage between the stages, all running on the Spark cluster.]
• Spark in standalone mode
• 120 cores available on Spark
• 4.5 GB RAM per core
• Hadoop cluster (HDP 2.0) used for HDFS storage (10 TB)
• 8 machines (2 masters, 6 slaves)
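For illustration, a driver configuration matching those figures might look like the following; the master host name is a placeholder, only the 120-core figure comes from the slide, and the memory setting assumes one core per executor (real memory per executor depends on how many cores each executor gets):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("viadeo-etl")
  .setMaster("spark://spark-master:7077") // placeholder standalone master
  .set("spark.cores.max", "120")          // cap at the 120 available cores
  .set("spark.executor.memory", "4500m")  // ~4.5 GB, assuming one core per executor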
20. Job Details
[Same pipeline as the overview: Data Receiver → Data Cleaning job → Data Modeling job, with HDFS storage between the stages.]
• Data Receiver (see the sketch below)
– Uses the provided Kafka source
– Naive implementation: based on autocommit, with automatic offset management
• Data Cleaning job
– Cleaning with classic RDD transformations
– Persists the new RDDs: in HDFS for other (batch) Spark jobs, and in RAM to speed up the next step
• Data Modeling job
– Builds a binary matrix
– Scoring based on current events and history; the history is loaded from RDDs stored on HDFS
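A minimal Scala sketch of the naive receiver and cleaning steps described above, using the Kafka source that shipped with Spark Streaming at the time (high-level consumer, autocommitted offsets); the ZooKeeper address, topic name, HDFS path, and cleaning rule are all placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaEtl {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-etl"), Seconds(10))

    // Provided Kafka source: high-level consumer, offsets autocommitted
    // to ZooKeeper -- the "naive" part, no manual offset management.
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "etl-group", Map("events" -> 1))

    // Cleaning with classic RDD transformations (placeholder rule).
    val cleaned = stream.map(_._2).filter(_.nonEmpty)

    // Persist in RAM to speed up the next step...
    cleaned.persist(StorageLevel.MEMORY_ONLY)
    // ...and to HDFS for the batch jobs (one directory per batch).
    cleaned.saveAsTextFiles("hdfs:///etl/clean/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}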
21. Issues
• Data loss
– In the receiver phase, due to the naive Kafka consumer
– Needs a more robust client with manual offset management (vs. autocommit)
• The delights of (de)serialisation
– Kryo / Avro / Parquet…: not caused by Spark itself, but not easy either; the major issues appear during the import/export steps
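One common mitigation for the serialisation pain (a general technique, not shown in the deck) is to switch Spark to Kryo and register the payload classes explicitly; the UserEvent class and the registrator below are hypothetical stand-ins:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical event class standing in for the real payloads.
case class UserEvent(userId: Long, action: String)

class EtlKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[UserEvent]) // registered classes serialize much smaller
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[EtlKryoRegistrator].getName)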
22. And Beyond… @VIADEO
More than ETL: an analytics backend
[Diagram: RabbitMQ → Data Receiver → several Data Modeling jobs on the Spark cluster → Generic Index on an ElasticSearch cluster → D3.js webapp.]
23. Join the Viadeo Adventure
Wanted: Software Engineers
• We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more…
• We love full-stack engineers and a flat organization
• We work in autonomous product teams
• We lunch for free ;-)