This document discusses Spark Streaming and its use for near real-time ETL. It gives an overview of Spark Streaming, explains how it works internally, with receivers and workers processing streaming data, and walks through an example use case: building a recommender system that finds matches using both batch and streaming data. Key points include the streaming execution model, data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Paris Data Geek - Spark Streaming
1. Spark Streaming as Near Real-Time ETL
Paris Data Geek
18/09/2014
Djamel Zouaoui
@DjamelOnLine
2. Who am I?
Djamel Zouaoui
Director Of Engineering
@DjamelOnLine
#Data #Scala #RecSys #Tech #MachineLearning #NoSql #BigData #Spark #Dev #R #Architecture
3. What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
• Efficient
– General execution graphs
– In-memory storage
• Usable
– Rich APIs in Java, Scala, Python
– Interactive shell
4. RDD in Spark
• Resilient Distributed Dataset
• Storage abstraction for datasets in Spark
• Immutable
• Fault recovery
– Each RDD remembers how it was created, and can recover if any part of the data is lost
• 3 kinds of operations (see the sketch below)
– Transformations: lazy in nature; create a new dataset from an existing one
– Actions: return a value or export data after performing a computation
– Persistence: caching a dataset (on disk / RAM / mixed) for future operations
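To make the three kinds of operations concrete, here is a minimal Scala sketch (not from the slides; the file name and filter rule are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[2]"))

    val lines  = sc.textFile("events.log")          // base RDD (placeholder file)
    val errors = lines.filter(_.contains("ERROR"))  // transformation: lazy, returns a new RDD
    errors.persist(StorageLevel.MEMORY_AND_DISK)    // persistence: mixed RAM/disk caching

    println(errors.count())                         // action: triggers the actual computation
    println(errors.first())                         // second action reuses the cached data

    sc.stop()
  }
}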
10. What is Spark Streaming?
A project started in early 2012 that extends Spark for big data stream processing, and which:
• Scales to hundreds of nodes
• Achieves second-scale latencies
• Efficiently recovers from failures
• Integrates with batch and interactive processing
13. How does it work?
• Input source definition → input D-Stream
• D-Stream computations
– Window-level operations
– Stateful option
– …
• Classic RDD manipulation
– Transformations
– Actions
14. Code
TOPOLOGY FREE
//StreamingContext & Input source creation
//Standard transformations
//Window usage
//Start the streaming and put it in the background
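The code behind those four comments is not in the transcript; below is a minimal Scala sketch of the same four steps, assuming a socket source and illustrative batch and window sizes:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // StreamingContext & input source creation (10 s batch interval)
    val conf  = new SparkConf().setAppName("near-realtime-etl").setMaster("local[2]")
    val ssc   = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Standard transformations on the D-Stream
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Window usage: totals over the last 60 s, recomputed every 10 s
    val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
    counts.print()

    // Start the streaming (runs in the background) and wait
    ssc.start()
    ssc.awaitTermination()
  }
}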
15. Internals
• Two main processes
– Receivers, in charge of the D-Stream creation
– Workers, in charge of the data processing
• These processes are autonomous & independent
– No cores & resources shared (see the note below)
– No information shared
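A practical consequence of the cores not being shared (this sizing advice is standard Spark Streaming guidance, not from the talk): each receiver permanently occupies one executor core, so the application must be given strictly more cores than it has receivers, or the workers are left with no cores to process on. A minimal sketch, with an illustrative core count:

import org.apache.spark.SparkConf

// With one receiver, "local[2]" is the minimum; "local[4]" leaves
// three cores free for the processing side.
val conf = new SparkConf().setAppName("etl").setMaster("local[4]")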
16. Execution Model – Receiving Data
[Diagram: when StreamingContext.start() is called, the Network Input Tracker in the Spark driver launches a Receiver on a Spark worker. Data received is pushed as blocks into that worker's Block Manager, the blocks are replicated to a second worker's Block Manager, and block locations are reported back to the Block Manager Master in the driver.]
17. Execution Model – Job Scheduling
[Diagram: the Network Input Tracker passes the received block IDs to the DStream Graph in the Spark driver, which turns them into RDDs. The Job Scheduler places the resulting jobs in the Job Queue; the Job Manager submits them through Spark's schedulers, and the jobs are executed on the worker nodes against the blocks held in each Block Manager.]
18. Use Case: Find True Love!
Build a recommender system, based on implicit and explicit data, to find the best matches for you
• Based on Machine Learning models
• Processed offline (batch)
• On big (bunches of) data
• Main goals of the streaming platform:
– Store a lot of data
– Clean it
– Transform it
19. Overview
[Diagram: Kafka topics → Data Receiver → Data Cleaning job → Data Modeling job, with HDFS storage between the stages, all running on the Spark cluster.]
• Spark in standalone mode
• 120 cores available on Spark
• 4.5 GB RAM per core
• Hadoop cluster (HDP 2.0) used for HDFS storage (10 TB)
• 8 machines (2 masters, 6 slaves)
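For illustration, a driver configuration matching those figures might look like the following; the master host name is a placeholder, only the 120-core figure comes from the slide, and the memory setting assumes one core per executor (real memory per executor depends on how many cores each executor gets):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("viadeo-etl")
  .setMaster("spark://spark-master:7077") // placeholder standalone master
  .set("spark.cores.max", "120")          // cap at the 120 available cores
  .set("spark.executor.memory", "4500m")  // ~4.5 GB, assuming one core per executor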
20. Job Details
[Same pipeline as the overview: Data Receiver → Data Cleaning job → Data Modeling job, with HDFS storage between the stages.]
• Data Receiver (see the sketch below)
– Uses the provided Kafka source
– Naive implementation: based on autocommit, with automatic offset management
• Data Cleaning job
– Cleaning with classic RDD transformations
– Persists the new RDDs: in HDFS for other (batch) Spark jobs, and in RAM to speed up the next step
• Data Modeling job
– Builds a binary matrix
– Scoring based on current events and history; the history is loaded from RDDs stored on HDFS
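A minimal Scala sketch of the naive receiver and cleaning steps described above, using the Kafka source that shipped with Spark Streaming at the time (high-level consumer, autocommitted offsets); the ZooKeeper address, topic name, HDFS path, and cleaning rule are all placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaEtl {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-etl"), Seconds(10))

    // Provided Kafka source: high-level consumer, offsets autocommitted
    // to ZooKeeper -- the "naive" part, no manual offset management.
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "etl-group", Map("events" -> 1))

    // Cleaning with classic RDD transformations (placeholder rule).
    val cleaned = stream.map(_._2).filter(_.nonEmpty)

    // Persist in RAM to speed up the next step...
    cleaned.persist(StorageLevel.MEMORY_ONLY)
    // ...and to HDFS for the batch jobs (one directory per batch).
    cleaned.saveAsTextFiles("hdfs:///etl/clean/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}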
21. Issues
• Data loss
– In the receiver phase, due to the naive Kafka consumer
– Needs a more robust client with manual offset management (vs. autocommit)
• The delights of (de)serialisation
– Kryo / Avro / Parquet…: not caused by Spark itself, but not easy either; the major issues appear during the import/export steps
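One common mitigation for the serialisation pain (a general technique, not shown in the deck) is to switch Spark to Kryo and register the payload classes explicitly; the UserEvent class and the registrator below are hypothetical stand-ins:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical event class standing in for the real payloads.
case class UserEvent(userId: Long, action: String)

class EtlKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[UserEvent]) // registered classes serialize much smaller
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[EtlKryoRegistrator].getName)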
22. And Beyond… @VIADEO
More than ETL: an analytics backend
[Diagram: RabbitMQ → Data Receiver → several Data Modeling jobs on the Spark cluster → Generic Index on an ElasticSearch cluster → D3.js webapp.]
23. Join the Viadeo Adventure
Wanted: Software Engineers
• We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more…
• We love full-stack engineers and a flat organization
• We work in autonomous product teams
• We lunch for free ;-)