Spark Streaming As Near 
Realtime ETL 
Paris Data Geek 
18/09/2014 
Djamel Zouaoui 
@DjamelOnLine
Who am I ? 
Djamel Zouaoui 
Director Of Engineering 
@DjamelOnLine 
#Data 
#Scala 
#RecSys #Tech 
#MachineLearning 
#NoSql 
#BigData 
#Spark 
#Dev 
#R 
#Architecture
What is Spark? 
Fast and expressive cluster computing engine compatible with Apache Hadoop 
• Efficient 
  – General execution graphs 
  – In-memory storage 
• Usable 
  – Rich APIs in Java, Scala, Python 
  – Interactive shell
RDD in Spark 
• Resilient Distributed Dataset 
• Storage abstraction for dataset in Spark 
• Immutable 
• Fault recovery 
– Each RDD remembers how it was created, and can recover if any part of 
the data is lost 
• 3 kinds of operations 
– Transformations: lazy in nature, create a new dataset from an existing one 
– Actions: return a value or export data after performing a computation 
– Persistence: caching dataset (on Disk/Ram/Mixed) for future operations
sparkContext.textFile("hdfs://…") 
  .flatMap(line => line.split(" ")) 
  .map(word => (word, 1)) 
  .reduceByKey((a, b) => a + b) 
  .collect()
[Diagram, built up over three slides: the word count above forms a DAG: textFile → flatMap → map → reduceByKey → collect. Spark splits it at the shuffle into Stage 1 (textFile, flatMap, map) and Stage 2 (reduceByKey, collect), each stage running over the partitions of the data.]
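The word count above exercises transformations and an action; persistence, the third kind of operation, is not shown. A minimal sketch of caching a reused RDD (the path and storage level are illustrative, not from the talk):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[2]", "persist-sketch")

// Hypothetical dataset reused by two jobs: persist it so it is computed only once
val words = sc.textFile("hdfs:///tmp/events.txt")      // illustrative path
  .flatMap(line => line.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)            // mixed RAM/disk caching; words.cache() would be RAM only

val counts     = words.map(word => (word, 1)).reduceByKey(_ + _).collect() // first action: computes and caches words
val totalChars = words.map(_.length).reduce(_ + _)                         // second action: reuses the cached RDD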
Ecosystem 
• Spark SQL: RDD-based tables 
• Spark Streaming: DStreams, i.e. streams of RDDs 
• GraphX: RDD-based graphs 
• MLlib: RDD-based matrices 
All built on the Spark RDD API, reading from HDFS, S3 or Cassandra, and running on YARN, Mesos or standalone.
What is Spark Streaming? 
Project started in early 2012; it extends Spark for big data stream processing that: 
• Scales to hundreds of nodes 
• Achieves second-scale latencies 
• Efficiently recovers from failures 
• Integrates with batch and interactive processing
How it works ? 
• Input source definition 
  – Input DStream 
• DStream computations 
  – Window level 
  – Stateful option (see the sketch below) 
  – … 
• Classic RDD manipulation 
  – Transformation 
  – Action
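As a hedged illustration of the stateful option (not code from the talk), a running count per key can be kept across batches with updateStateByKey; the source, batch interval and checkpoint directory are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10 s batches, arbitrary choice
ssc.checkpoint("/tmp/checkpoints")                   // stateful operations require checkpointing

val lines = ssc.socketTextStream("localhost", 9999)  // placeholder input DStream

// Running count per word, carried over from batch to batch
val updateCount = (newValues: Seq[Int], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)

lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey(updateCount)
  .print()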
Code 
TOPOLOGY FREE 
//StreamingContext & Input source creation 
//Standard transformations 
//Window usage 
//Start the streaming and put it in the background
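The slide keeps only the outline as comments. A minimal sketch following the same four steps (the source, parsing, window sizes and output are invented for illustration, not the code shown during the talk):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingContext & input source creation
val conf  = new SparkConf().setAppName("near-realtime-etl-sketch")
val ssc   = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)            // placeholder source

// Standard transformations
val events = lines.map(_.split("\t"))
  .filter(_.length > 1)
  .map(fields => (fields(0), 1))

// Window usage: totals over the last 60 s, recomputed every 10 s
events.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
  .print()

// Start the streaming and put it in the background: start() returns immediately,
// awaitTermination() blocks the driver thread
ssc.start()
ssc.awaitTermination()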
Internals 
• Two main processes 
– Receivers in charge of the D-Stream creation 
– Workers in charge of data processing 
• These processes are autonomous & independent 
– No cores & resources shared 
– No information shared
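A practical consequence, documented by Spark but not stated on the slide: each receiver permanently holds one core, so the application must be given more cores than it runs receivers. For example, when testing locally:

import org.apache.spark.SparkConf

// "local[1]" would starve the job: the receiver would take the only core and no
// core would be left for the workers that process the batches.
val conf = new SparkConf()
  .setAppName("receivers-vs-workers")
  .setMaster("local[4]")   // at least (number of receivers + 1) cores/threads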
Execution Model – Receiving Data 
[Diagram] StreamingContext.start() launches a Receiver on one of the Spark workers. Data received from the network is pushed as blocks into that worker's Block Manager, and the blocks are replicated to another worker's Block Manager; the Network Input Tracker in the driver (Spark Streaming + Spark driver) keeps track of the blocks through the Block Manager Master.
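The "blocks pushed / blocks replicated" steps of this diagram can be tuned; a hedged sketch of the relevant knobs (the values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiving-data-sketch")
  .set("spark.streaming.blockInterval", "200ms")   // how often received data is cut into blocks

val ssc = new StreamingContext(conf, Seconds(10))

// MEMORY_AND_DISK_SER_2 (the default for receivers) replicates each block to a second worker
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)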
Execution Model – Job Scheduling 
[Diagram] For each batch interval, the DStream Graph in the driver turns the block IDs reported by the Network Input Tracker into RDDs; the Job Scheduler generates jobs from them, the Job Manager queues them in its Job Queue, and Spark's schedulers execute the jobs on the worker nodes whose Block Managers hold the received blocks.
Use Case: Find The True Love ! 
Build a recommender system based on implicit and explicit data to find the best matches for you 
• Based on machine learning models 
• Processed offline (batch) 
• On big (bunch of) data 
• Main goals of the streaming platform: 
– Store a lot of data 
– Clean it 
– Transform it
Overview 
[Diagram] Kafka topics → Data Receiver → Data Cleaning job → HDFS storage → Data Modeling job → HDFS storage, all running on the Spark cluster. 
• Spark in standalone mode 
• 120 cores available on Spark 
• 4.5 GB RAM per core 
• Based on a Hadoop cluster for HDFS storage (10 TB) 
• HDP 2.0 
• 8 machines (2 masters, 6 slaves)
Job details 
[Diagram: same pipeline — Data Receiver → Data Cleaning job → Data Modeling job, with HDFS storage in between] 
Data Receiver 
• Use of the provided Kafka source 
• Naive implementation: 
– Based on autocommit 
– Automatic offset management 
Data Cleaning job 
• Cleaning with classic RDD transformations 
• Persist new RDDs: 
– In HDFS for other Spark jobs (batch) 
– In RAM to speed up the next step 
Data Modeling job 
• Binary matrix 
• Scoring based on current events and history 
– History is loaded from RDDs stored on HDFS
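A hedged sketch of what this receive → clean → persist pipeline could look like with the provided (receiver-based) Kafka source; the ZooKeeper address, topic, consumer group, cleaning logic and output path are all invented for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // requires the spark-streaming-kafka artifact

val conf = new SparkConf().setAppName("kafka-etl-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))

// Provided Kafka source: the high-level consumer commits offsets automatically ("naive" mode)
val events = KafkaUtils.createStream(
    ssc,
    "zookeeper-host:2181",             // hypothetical ZooKeeper quorum
    "etl-consumer-group",              // hypothetical consumer group
    Map("member-events" -> 2))         // hypothetical topic -> number of receiver threads
  .map(_._2)                           // keep only the message value

// Cleaning with classic RDD transformations, then persist for the next steps
events.foreachRDD { rdd =>
  val cleaned = rdd.map(_.trim).filter(_.nonEmpty)   // placeholder cleaning logic
  cleaned.cache()                                    // in RAM: reused by the two actions below
  cleaned.saveAsTextFile(s"hdfs:///data/clean/${System.currentTimeMillis}") // in HDFS for batch jobs
  println(s"cleaned events in this batch: ${cleaned.count()}")
}

ssc.start()
ssc.awaitTermination()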
Issues 
• Data loss 
– In the receiver phase, due to the naive Kafka consumer 
– Need a more robust client with manual offset management (vs autocommit) 
• The delights of (de)serialisation 
– Kryo / Avro / Parquet…: not caused by Spark itself, but not easy either 
Major issues occur during the import/export steps
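On the serialization side, one common mitigation (an illustration, not necessarily what was done here) is to switch Spark to Kryo and register the event classes explicitly:

import org.apache.spark.SparkConf

// Hypothetical event class flowing through the pipeline
case class MemberEvent(memberId: Long, action: String, timestamp: Long)

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MemberEvent]))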
And Beyond… @VIADEO 
More than ETL, an analytics backend 
[Diagram] RabbitMQ → Data Receiver → several Data Modeling jobs on the Spark cluster → Generic Index on the ElasticSearch cluster → D3.js webapp.
Join the Viadeo adventure 
Wanted: Software Engineers 
• We use Node.js, Spark, 
ElasticSearch, CQRS, AWS and 
many more… 
• We love FullStack Engineers and 
flat organization 
• We work in autonomous product 
teams 
• We lunch for free ;-)
QUESTIONS ?
