June 13
Agenda
Big Data Overview
Spark Overview
RDD Features
Spark Stack
Spark Streaming
BIG DATA OVERVIEW
Big Data -- Digital Data growth…
V-V-V
Use Cases
Real Time Feedback
Quick Recap: Hadoop Ecosystem
Legacy Architecture Pain Points
• Report arrival latency is quite high - hours to join and aggregate data
• Existing frameworks cannot do both:
• Either stream processing of 100s of MB/s with low latency
• Or batch processing of TBs of data with high latency
• Expressing business logic in Hadoop MR is challenging
• Lack of interactive SQL
SPARK OVERVIEW
Why Spark
Separate, fast, Map-Reduce-like engine
In-memory data storage for very fast iterative queries
Better Fault Tolerance
Combine SQL, Streaming and complex analytics
Runs on Hadoop, Mesos, standalone, or in the cloud
Data sources -> HDFS, Cassandra, HBase and S3
Consumed Apps…
In Memory - Spark vs Hadoop
Improves efficiency over MapReduce:
100x in memory, 2-10x on disk
Up to 40x faster than Hadoop
Spark Stack
Spark Eco System
RDBMS
Streaming
SQL
GraphX
BlinkDB
Hadoop Input Format
Apps
Distributions:
- CDH
- HDP
- MapR
- DSE
Tachyon
MLlib
Benchmarking
RESILIENT DISTRIBUTED DATASET (RDD)
Resilient Distributed Dataset (RDD)
Immutable + Distributed + Cacheable + Lazily evaluated
• Distributed collections of objects
• Can be cached in memory across cluster nodes
• Manipulated through various parallel operations
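A minimal sketch of the properties above in Scala (Spark 1.x era; the app name, master URL and data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// illustrative local setup
val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
val nums = sc.parallelize(1 to 1000000)    // distributed collection of objects
nums.cache()                               // cache partitions in memory across the cluster
println(nums.filter(_ % 2 == 0).count())   // parallel operation; nothing runs until count()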
QUICK DEMO
RDD Types
RDD
RDD Operation
Memory and Persistence
Dependencies Types
Task Scheduler, DAG
• Pipelines functions within a
stage
• Cache-aware data reuse &
locality
• Partitioning-aware to avoid
shuffles
rdd1.map(splitLines).filter(_.contains("ERROR"))   // pipelined within one stage: no shuffle
rdd2.map(splitLines).groupBy(key)                  // groupBy forces a shuffle (stage boundary)
rdd2.keyBy(key).join(rdd1.keyBy(key)).take(10)     // join keyed RDDs, fetch 10 rows to the driver
Fault Recovery & Checkpoints
• Efficient fault recovery using lineage
• Log one operation to apply to many elements (lineage)
• Recompute lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault recovery
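A minimal checkpointing sketch, assuming the SparkContext sc from the earlier demo (the checkpoint directory is a hypothetical path):

sc.setCheckpointDir("hdfs:///spark-checkpoints")   // hypothetical reliable storage path
var rdd = sc.parallelize(1 to 1000)
for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // lineage grows with every transformation
rdd.checkpoint()                           // cut the lineage by saving to reliable storage
rdd.count()                                // the checkpoint materializes on the first action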
SPARK CLUSTER
Cluster Support
• Standalone – a simple cluster manager included with Spark that
makes it easy to set up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications
• Hadoop YARN – the resource manager in Hadoop 2
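A sketch of how the cluster manager is selected through the master URL (host names and ports are placeholders; on YARN the master is usually set via spark-submit rather than in code):

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("cluster-demo")
conf.setMaster("spark://master-host:7077")    // Standalone (placeholder host)
// conf.setMaster("mesos://master-host:5050") // Apache Mesos (placeholder host)
// conf.setMaster("local[*]")                 // no cluster manager: all cores of one machine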
Spark Cluster Overview
o Application
o Driver program
o Cluster manager
o Worker node
o Executor
o Task
o Job
o Stage
Spark On Mesos
Spark on YARN
Job Flow
SPARK STACK DETAILS
Spark SQL
• Seamlessly mix SQL queries with Spark programs
• Load and query data from a variety of sources
• Standard connectivity through JDBC/ODBC
• Hive Compatibility
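A minimal Spark SQL sketch against the Spark 1.3-era DataFrame API, reusing the SparkContext sc from earlier (people.json is a hypothetical file with name and age fields):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")   // hypothetical JSON input
people.registerTempTable("people")                // expose the DataFrame to SQL
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.collect().foreach(println)                  // mix SQL results with regular Spark code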
Streaming
• Scalable, high-throughput stream processing of live data
• Integrates with many sources
• Fault-tolerant: stateful exactly-once semantics out of the box
• Combine streaming with batch and interactive queries
MLlib
• Scalable machine learning library
• Iterative computing -> high-quality algorithms, up to 100x faster than Hadoop MapReduce
• Algorithms (MLlib 1.3):
• linear SVM and logistic regression
• classification and regression tree
• random forest and gradient-boosted trees
• recommendation via alternating least squares
• clustering via k-means, Gaussian mixtures
• power iteration clustering
• topic modeling via latent Dirichlet allocation
• singular value decomposition
• linear regression with L1- and L2-regularization
• isotonic regression
• multinomial naive Bayes
• frequent itemset mining via FP-growth
• basic statistics
• feature transformations
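A minimal MLlib 1.3-style k-means sketch (the input path and its space-separated numeric format are assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")   // hypothetical file: space-separated doubles per line
val points = data.map(l => Vectors.dense(l.split(' ').map(_.toDouble))).cache()
val model = KMeans.train(points, 2, 20)     // k = 2 clusters, 20 iterations
println(model.clusterCenters.mkString(", "))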
GraphX- Unifying Graphs and Tables
• Spark’s API for graphs and graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached to each vertex and edge
• Seamlessly works with both graphs and collections
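A minimal GraphX sketch of the property-graph abstraction (vertex and edge data are made up):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)   // directed multigraph with vertex and edge properties
println(graph.inDegrees.collect().mkString(", "))   // collection view of a graph computation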
SPARK STREAMING
Spark Streaming
Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes
them using RDD operations
• Finally, the processed results of the RDD operations are
returned in batches
DStream (Discretized Stream)
A DStream is represented by a continuous series of RDDs
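A minimal DStream word-count sketch (socket host/port and the 10-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("stream-demo"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder socket source
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()   // every 10-second batch is an RDD processed with RDD-style operations
ssc.start()
ssc.awaitTermination()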
Micro Batch (Near Real Time)
Micro Batch
Window Operation & Checkpoint
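Continuing the word-count sketch above, a hedged example of a sliding window; the incremental reduceByKeyAndWindow variant requires a checkpoint directory (the path is hypothetical):

ssc.checkpoint("hdfs:///stream-checkpoints")   // hypothetical checkpoint directory
val windowed = counts.reduceByKeyAndWindow(
  _ + _,         // add counts entering the window
  _ - _,         // subtract counts leaving the window
  Seconds(30),   // window length
  Seconds(10))   // slide interval
windowed.print()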
Streaming Fault Tolerance
Spark Streaming + SQL
Streaming
SQL
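One common way to combine the two, sketched with foreachRDD on the word-count stream from earlier (table and column names are illustrative):

import org.apache.spark.sql.SQLContext

counts.foreachRDD { rdd =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF("word", "total").registerTempTable("word_counts")   // illustrative table
  sqlContext.sql("SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10").show()
}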
Quick Run Spark UI
Spark with Storm
Quick Recap
• Why Spark? Spark Features?
• What is RDD?
• Fault-tolerance model
• Spark Extensions/Stack?
• Micro-batch?
Clients…
Thank You….
Additional Slides
BDAS - Berkeley Data Analytics
Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that
integrates software components being built by the AMPLab to make sense of Big Data.
Optimization
• groupByKey is costlier - prefer reduceByKey(), which combines values map-side before the shuffle
• RDD storage level MEMORY_ONLY is better when the data fits in memory
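A small sketch of both points (data is made up):

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val slow = pairs.groupByKey().mapValues(_.sum)   // shuffles every record across the network
val fast = pairs.reduceByKey(_ + _)              // combines per partition before the shuffle
fast.persist(StorageLevel.MEMORY_ONLY)           // the same level cache() uses: deserialized, in memory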
Big Data Landscape
RDDs vs Distributed Shared Mem
SQL Optimization

Editor's Notes

  1. http://www.meetup.com/devops-bangalore/events/222155834/ http://www.meetup.com/lspe-in/events/212250542/
  2. http://www.business-software.com/wp-content/uploads/2014/09/Spark-Storm.jpg
  3. https://spark-summit.org/2013/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
  4. Spark Summit 2015 - sample slides
  5. http://opensource.com/business/15/1/apache-spark-new-world-record
  6. Transformations (e.g., map, filter, groupBy): create a new dataset from an existing one. Actions (e.g., count, collect, save): return a value to the driver program after running a computation on the dataset.
  7. Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning algorithms (MLlib) and graph processing algorithms (GraphX) on data streams.
  8. Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches
  9. https://spark.apache.org/docs/latest/streaming-programming-guide.html
  10. https://www.sigmoid.com/fault-tolerant-streaming-workflows/
  11. http://www.aerospike.com/blog/what-the-spark-introduction/