Glint with Apache Spark

Agenda
• Big Data Overview
• Spark Overview
• RDD Features
• Spark Stack
• Spark Streaming

Big Data -- Digital Data growth…

Legacy Architecture Pain Points
• Report arrival latency is high: joins and aggregations take hours
• Existing frameworks cannot do both of the following:
• Either stream processing of 100s of MB/s with low latency
• Or batch processing of TBs of data with high latency
• Business logic is hard to express in Hadoop MR
• Lack of interactive SQL

Why Spark
Separate, fast, Map-Reduce-like engine
In-memory data storage for very fast iterative queries
Better Fault Tolerance
Combine SQL, Streaming and complex analytics
Runs on Hadoop, Mesos, standalone, or in the cloud
Data sources -> HDFS, Cassandra, HBase and S3

In Memory - Spark vs Hadoop
Improve efficiency over MapReduce
Up to 100x faster in memory, 2-10x faster on disk
Up to 40x faster than Hadoop

Spark Ecosystem
• Components: Streaming, SQL, GraphX, MLlib, BlinkDB, Tachyon
• Data access: RDBMS, Hadoop Input Format
• Apps
• Distributions: CDH, HDP, MapR, DSE

Resilient Distributed Dataset (RDD)
Immutable + Distributed + Cacheable + Lazily evaluated
• Distributed collections of objects
• Can be cached in memory across cluster nodes
• Manipulated through various parallel operations

Task Scheduler, DAG
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
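
Pipelining within a stage can be pictured with ordinary Python generators. This is only a sketch of the idea, not the Spark API: the sample lines, `split_line`, `is_error`, and `trace` are hypothetical names made up for illustration. Each element flows through the map step and the filter step before the next element is read, and nothing runs until a result is requested.

```python
# Plain-Python sketch of pipelining within a stage (not the Spark API).
# trace records the order in which each function touches each element.
trace = []

def split_line(line):
    trace.append(("map", line))
    return line.split(" ", 1)          # [level, message]

def is_error(fields):
    trace.append(("filter", fields[0]))
    return fields[0] == "ERROR"

# Hypothetical input data for illustration.
lines = ["ERROR disk full", "INFO started", "ERROR timeout"]

# Generators are lazy: no function is called while the pipeline is built.
mapped = (split_line(line) for line in lines)
errors = (fields for fields in mapped if is_error(fields))
calls_before_action = len(trace)       # still 0 at this point

# Requesting the result plays the role of an action and runs the pipeline.
result = list(errors)

print(result)
print(trace)
```

The trace alternates between map and filter calls, one element at a time, so no intermediate collection of all mapped lines is ever built.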

Fault Recovery & Checkpoints
• Efficient fault recovery using lineage
• Log one operation to apply to many elements (lineage)
• Recompute lost partitions on failure
• Checkpoint RDDs to prevent long lineage chains during fault recovery
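
The lineage idea above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: the `LineageRDD` class and its `compute`, `lose`, and `checkpoint` methods are hypothetical names invented for this example. Each dataset stores only its parent and one function, and a lost partition is rebuilt by replaying that function over the parent's partition.

```python
class LineageRDD:
    """Toy dataset that remembers how each partition was derived."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source_partitions = source_partitions  # set only on a root
        self.parent = parent
        self.fn = fn              # one logged operation for many elements
        self.materialized = {}    # partition index -> computed list

    def map(self, fn):
        return LineageRDD(parent=self, fn=fn)

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source_partitions)

    def compute(self, index):
        if index not in self.materialized:
            if self.parent is None:
                data = list(self.source_partitions[index])
            else:
                # Replay the logged operation over the parent's partition.
                data = [self.fn(x) for x in self.parent.compute(index)]
            self.materialized[index] = data
        return self.materialized[index]

    def lose(self, index):
        """Simulate losing one in-memory partition."""
        self.materialized.pop(index, None)

    def checkpoint(self):
        """Save all partitions and cut the lineage chain."""
        saved = [list(self.compute(i)) for i in range(self.num_partitions())]
        self.source_partitions, self.parent, self.fn = saved, None, None


base = LineageRDD(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = list(plus_one.compute(1))
plus_one.lose(1)
doubled.lose(1)
recovered = plus_one.compute(1)   # rebuilt from base through the lineage
print(first, recovered)
```

Calling `plus_one.checkpoint()` in this sketch turns `plus_one` into a root, so a later recovery no longer has to walk back through `doubled` and `base`.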

Cluster Support
• Standalone – a simple cluster manager included with Spark that
makes it easy to set up a cluster
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications
• Hadoop YARN – the resource manager in Hadoop 2

Spark Cluster Overview
o Application
o Driver program
o Cluster manager
o Worker node
o Executor
o Task
o Job
o Stage

Spark SQL
• Seamlessly mix SQL queries with Spark programs
• Load and query data from a variety of sources
• Standard connectivity through JDBC/ODBC
• Hive Compatibility

Streaming
• Scalable, high-throughput stream processing of live data
• Integrates with many sources
• Fault-tolerant: stateful exactly-once semantics out of the box
• Combines streaming with batch and interactive queries

MLlib
• Scalable machine learning library
• Iterative computation -> high-quality algorithms, up to 100x faster than Hadoop MapReduce
• Algorithms (MLlib 1.3):
• linear SVM and logistic regression
• classification and regression tree
• random forest and gradient-boosted trees
• recommendation via alternating least squares
• clustering via k-means, Gaussian mixtures, and power iteration clustering
• topic modeling via latent Dirichlet allocation
• singular value decomposition
• linear regression with L1- and L2-regularization
• isotonic regression
• multinomial naive Bayes
• frequent itemset mining via FP-growth
• basic statistics
• feature transformations

GraphX - Unifying Graphs and Tables
• Spark's API for graphs and graph-parallel computation
• Graph abstraction: a directed multigraph with properties attached to each vertex and edge
• Works seamlessly with both graphs and collections

Batches…
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes
them using RDD operations
• Finally, the processed results of the RDD operations are
returned in batches
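
The batching steps above can be sketched in plain Python. This is an illustration only and does not use the Spark Streaming API: `chop_into_batches`, the sample events, and the 1-second interval are assumptions made for the example. Each batch is an ordinary list standing in for an RDD, and the same per-batch operation is applied to every batch.

```python
from collections import defaultdict

def chop_into_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals with no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[i] for i in sorted(batches)]

# Hypothetical live stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]

batches = chop_into_batches(events, batch_seconds=1)

# Apply the same batch operation to every micro-batch, here a simple count.
counts_per_batch = [len(batch) for batch in batches]

print(batches)
print(counts_per_batch)
```

A smaller `batch_seconds` lowers latency but produces more, smaller batches, which is the trade-off behind the "near real time" label.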

DStream (Discretized Stream)
A DStream is represented as a continuous series of RDDs

Micro Batch (Near Real Time)

Spark Streaming + SQL

Quick Recap
• Why Spark? Spark Features?
• What is RDD?
• Fault-tolerance model
• Spark Extensions/Stack?
• Micro-batch?

BDAS - Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that
integrates software components being built by the AMPLab to make sense of Big Data.

Optimization
• groupBy is costlier because it shuffles every record - use map() or reduceByKey() instead
• RDD storage level MEMORY_ONLY (the default) is the most CPU-efficient when the data fits in memory
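
Why reduceByKey() tends to be cheaper than grouping can be shown with a plain-Python model of two partitions. This is a sketch under assumptions, not Spark code: `group_then_sum`, `reduce_by_key`, and the sample data are hypothetical, and "shuffled" simply counts the records that would cross partition boundaries.

```python
from collections import defaultdict

# Hypothetical (key, value) records spread over two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def group_then_sum(partitions):
    """Group first: every record is shuffled before any summing happens."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)

def reduce_by_key(partitions):
    """Combine inside each partition first, then shuffle the partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value            # map-side combine
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key(partitions)

print(grouped_totals, grouped_shuffled)
print(reduced_totals, reduced_shuffled)
```

Both approaches produce the same totals, but the combine-first version moves at most one record per key per partition, so the gap widens as keys repeat more often.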

RDDs vs Distributed Shared Memory
