DEVOPS ADVANCED CLASS
March 2015: Spark Summit East 2015
http://slideshare.net/databricks
www.linkedin.com/in/blueplastic
making big data simple
Databricks Cloud:
“A unified platform for building Big Data pipelines
– from ETL to Exploration and Dashboards, to
Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~50 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Cloudera
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
The Databricks team contributed more than 75% of the code added to Spark in the past year
AGENDA
Before Lunch:
• History of Spark
• RDD fundamentals
• Spark Runtime Architecture
  • Integration with Resource Managers (Standalone, YARN)
• GUIs
• Lab: DevOps 101
After Lunch:
• Memory and Persistence
• Jobs -> Stages -> Tasks
• Broadcast Variables and Accumulators
• PySpark
• DevOps 102
• Shuffle
• Spark Streaming

Some slides will be skipped
Please keep Q&A low during class
(5pm – 5:30pm for Q&A with instructor)
2 anonymous surveys: Pre and Post class
Lunch: noon – 1pm
2 breaks (before lunch and after lunch)
• AMPLab project was launched in Jan 2011, 6 year planned duration
• Personnel: ~65 students, postdocs, faculty & staff
• Funding from Government/Industry partnership, NSF Award, Darpa, DoE,
20+ companies
• Created BDAS, Mesos, SNAP. Upcoming projects: Succinct & Velox.
“Unknown to most of the world, the University of California, Berkeley’s AMPLab
has already left an indelible mark on the world of information technology, and
even the web. But we haven’t yet experienced the full impact of the
group[…] Not even close”
- Derrick Harris, GigaOm, Aug 2014
[Diagram: Algorithms, Machines, People - scheduling, monitoring, distributing.]
[Diagram: the Spark / BDAS stack - Streaming, SQL, GraphX, MLlib, DataFrames API, and Tachyon, reading from RDBMSs and any Hadoop Input Format, with apps on top. Available in distributions: CDH, HDP, MapR, DSE.]

General Batch Processing (2004 – 2013)
Specialized Systems (2007 – 2015?) (iterative, interactive, ML, streaming, graph, SQL, etc):
Pregel, Dremel, Impala, GraphLab, Giraph, Drill, Tez, S4, Storm, Mahout
General Unified Engine (2014 – ?)
Aug 2009
Source: openhub.net
...in June 2013


10x – 100x
Memory: 10 GB/s to the CPUs
SSD: ~600 MB/s sequential, 0.1 ms random access, $0.45 per GB
Spinning disk: ~100 MB/s sequential, 3-12 ms random access, $0.05 per GB
Network, nodes in the same rack: 1 Gb/s or 125 MB/s
Network, nodes in another rack: 0.1 Gb/s
June 2010
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.

RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
April 2012
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
“We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude.”
“Best Paper Award and Honorable Mention for Community Award”
- NSDI 2012
- Cited 392 times!

TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
- 2 Spark Streaming papers have been cited 138 times
# Spark SQL from PySpark
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)
Seamlessly mix SQL queries with Spark programs.
Coming soon!
(Will be published in the upcoming
weeks for SIGMOD 2015)
graph = Graph(vertices, edges)
messages = spark.textFile("hdfs://...")
graph2 = graph.joinVertices(messages) {
(id, vertex, msg) => ...
}
https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf

http://shop.oreilly.com/product/0636920028512.do
eBook: $33.99
Print: $39.99
PDF, ePub, Mobi, DAISY
Shipping now!
http://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624
$30 @ Amazon:
http://tinyurl.com/dsesparklab
- 102 pages
- DevOps style
- For complete beginners
- Includes:
- Spark Streaming
- Dangers of GroupByKey vs. ReduceByKey
http://tinyurl.com/cdhsparklab
- 109 pages
- DevOps style
- For complete beginners
- Includes:
- PySpark
- Spark SQL
- Spark-submit

(Scala & Python only)
[Diagram: a Driver Program talking to Executors (Ex) on Worker Machines (W); each Executor holds RDD partitions and runs Tasks (T).]

[Diagram: a collection of 25 items (item-1 … item-25) parallelized into RDD partitions spread across Executors (Ex) on Workers (W).]
more partitions = more parallelism
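The partition count can be set explicitly when an RDD is created. A minimal sketch (standard SparkContext methods; the path and numbers are just illustrative):

// Controlling and inspecting the number of partitions
val nums  = sc.parallelize(1 to 100, 8)            // explicitly ask for 8 partitions
val lines = sc.textFile("/path/to/README.md", 8)   // at least 8 partitions for the file
println(nums.partitions.length)                    // => 8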
logLinesRDD: an RDD w/ 4 partitions holding log lines:
Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1

An RDD can be created 2 ways:
- Parallelize a collection
- Read data from an external source (S3, C*, HDFS, etc)
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])
// Parallelize in Scala
val wordsRDD= sc.parallelize(List("fish", "cats", "dogs"))
// Parallelize in Java
JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList("fish", "cats", "dogs"));
- Take an existing in-memory collection and pass it to SparkContext's parallelize method
- Not generally used outside of prototyping and testing since it requires the entire dataset in memory on one machine
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
- There are other methods to read data from HDFS, C*, S3, HBase, etc. (a short sketch follows below)
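A rough sketch of a few of those other inputs (the URIs below are placeholders, not real endpoints): sc.textFile accepts any Hadoop-compatible URI, and sc.wholeTextFiles returns (filename, content) pairs.

// Reading from HDFS and S3 (placeholder URIs)
val hdfsLines = sc.textFile("hdfs://namenode:8020/path/to/logs/*.log")
val s3Lines   = sc.textFile("s3n://my-bucket/path/to/data.txt")
val filePairs = sc.wholeTextFiles("hdfs://namenode:8020/path/to/small-files/")   // RDD of (filename, content)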

logLinesRDD (input/base RDD):
Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1

  .filter( )

errorsRDD:
Error, ts, msg1 | Error, ts, msg1 | Error, ts, msg3 | Error, ts, msg4 | Error, ts, msg1

  .coalesce( 2 )

cleanedRDD:
Error, ts, msg1 | Error, ts, msg3 | Error, ts, msg1
Error, ts, msg4 | Error, ts, msg1

  .collect( )

Execute DAG! The results are returned to the Driver.

.collect( ) on cleanedRDD triggers the whole chain:

logLinesRDD -> .filter( ) -> errorsRDD -> .coalesce( 2, shuffle = false ) -> cleanedRDD -> .collect( ) -> data at the Driver

Because every step in this chain is a narrow dependency (no shuffle is needed), Spark pipelines the whole chain into a single stage (Stage-1).
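You can see the pipelining for yourself with RDD.toDebugString, which prints the lineage the DAG scheduler will turn into stages. A minimal sketch on the running example (the input path and filter predicate are hypothetical):

// Inspecting lineage before running an action
val logLinesRDD = sc.textFile("hdfs://.../log.txt")
val errorsRDD   = logLinesRDD.filter(_.startsWith("Error"))
val cleanedRDD  = errorsRDD.coalesce(2, shuffle = false)
println(cleanedRDD.toDebugString)   // coalesce -> filter -> textFile, all in one pipelined lineage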

logLinesRDD -> .filter( ) -> errorsRDD -> .coalesce( 2 ) -> cleanedRDD:
Error, ts, msg1 | Error, ts, msg3 | Error, ts, msg1 | Error, ts, msg4 | Error, ts, msg1

cleanedRDD is then reused by several operations:
  .filter( )          -> errorMsg1RDD: Error, ts, msg1 | Error, ts, msg1 | Error, ts, msg1
  .collect( )
  .count( )           -> 5
  .saveToCassandra( )
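Putting the running example together as code. This is a sketch: the input path and predicates are illustrative, and saveToCassandra comes from the DataStax spark-cassandra-connector rather than Spark core.

// The running example, end to end
val logLinesRDD  = sc.textFile("hdfs://.../log.txt")            // base/input RDD
val errorsRDD    = logLinesRDD.filter(_.startsWith("Error"))    // keep only the error lines
val cleanedRDD   = errorsRDD.coalesce(2)                        // shrink to 2 partitions
val errorMsg1RDD = cleanedRDD.filter(_.contains("msg1"))        // just the msg1 errors

cleanedRDD.collect()                                            // bring the results back to the driver
cleanedRDD.count()                                              // => 5 for the example data above
// errorMsg1RDD.saveToCassandra("ks", "mytable")                // needs the DataStax connector on the classpath
//                                                              // and a matching table schema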
Dataset-level view vs. Partition-level view:
- logLinesRDD (HadoopRDD): partitions P-1, P-2, P-3, P-4
- errorsRDD (FilteredRDD): partitions P-1, P-2, P-3, P-4
- Task-1, Task-2, Task-3, Task-4: each task computes one errorsRDD partition from the matching logLinesRDD partition
  (Path = hdfs://. . . , func = _.contains(…), shouldCache=false)
1) Create some input RDDs from external data or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using transformations like filter() or map()
3) Ask Spark to cache() any intermediate RDDs that will need to be reused.
4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark. (A short end-to-end sketch follows below.)
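A minimal end-to-end sketch of those four steps (the path and predicate are placeholders):

// The typical RDD workflow
val input  = sc.textFile("hdfs://.../log.txt")        // 1) create an input RDD
val errors = input.filter(_.startsWith("Error"))      // 2) lazily define a new RDD
errors.cache()                                        // 3) mark it for reuse
println(errors.count())                               // 4) first action computes and caches it
errors.take(10).foreach(println)                      //    a second action reuses the cached data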

map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
(lazy)
- Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations (see the mapPartitions sketch below)
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
saveToCassandra() ...
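As a sketch of the element-wise vs. per-partition distinction noted above (standard RDD API; the logic is illustrative): map() runs once per element, while mapPartitions() sees an entire partition at a time, which is handy for per-partition setup such as opening a connection.

// Element-wise map vs. per-partition mapPartitions
val nums    = sc.parallelize(1 to 10, 2)
val doubled = nums.map(_ * 2)                 // element-wise: called once per element
val sums    = nums.mapPartitions { iter =>    // per-partition: called once per partition
  // expensive setup shared by the whole partition could go here
  Iterator(iter.sum)
}
println(sums.collect().toList)                // => List(15, 40): sums of 1..5 and 6..10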
• HadoopRDD
• FilteredRDD
• MappedRDD
• PairRDD
• ShuffledRDD
• UnionRDD
• PythonRDD
• DoubleRDD
• JdbcRDD
• JsonRDD
• SchemaRDD
• VertexRDD
• EdgeRDD
• CassandraRDD (DataStax)
• GeoRDD (ESRI)
• EsSpark (ElasticSearch)

1) Set of partitions (“splits”)
2) List of dependencies on parent RDDs
3) Function to compute a partition given parents
4) Optional preferred locations
5) Optional partitioning info for k/v RDDs (Partitioner)
This captures all current Spark operations! (A rough sketch of this interface appears after the examples below.)
Partitions = one per HDFS block
Dependencies = none
Compute (partition) = read corresponding block
preferredLocations (part) = HDFS block location
Partitioner = none
Partitions = same as parent RDD
Dependencies = “one-to-one” on parent
Compute (partition) = compute parent and filter it
preferredLocations (part) = none (ask parent)
Partitioner = none

Partitions = One per reduce task
Dependencies = “shuffle” on each parent
Compute (partition) = read and join shuffled data
preferredLocations (part) = none
Partitioner = HashPartitioner(numTasks)
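These five properties correspond to the small interface every RDD implements internally. A rough sketch of its shape (simplified; the names approximate Spark's internal Scala API and are not exact signatures):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Approximate shape of the RDD interface (simplified sketch, not Spark's real class)
abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]                                 // 1) set of partitions ("splits")
  def getDependencies: Seq[Dependency[_]]                             // 2) dependencies on parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]    // 3) compute a partition from its parents
  def getPreferredLocations(split: Partition): Seq[String] = Nil      // 4) optional preferred locations
  val partitioner: Option[Partitioner] = None                         // 5) optional partitioning info for k/v RDDs
}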
val cassandraRDD = sc
  .cassandraTable("ks", "mytable")
  .select("col-1", "col-3")
  .where("col-5 = ?", "blue")
Keyspace, Table: server-side column & row selection
Start the Spark shell by passing in a custom cassandra.input.split.size:

ubuntu@ip-10-0-53-24:~$ dse spark -Dspark.cassandra.input.split.size=2000
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
Creating SparkContext...
Created spark context..
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.

scala>
The cassandra.input.split.size parameter defaults to 100,000. This is the approximate number of physical rows in a single Spark partition. If you have really wide rows (thousands of columns), you may need to lower this value. The higher the value, the fewer Spark tasks are created; increasing the value too much may limit the parallelism level. (for dealing with wide rows)
https://github.com/datastax/spark-cassandra-connector
[Diagram: inside each Spark Executor, the Spark-C* Connector sits on top of the C* Java Driver.]
- Open Source
- Implemented mostly in Scala
- Scala + Java APIs
- Does automatic type conversions

https://github.com/datastax/spark-cassandra-connector
“Simple things should be simple, complex things should be possible” - Alan Kay
DEMO:


- Local
- Standalone Scheduler
- YARN
- Mesos

Static Partitioning vs. Dynamic Partitioning

History: [Diagram of Hadoop MapReduce v1 - a JobTracker (JT) schedules Map (M) and Reduce (R) tasks onto TaskTrackers (TT) running next to DataNodes (DN) on each machine's OS, with the NameNode (NN) tracking HDFS metadata.]
[Diagram: Local mode - a single JVM acts as both Executor and Driver, holding RDD partitions (P1, P2, P3) and running Tasks on internal threads mapped to the machine's CPUs, with local disk underneath.]

3 options:
- local
- local[N]
- local[*]
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("MyFirstApp")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)
> ./bin/spark-shell --master local[12]
> ./bin/spark-submit --name "MyFirstApp"
--master local[12] myApp.jar

[Diagram: Spark Standalone cluster - the Spark Master coordinates Workers (W); each worker machine runs an Executor (Ex) JVM that holds RDD partitions (P1 … P8) and runs Tasks (T) on internal threads, with an OS disk plus SSDs for local scratch space. The Driver runs alongside. Each machine can have a different spark-env.sh (e.g. SPARK_WORKER_CORES, SPARK_LOCAL_DIRS).]

> ./bin/spark-submit --name "SecondApp"
    --master spark://host1:port1
    myApp.jar
[Diagram: the same Standalone cluster with Spark Master high availability - "I'm HA via ZooKeeper"; more Masters can be added live.]

> ./bin/spark-submit --name "SecondApp"
    --master spark://host1:port1,host2:port2
    myApp.jar
[Diagram: Standalone cluster running multiple apps - each Driver gets its own set of Executors (Ex) on the Workers (W), scheduled by the Spark Master.]


[Diagram: Standalone cluster running a single app - one Driver with an Executor (Ex) on every Worker (W); a machine may run more than one Worker.]
conf/spark-env.sh:
- SPARK_WORKER_INSTANCES: [default: 1] # of worker instances to run on each machine
- SPARK_WORKER_CORES: [default: ALL] # of cores to allow Spark applications to use on the machine
- SPARK_WORKER_MEMORY: [default: TOTAL RAM – 1 GB] Total memory to allow Spark applications to use on the machine
- SPARK_DAEMON_MEMORY: [default: 512 MB] Memory to allocate to the Spark master and worker daemons themselves
Standalone settings
- Apps submitted will run in FIFO mode by default
spark.cores.max: maximum amount of CPU cores to request for the
application from across the cluster
spark.executor.memory: Memory for each executor
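A sketch of setting those two per-application limits, either in code or at submit time (the master host and the numbers are placeholders):

// Capping a standalone application's resources
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://host1:7077")        // standalone master (placeholder host)
  .setAppName("SecondApp")
  .set("spark.cores.max", "24")           // total cores this app may use across the cluster
  .set("spark.executor.memory", "3g")     // memory per executor
val sc = new SparkContext(conf)

or, equivalently, on the command line:

> ./bin/spark-submit --name "SecondApp"
    --master spark://host1:7077
    --conf spark.cores.max=24
    --conf spark.executor.memory=3g
    myApp.jar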



Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides

Recommended for you

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
YARN architecture (diagram): Client #1 submits an application to the Resource Manager, which launches an App Master inside a container on a NodeManager; the App Master then negotiates further containers from the Resource Manager for the application (steps 1-8 in the diagram).
A second diagram adds Client #2: each application gets its own App Master and containers, and the Resource Manager (Scheduler + Apps Master components) is highly available via ZooKeeper ("I'm HA via ZooKeeper").

YARN client mode (diagram): the Driver runs inside Client #1's process; the App Master runs in a container on a NodeManager only to negotiate resources, and Executors (each holding RDD partitions and running tasks) run in containers on the NodeManagers.
YARN cluster mode (diagram): the Driver runs inside the App Master's container on a NodeManager, with Executors in additional containers.
- Does not support Spark Shells
YARN settings
--num-executors: controls how many executors will be allocated
--executor-memory: RAM for each executor
--executor-cores: CPU cores for each executor
Dynamic Allocation:
spark.dynamicAllocation.enabled
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (N)
spark.dynamicAllocation.schedulerBacklogTimeout (M)
spark.dynamicAllocation.executorIdleTimeout (K)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
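A hedged sketch (not from the deck) of how these dynamic-allocation properties might be set in code; the executor counts and timeouts are illustrative, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative dynamic allocation settings for YARN; all values are examples.
val conf = new SparkConf()
  .setAppName("DynamicAllocationExample")                                 // hypothetical app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5")            // M: seconds of backlog before requesting executors
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5")   // N: interval between subsequent requests
  .set("spark.dynamicAllocation.executorIdleTimeout", "60")               // K: seconds idle before an executor is released
// Dynamic allocation also needs the external shuffle service (covered later in the shuffle section).

val sc = new SparkContext(conf)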
YARN resource manager UI: http://<ip address>:8088
(No apps running)

[ec2-user@ip-10-0-72-36 ~]$ spark-submit --class
org.apache.spark.examples.SparkPi --deploy-mode client --master yarn
/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-
examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar 10
App running in client mode
[ec2-user@ip-10-0-72-36 ~]$ spark-submit --class
org.apache.spark.examples.SparkPi --deploy-mode cluster --master
yarn /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-
examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar 10

App running in cluster mode

Mode         Spark Central Master   Who starts Executors?   Tasks run in
Local        [none]                 Human being             Executor
Standalone   Standalone Master      Worker JVM              Executor
YARN         YARN App Master        Node Manager            Executor
Mesos        Mesos Master           Mesos Slave             Executor
spark-submit provides a uniform interface for
submitting jobs across all cluster managers
bin/spark-submit --master spark://host:7077
--executor-memory 10g
my_script.py
Source: Learning Spark
Diagram: an Executor JVM holding cached RDD partitions (RDD P1, P2) and running multiple task threads, plus its internal threads.
Recommended to use at most 75% of a machine's memory for Spark
Minimum Executor heap size should be 8 GB
Max Executor heap size depends… maybe 40 GB (watch GC)
Memory usage is greatly affected by storage level and
serialization format

RDD.cache() == RDD.persist(MEMORY_ONLY)
Stored in the JVM heap, deserialized (the most CPU-efficient option).
RDD.persist(MEMORY_ONLY_SER)
Stored in the JVM heap, serialized.

.persist(MEMORY_AND_DISK)
Stored deserialized in the JVM heap; partitions that don't fit spill to local disk (OS disk / SSD).
.persist(MEMORY_AND_DISK_SER)
Same as above, but stored serialized in the JVM heap.
.persist(DISK_ONLY)
Stored on local disk only.
RDD.persist(MEMORY_ONLY_2)
Stored deserialized in the JVM heap, replicated on two nodes (Node X and Node Y).

.persist(MEMORY_AND_DISK_2)
Stored deserialized in the JVM heap with spill to disk, replicated on two nodes.
.persist(OFF_HEAP)
Stored serialized off-heap in Tachyon, where it can be shared across executor JVMs and even across applications (JVM-1 / App-1, JVM-2 / App-1, JVM-7 / App-2).
.unpersist()
Removes the RDD's cached blocks from the executors.

Which storage level should you choose?
- If RDD fits in memory, choose MEMORY_ONLY
- If not, use MEMORY_ONLY_SER w/ fast serialization library
- Don’t spill to disk unless functions that computed the datasets
are very expensive or they filter a large amount of data.
(recomputing may be as fast as reading from disk)
- Use replicated storage levels sparingly and only if you want fast
fault recovery (maybe to serve requests from a web app)
Remember!
- Intermediate data is automatically persisted during shuffle operations.
- PySpark: stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.
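A minimal sketch (assuming a hypothetical text file on HDFS) applying the storage-level advice above:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/events.log")   // hypothetical path

lines.persist(StorageLevel.MEMORY_ONLY)         // fits in memory: fastest
// lines.persist(StorageLevel.MEMORY_ONLY_SER)  // tight on memory: store serialized
// lines.persist(StorageLevel.MEMORY_AND_DISK)  // only if recomputing is very expensive

lines.count()      // the first action materializes the cache
lines.unpersist()  // release the blocks when done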
Default Memory Allocation in Executor JVM:
- 60% Cached RDDs (spark.storage.memoryFraction)
- 20% Shuffle memory
- 20% User Programs (remainder)

Spark uses memory for:
RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM's overall heap, set by spark.storage.memoryFraction.
Shuffle and aggregation buffers: when performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle.
User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything "left" in the JVM heap after the space for RDD storage and shuffle storage is allocated.
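As a sketch (values illustrative, legacy Spark 1.x properties), shifting these fractions looks like this:

import org.apache.spark.SparkConf

// Rebalance the legacy Spark 1.x fractions: less heap for cached RDDs,
// more for shuffle buffers; whatever remains goes to user code.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.5")   // default 0.6 (cached RDDs)
  .set("spark.shuffle.memoryFraction", "0.3")   // default 0.2 (shuffle/aggregation buffers)
// Pass this conf to the SparkContext as usual.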
1. Create an RDD
2. Put it into cache
3. Look at SparkContext logs
on the driver program or
Spark UI
INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)
logs will tell you how much memory each
partition is consuming, which you can
aggregate to get the total size of the RDD
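A quick walk-through of those three steps (the file path is hypothetical):

val rdd = sc.textFile("hdfs:///data/sample.txt")   // 1. create an RDD
rdd.cache()                                        // 2. put it into cache
rdd.count()                                        // an action forces the cache to fill
// 3. The driver logs / Spark UI Storage tab now show one
//    "Added rdd_N_M in memory ..." line per cached partition;
//    summing the reported sizes gives the RDD's total footprint.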
Serialization is used when:
Transferring data over the network
Spilling data to disk
Caching to memory serialized
Broadcasting variables

Java serialization vs. Kryo serialization

Java serialization:
• Uses Java's ObjectOutputStream framework
• Works with any class you create that implements java.io.Serializable
• You can control the performance of serialization more closely by extending java.io.Externalizable
• Flexible, but quite slow
• Leads to large serialized formats for many classes

Kryo serialization:
• Recommended serialization for production apps
• Use Kryo version 2 for speedy serialization (10x) and more compactness
• Does not support all Serializable types
• Requires you to register the classes you'll use in advance
• If set, will be used for serializing shuffle data between nodes and also serializing RDDs to disk
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
To register your own custom classes with Kryo, use the
registerKryoClasses method:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
- If your objects are large, you may need to increase
spark.kryoserializer.buffer.mb config property
- The default is 2, but this value needs to be large enough to
hold the largest object you will serialize.
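Putting the pieces together, a consolidated sketch (MyClass1/MyClass2 and the 64 MB buffer are illustrative placeholders, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

// Stand-ins for your own application classes (hypothetical):
case class MyClass1(id: Int)
case class MyClass2(name: String)

val conf = new SparkConf()
  .setAppName("KryoExample")   // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")   // raise from the default of 2 if your objects are large
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

val sc = new SparkContext(conf)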
Diagram: worker/executor JVMs holding RDDs, contrasting high churn vs. low churn of objects.
Cost of GC is proportional to the # of
Java objects
(so use an array of Ints instead of a
LinkedList)
To measure GC impact, add these flags:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
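A sketch of one way to attach those flags to the executor JVMs (the property value is copied from the flags above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
// GC timing now appears in each executor's stdout/stderr logs.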

Parallel GC (-XX:+UseParallelGC, -XX:ParallelGCThreads=<#>)
- Uses multiple threads to do young gen GC
- Will default to Serial on single core machines
- Aka "throughput collector"
- Good for when a lot of work is needed and long pauses are acceptable
- Use cases: batch processing

Parallel Old GC (-XX:+UseParallelOldGC)
- Uses multiple threads to do both young gen and old gen GC
- Also a multithreading compacting collector
- HotSpot does compaction only in old gen

CMS GC (-XX:+UseConcMarkSweepGC, -XX:ParallelCMSThreads=<#>)
- Concurrent Mark Sweep, aka "concurrent low pause collector"
- Tries to minimize pauses due to GC by doing most of the work concurrently with application threads
- Uses same algorithm on young gen as parallel collector
- Use cases:

G1 GC (-XX:+UseG1GC)
- Garbage First is available starting Java 7
- Designed to be long term replacement for CMS
- Is a parallel, concurrent and incrementally compacting low-pause GC
A job (Job #1, triggered by an action such as .collect()) is split into stages (Stage 1, 2, 3, 4, 5, ...), and each stage into tasks (Task #1, #2, #3, ...).

Scheduling pipeline: RDD Objects -> DAG Scheduler -> Task Scheduler -> Executor
(hand-offs between them: DAG -> TaskSet -> Task)

RDD Objects (e.g. rdd1.join(rdd2).groupBy(...).filter(...)):
- Build operator DAG

DAG Scheduler (agnostic to operators):
- Split graph into stages of tasks
- Submit each stage as ready
- "Stage failed" feedback comes back here from the Task Scheduler

Task Scheduler (doesn't know about stages):
- Launches individual tasks
- Retry failed or straggling tasks

Executor (task threads + block manager):
- Execute tasks
- Store and serve blocks

“One of the challenges in providing RDDs as an abstraction is
choosing a representation for them that can track lineage across a
wide range of transformations.”
“The most interesting question in designing this interface is how to
represent dependencies between RDDs.”
“We found it both sufficient and useful to classify dependencies
into two types:
• narrow dependencies, where each partition of the parent RDD
is used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may
depend on it.”
Figure: examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles (legend: cached partition, lost partition).
Figure: a lineage graph of RDDs A-F built from map, filter, join, and groupBy is split into Stage 1, Stage 2, and Stage 3 at the dependencies that require a shuffle.

“This distinction is useful for two reasons:
1) Narrow dependencies allow for pipelined execution on one cluster node,
which can compute all the parent partitions. For example, one can apply a
map followed by a filter on an element-by-element basis.
In contrast, wide dependencies require data from all parent partitions to be
available and to be shuffled across the nodes using a MapReduce-like
operation.
2) Recovery after a node failure is more efficient with a narrow dependency, as
only the lost parent partitions need to be recomputed, and they can be
recomputed in parallel on different nodes. In contrast, in a lineage graph with
wide dependencies, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution.”
Dependencies: Narrow vs Wide
To display the lineage of an RDD, Spark provides a toDebugString method:

scala> input.toDebugString
res85: String =
(2) data.text MappedRDD[292] at textFile at <console>:13
| data.text HadoopRDD[291] at textFile at <console>:13

scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
+-(2) MappedRDD[295] at map at <console>:17
| FilteredRDD[294] at filter at <console>:15
| MappedRDD[293] at map at <console>:15
| data.text MappedRDD[292] at textFile at <console>:13
| data.text HadoopRDD[291] at textFile at <console>:13
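A toy sketch (made-up data) tying the two ideas together: narrow transformations chain within one stage, while groupByKey introduces a ShuffledRDD, which toDebugString makes visible.

import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

val nums   = sc.parallelize(1 to 1000, 4)
val narrow = nums.map(x => (x % 10, x)).filter(_._2 > 100)  // narrow deps: no shuffle yet
val wide   = narrow.groupByKey()                            // wide dep: ShuffledRDD => new stage
println(wide.toDebugString)                                 // lineage shows the stage boundary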
How do you know if a shuffle will be called on a Transformation?
Note that repartition just calls coalesce with shuffle = true:
- repartition , join, cogroup, and any of the *By or *ByKey transformations
can result in shuffles
- If you declare a numPartitions parameter, it’ll probably shuffle
- If a transformation constructs a shuffledRDD, it’ll probably shuffle
- combineByKey calls a shuffle (so do other transformations like
groupByKey, which actually end up calling combineByKey)
def repartition(numPartitions: Int)(implicit
ord: Ordering[T] = null): RDD[T] = {
coalesce(numPartitions, shuffle = true)
}
RDD.scala
How do you know if a shuffle will be called on a Transformation?
Transformations that use “numPartitions” like distinct will probably shuffle:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] =
null): RDD[T] =
map(x => (x, null)).reduceByKey((x, y) => x,
numPartitions).map(_._1)

- An extra parameter you can pass to a k/v transformation to let Spark know that you will not be messing with the keys at all
- All operations that shuffle data over the network will benefit from partitioning
- Operations that benefit from partitioning (see the sketch below):
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, lookup, . . .
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L302
Link
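A sketch of the pre-partitioning idea referenced above (paths, key parsing, and the partition count are made up):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

// Hash-partition the users once and cache the partitioned copy; later
// joins against it reuse that partitioning instead of reshuffling users.
val users = sc.textFile("hdfs:///data/users.csv")   // hypothetical path
  .map(line => (line.split(",")(0), line))          // (userId, record)
  .partitionBy(new HashPartitioner(64))
  .persist()

val events = sc.textFile("hdfs:///data/events.csv")
  .map(line => (line.split(",")(0), line))

val joined = users.join(events)   // only the events side gets shuffled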
Source: Cloudera
sc.textFile("someFile.txt").
map(mapFunc).
flatMap(flatMapFunc).
filter(filterFunc).
count()
How many Stages will this code require?
Source: Cloudera

How many Stages will this DAG require? (example DAGs; Source: Cloudera)
Diagram: without broadcasting, the driver ships a separate copy of x = 5 with every task (T) to every executor (Ex).

• Broadcast variables – Send a large read-only lookup table to all the nodes, or
send a large feature vector in a ML algorithm to all nodes
• Accumulators – count events that occur during job execution for debugging
purposes. Example: How many lines of the input file were blank? Or how many
corrupt records were in the input dataset?
Spark supports 2 types of shared variables:
• Broadcast variables – allows your program to efficiently send a large, read-only
value to all the worker nodes for use in one or more Spark operations. Like
sending a large, read-only lookup table to all the nodes.
• Accumulators – allows you to aggregate values from worker nodes back to
the driver program. Can be used to count the # of errors seen in an RDD of
lines spread across 100s of nodes. Only the driver can access the value of an
accumulator, tasks cannot. For tasks, accumulators are write-only.
Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
For example, to give every node a copy of a large
input dataset efficiently
Spark also attempts to distribute broadcast variables
using efficient broadcast algorithms to reduce
communication cost
Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Python:
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value

Link
History: broadcast variables used to be pushed from the driver to every executor as a whole (e.g. a 20 MB file) over HTTP; the newer mechanism splits the value into 4 MB blocks and distributes them BitTorrent-style among the executors.

Diagram: executors exchanging broadcast blocks with one another (Source: Scott Martin).
Accumulators are variables that can only be “added” to through
an associative operation
Used to implement counters and sums, efficiently in parallel
Spark natively supports accumulators of numeric value types and
standard mutable collections, and programmers can extend
for new types
Only the driver program can read an accumulator’s value, not the
tasks
Scala:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value

Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
def f(x):
    global accum
    accum += x
rdd.foreach(f)
accum.value

PySpark at a Glance
- Write Spark jobs in Python
- Run interactive jobs in the shell
- Supports C extensions
PySpark sits on top of the Java API, which in turn drives the Spark Core Engine (Scala) running on the Local, Standalone, YARN, or Mesos schedulers. The PySpark layer itself is small: 41 files, 8,100 lines of code, 6,300 comments.
PySpark architecture (diagram): on the driver machine, the Python SparkContext controls a JVM SparkContext over Py4j; on each worker machine, the Executor JVM launches daemon.py, which forks Python worker processes connected via pipes and sockets. User functions F(x) run in the Python workers against the RDD data, while MLlib, SQL, and shuffle run inside the JVM, and spills go to local disk.

Data is stored as pickled objects in an RDD[Array[Byte]] (each pickled object is 100 KB - 1 MB). On the JVM side the lineage looks like: HadoopRDD -> MappedRDD -> PythonRDD, all of type RDD[Array[Byte]].
Choose Your Python Implementation
CPython (the default) vs. PyPy:
• PyPy has a JIT, so it's faster
• uses less memory
• CFFI support

Select the implementation on both the driver and worker machines via environment variables:
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/pyspark
OR
$ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:

Job          CPython 2.7   PyPy 2.3.1   Speed up
Word Count   41 s          15 s         2.7 x
Sort         46 s          44 s         1.05 x
Stats        174 s         3.6 s        48 x
Here is the code used for benchmark:
rdd = sc.textFile("text")
def wordcount():
rdd.flatMap(lambda x:x.split('/'))
.map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
rdd.sortBy(lambda x:x, 1).count()
def stats():
sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
https://github.com/apache/spark/pull/2144
Spark sorted the same data 3X faster
using 10X fewer machines
than Hadoop MR in 2013.
Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
100TB Daytona Sort Competition 2014
More info:
http://sortbenchmark.org
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
All the sorting took place on disk (HDFS) without
using Spark’s in-memory cache!
- Stresses “shuffle” which underpins everything from SQL to Mllib
- Sorting is challenging b/c there is no reduction in data
- Sort 100 TB = 500 TB disk I/O and 200 TB network
Engineering Investment in Spark:
- Sort-based shuffle (SPARK-2045)
- Netty native network transport (SPARK-2468)
- External shuffle service (SPARK-3796)
Clever Application level Techniques:
- GC and cache friendly memory layout
- Pipelining

EC2: i2.8xlarge
(206 workers)
- Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores
- 244 GB of RAM
- 8 x 800 GB SSD and RAID 0 setup formatted with ext4
- ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes
- Each record: 100 bytes (10 byte key & 90 byte value)
- OpenJDK 1.7
- HDFS 2.4.1 w/ short circuit local reads enabled
- Apache Spark 1.2.0
- Speculative Execution off
- Increased Locality Wait to infinite
- Compression turned off for input, output & network
- Used Unsafe to put all the data off-heap and managed
it manually (i.e. never triggered the GC)
- 32 slots per machine
- 6,592 slots total
spark.shuffle.spill=false
(Affects the reducer side of shuffle operations such as groupByKey, sortByKey, and reduceByKey, and keeps all the data in memory)
External shuffle service:
- Must turn this on for dynamic allocation in YARN
- Standalone: the Worker JVM serves the shuffle files
- YARN: the Node Manager serves the shuffle files
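A small configuration sketch (values illustrative) for the external shuffle service mentioned above:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")    // Worker/NodeManager serves shuffle files
  .set("spark.dynamicAllocation.enabled", "true")  // executors can now be removed safely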

- The old way of serving shuffle files was slow because it had to copy the data 3 times: map output file on local dir -> Linux kernel buffer -> Executor (Ex) -> NIC buffer
- The new transport uses a technique called zero-copy, a map-side optimization to serve data very quickly to requesting reducers: map output file on local dir -> NIC buffer
Diagram: Map() tasks feeding Reduce() tasks. The map phase is entirely bounded by I/O, reading from HDFS and writing out locally sorted files (TimSort); the reduce phase is mostly network bound. With < 10,000 reducers, notice that each map has to keep 3 file handles open (diagram legend: = 5 blocks).
Diagram: with 28,000 unique blocks, RF = 2, and 250,000+ reducers, only one file handle is open at a time (diagram legend: = 3.6 GB).

Diagram: the actual final run used 5 waves of maps and 5 waves of reduces with 250,000+ reducers, RF = 2, and 28,000 unique blocks; the map side used TimSort and the reduce side MergeSort, fully saturating the 10 Gbit link.
Link
UserID     Name              Age   Location      Pet
28492942   John Galt         32    New York      Sea Horse
95829324   Winston Smith     41    Oceania       Ant
92871761   Tom Sawyer        17    Mississippi   Raccoon
37584932   Carlos Hinojosa   33    Orlando       Cat
73648274   Luis Rodriguez    34    Orlando       Dogs

Diagram: Spark SQL is accessed via JDBC/ODBC or directly from your app, . . .
SchemaRDD
- RDD of Row objects, each representing a record
- Row objects = type + col. name of each
- Stores data very efficiently by taking advantage of the schema
- SchemaRDDs are also regular RDDs, so you can run
transformations like map() or filter()
- Allows new operations, like running SQL on objects
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Warning! Schema inference only looks at the first row.
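A hedged, Spark 1.x-era sketch of querying a SchemaRDD with SQL (the JSON path is hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("hdfs:///data/people.json")   // returns a SchemaRDD
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.map(row => row(0)).collect().foreach(println)           // still a regular RDD of Rows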

TwitterUtils.createStream(...)
.filter(_.getText.contains("Spark"))
.countByWindow(Seconds(5))
Input sources: Kafka, Flume, HDFS / S3, Kinesis, Twitter, TCP sockets
Output sinks: HDFS, Cassandra, dashboards, databases
Spark Streaming is:
- Scalable
- High-throughput
- Fault-tolerant
Complex algorithms can be expressed using:
- Spark transformations: map(), reduce(), join(), etc.
- MLlib + GraphX
- SQL
Batch and realtime: one unified API
Tathagata Das (TD)
- Lead developer of Spark Streaming + committer on Apache Spark core
- Helped re-write Spark Core internals in 2012 to make it 10x faster to support streaming use cases
- On leave from the UC Berkeley PhD program
- Ex: intern @ Amazon, intern @ Conviva, research assistant @ Microsoft Research India
- 1 guy; does not scale
Spark Streaming, by contrast:
- Scales to 100s of nodes
- Batch sizes as small as half a second
- Processing latency as low as 1 second
- Exactly-once semantics no matter what fails

Example use cases:
- Live statistics: page views -> Kafka for buffering -> Spark for processing
- Anomaly detection: join 2 live data sources, e.g. smart meter readings with live weather data
Discretized Stream (DStream): the input data stream is chopped into batches every X seconds, and Spark emits batches of processed data.
[Diagram: with a batch interval of 5 seconds, the input DStream produces one RDD every 5 seconds (RDD @ T=0, RDD @ T=+5), each assembled from the blocks received in that interval (Block #1, #2, #3).]

[Diagram: when a 5-second batch is materialized, its blocks (Block #1-#3) become the partitions (Part. #1-#3) of that batch's RDD; applying flatMap() to linesDStream turns each linesRDD into a wordsRDD of the wordsDStream.]
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two working threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)
# Create a DStream that will connect to hostname:port, like localhost:9999
linesDStream = ssc.socketTextStream("localhost", 9999)
# Split each line into words
wordsDStream = linesDStream.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairsDStream = wordsDStream.map(lambda word: (word, 1))
wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCountsDStream.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
DStream lineage: linesDStream -> wordsDStream -> pairsDStream -> wordCountsDStream
Terminal #1:
$ nc -lk 9999
hello world

Terminal #2:
$ ./network_wordcount.py localhost 9999
. . .
--------------------------
Time: 2015-04-25 15:25:21
--------------------------
(hello, 2)
(world, 1)
[Diagram: batch interval = 600 ms. The driver starts a long-running receiver (R) as a task in one executor; the receiver chops the incoming stream into blocks (block, P1), and each block is also replicated to a second executor. Each Worker (W) runs an Executor (Ex) with task threads (T), internal threads, and local OS disk / SSDs.]

[Diagram sequence, batch interval = 600 ms: every ~200 ms the receiver produces another block (block P2, then block P3), each replicated to a second executor; when the batch interval ends, the blocks collected during that interval become the partitions (RDD P1, P2, P3) of that batch's RDD on the executors that hold them.]

[Diagram: with 2 input DStreams there are 2 receivers (R), each running in a different executor; each receiver's blocks (block P1, P2, P3) are replicated to another executor, and at the end of the 600 ms batch interval each DStream materializes its own RDD whose partitions (RDD P1-P3) are the received blocks.]

[Diagram: Union! The per-batch RDDs of the 2 input DStreams (partitions RDD P1-P3 and RDD P4-P6) can be combined with union() into a single RDD for that 600 ms batch interval.]
Streaming input sources:
- Directly available in the StreamingContext API: file systems, socket connections, Akka Actors
- Require linking against extra dependencies: Kafka, Flume, Twitter
- Require implementing a user-defined receiver: anywhere else

DStream transformations (a short chaining sketch follows the list):
map(func)
flatMap(func)
filter(func)
repartition(numPartitions)
union(otherStream)
count()
reduce(func)
countByValue()
reduceByKey(func, [numTasks])
join(otherStream, [numTasks])
cogroup(otherStream, [numTasks])
transform(func)   (applies an arbitrary RDD-to-RDD function)
updateStateByKey(func)*
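A small, assumed PySpark sketch chaining a couple of the transformations listed above, reusing the wordsDStream from the network word count example earlier; the length threshold is arbitrary.

# Assumed illustration: keep only longer words, then count occurrences per batch
longWordsDStream = wordsDStream.filter(lambda w: len(w) > 5)
longWordCountsDStream = longWordsDStream.countByValue()
longWordCountsDStream.pprint()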
updateStateByKey(func): allows you to maintain arbitrary state while continuously updating it with new information.
To use:
1) Define the state (an arbitrary data type)
2) Define the state update function (specify with a function how to update the state using the previous state and the new values from the input stream)
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    # add the new values to the previous running count to get the new count
    return sum(newValues, runningCount)
To maintain a running count of each word seen in a text data stream (here the running count is an integer type of state):
runningCounts = pairs.updateStateByKey(updateFunction)
where pairs is a DStream of (word, 1) tuples, e.g. (cat, 1)
* Requires a checkpoint directory to be configured
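For completeness, a minimal assumed sketch of configuring that checkpoint directory (the path is illustrative; any HDFS, S3 or local directory works), reusing the ssc, pairsDStream and updateFunction from the examples above.

# Assumed example: stateful operations need a checkpoint directory
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")   # illustrative path
runningCountsDStream = pairsDStream.updateStateByKey(updateFunction)
runningCountsDStream.pprint()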
transform(func): can be used to apply any RDD operation that is not exposed in the DStream API; it takes an RDD and returns an RDD, so MLlib or GraphX operations can be applied as well.
For example:
- Functionality to join every batch in a data stream with another dataset is not directly exposed in the DStream API.
- Suppose you want to do real-time data cleaning by joining the input data stream with pre-computed spam information and then filtering based on it:

spamInfoRDD = sc.pickleFile(...) # RDD containing spam information
# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd:
    rdd.join(spamInfoRDD).filter(...))
[Diagram: a windowed DStream built over the original DStream (batches at time 1..6). With a window length of 3 time units and a sliding interval of 2 time units, a windowed RDD is produced at time 3 (RDD @ 3) and at time 5 (RDD @ 5).]
* Both window length and sliding interval must be multiples of the batch interval of the source DStream
# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)

Window operations (a usage sketch follows):
window(windowLength, slideInterval)
countByWindow(windowLength, slideInterval)
reduceByWindow(func, windowLength, slideInterval)
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
countByValueAndWindow(windowLength, slideInterval, [numTasks])
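An assumed PySpark sketch of one of these operations, using the pairsDStream from the earlier word count; the window and slide durations are in seconds and are illustrative.

# Assumed example: count all elements seen in the last 30 seconds, every 10 seconds.
# Window operations with an incremental/inverse component need ssc.checkpoint(...) configured.
windowedCountDStream = pairsDStream.countByWindow(30, 10)
windowedCountDStream.pprint()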
API docs: DStream, PairDStreamFunctions, JavaDStream, JavaPairDStream (and the Python DStream)
Output operations (a foreachRDD sketch follows):
print()
saveAsTextFile(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
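A minimal assumed sketch of foreachRDD(), the catch-all output operation, pushing each batch of the word counts to an external system; save_partition is a hypothetical user function, not a Spark or database API.

# Assumed example: write each partition of each batch's RDD to an external store.
def save_partition(partition_iter):
    # A real implementation would open one connection per partition here
    for (word, count) in partition_iter:
        pass  # e.g. insert (word, count) into a database

wordCountsDStream.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))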
Spark Summit East 2015 Advanced Devops Student Slides

  • 1. DEVOPS ADVANCED CLASS March 2015: Spark Summit East 2015 http://slideshare.net/databricks www.linkedin.com/in/blueplastic
  • 2. making big data simple Databricks Cloud: “A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.” • Founded in late 2013 • by the creators of Apache Spark • Original team from UC Berkeley AMPLab • Raised $47 Million in 2 rounds • ~50 employees • We’re hiring! • Level 2/3 support partnerships with • Cloudera • Hortonworks • MapR • DataStax (http://databricks.workable.com)
  • 3. The Databricks team contributed more than 75% of the code added to Spark in the past year
  • 4. AGENDA • History of Spark • RDD fundamentals • Spark Runtime Architecture Integration with Resource Managers (Standalone, YARN) • GUIs • Lab: DevOps 101 Before Lunch • Memory and Persistence • Jobs -> Stages -> Tasks • Broadcast Variables and Accumulators • PySpark • DevOps 102 • Shuffle • Spark Streaming After Lunch
  • 5. Some slides will be skipped Please keep Q&A low during class (5pm – 5:30pm for Q&A with instructor) 2 anonymous surveys: Pre and Post class Lunch: noon – 1pm 2 breaks (before lunch and after lunch)
  • 6. @ • AMPLab project was launched in Jan 2011, 6 year planned duration • Personnel: ~65 students, postdocs, faculty & staff • Funding from Government/Industry partnership, NSF Award, Darpa, DoE, 20+ companies • Created BDAS, Mesos, SNAP. Upcoming projects: Succinct & Velox. “Unknown to most of the world, the University of California, Berkeley’s AMPLab has already left an indelible mark on the world of information technology, and even the web. But we haven’t yet experienced the full impact of the group[…] Not even close” - Derrick Harris, GigaOm, Aug 2014 Algorithms Machines People
  • 8. RDBMS Streaming SQL GraphX Hadoop Input Format Apps Distributions: - CDH - HDP - MapR - DSE Tachyon MLlib DataFrames API
  • 10. General Batch Processing Pregel Dremel Impala GraphLab Giraph Drill Tez S4 Storm Specialized Systems (iterative, interactive, ML, streaming, graph, SQL, etc) General Unified Engine (2004 – 2013) (2007 – 2015?) (2014 – ?) Mahout
  • 14. CPUs: 10 GB/s 100 MB/s 0.1 ms random access $0.45 per GB 600 MB/s 3-12 ms random access $0.05 per GB 1 Gb/s or 125 MB/s Network 0.1 Gb/s Nodes in another rack Nodes in same rack 1 Gb/s or 125 MB/s
  • 15. June 2010 http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf “The main abstraction in Spark is that of a resilient dis- tributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
  • 16. April 2012 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf “We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.” “Best Paper Award and Honorable Mention for Community Award” - NSDI 2012 - Cited 392 times!
  • 18. sqlCtx = new HiveContext(sc) results = sqlCtx.sql( "SELECT * FROM people") names = results.map(lambda p: p.name) Seemlessly mix SQL queries with Spark programs. Coming soon! (Will be published in the upcoming weeks for SIGMOD 2015)
  • 19. graph = Graph(vertices, edges) messages = spark.textFile("hdfs://...") graph2 = graph.joinVertices(messages) { (id, vertex, msg) => ... } https://amplab.cs.berkeley.edu/wp- content/uploads/2013/05/grades- graphx_with_fonts.pdf
  • 21. http://shop.oreilly.com/product/0636920028512.do eBook: $33.99 Print: $39.99 PDF, ePub, Mobi, DAISY Shipping now! http://www.amazon.com/Learning-Spark-Lightning- Fast-Data-Analysis/dp/1449358624 $30 @ Amazon:
  • 23. http://tinyurl.com/dsesparklab - 102 pages - DevOps style - For complete beginners - Includes: - Spark Streaming - Dangers of GroupByKey vs. ReduceByKey
  • 24. http://tinyurl.com/cdhsparklab - 109 pages - DevOps style - For complete beginners - Includes: - PySpark - Spark SQL - Spark-submit
  • 30. Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 RDD w/ 4 partitions Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 An RDD can be created 2 ways: - Parallelize a collection - Read data from an external source (S3, C*, HDFS, etc) logLinesRDD
  • 31. # Parallelize in Python wordsRDD = sc.parallelize([“fish", “cats“, “dogs”]) // Parallelize in Scala val wordsRDD= sc.parallelize(List("fish", "cats", "dogs")) // Parallelize in Java JavaRDD<String> wordsRDD = sc.parallelize(Arrays.asList(“fish", “cats“, “dogs”)); - Take an existing in-memory collection and pass it to SparkContext’s parallelize method - Not generally used outside of prototyping and testing since it requires entire dataset in memory on one machine
  • 32. # Read a local txt file in Python linesRDD = sc.textFile("/path/to/README.md") // Read a local txt file in Scala val linesRDD = sc.textFile("/path/to/README.md") // Read a local txt file in Java JavaRDD<String> lines = sc.textFile("/path/to/README.md"); - There are other methods to read data from HDFS, C*, S3, HBase, etc.
  • 33. Error, ts, msg1 Warn, ts, msg2 Error, ts, msg1 Info, ts, msg8 Warn, ts, msg2 Info, ts, msg8 Error, ts, msg3 Info, ts, msg5 Info, ts, msg5 Error, ts, msg4 Warn, ts, msg9 Error, ts, msg1 logLinesRDD Error, ts, msg1 Error, ts, msg1 Error, ts, msg3 Error, ts, msg4 Error, ts, msg1 errorsRDD .filter( ) (input/base RDD)
  • 34. errorsRDD .coalesce( 2 ) Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD Error, ts, msg1 Error, ts, msg1 Error, ts, msg3 Error, ts, msg4 Error, ts, msg1 .collect( ) Driver
  • 37. .collect( ) logLinesRDD errorsRDD cleanedRDD .filter( ) .coalesce( 2 ) Driver Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1
  • 41. logLinesRDD errorsRDD Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD .filter( ) Error, ts, msg1 Error, ts, msg1 Error, ts, msg1 errorMsg1RDD .collect( ) .saveToCassandra( ) .count( ) 5
  • 42. logLinesRDD errorsRDD Error, ts, msg1 Error, ts, msg3 Error, ts, msg1 Error, ts, msg4 Error, ts, msg1 cleanedRDD .filter( ) Error, ts, msg1 Error, ts, msg1 Error, ts, msg1 errorMsg1RDD .collect( ) .count( ) .saveToCassandra( ) 5
  • 43. P-1 logLinesRDD (HadoopRDD) P-2 P-3 P-4 P-1 errorsRDD (filteredRDD) P-2 P-3 P-4 Task-1 Task-2 Task-3 Task-4 Path = hdfs://. . . func = _.contains(…) shouldCache=false logLinesRDD errorsRDD Dataset-level view: Partition-level view:
  • 44. 1) Create some input RDDs from external data or parallelize a collection in your driver program. 2) Lazily transform them to define new RDDs using transformations like filter() or map() 3) Ask Spark to cache() any intermediate RDDs that will need to be reused. 4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
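  A minimal PySpark sketch of this workflow (assuming a SparkContext named sc; the file path and filter predicate are made up):
  logLinesRDD = sc.textFile("hdfs:///path/to/logs.txt")                # 1) input RDD from external data
  errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))  # 2) lazy transformation
  errorsRDD.cache()                                                    # 3) cache the RDD that will be reused
  print(errorsRDD.count())                                             # 4) action kicks off the computation
  print(errorsRDD.take(3))                                             #    a second action reuses the cached partitions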
  • 45. map() intersection() cartesian() flatMap() distinct() pipe() filter() groupByKey() coalesce() mapPartitions() reduceByKey() repartition() mapPartitionsWithIndex() sortByKey() partitionBy() sample() join() ... union() cogroup() ... (lazy) - Most transformations are element-wise (they work on one element at a time), but this is not true for all transformations
  • 46. reduce() takeOrdered() collect() saveAsTextFile() count() saveAsSequenceFile() first() saveAsObjectFile() take() countByKey() takeSample() foreach() saveToCassandra() ...
  • 47. • HadoopRDD • FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD • CassandraRDD (DataStax) • GeoRDD (ESRI) • EsSpark (ElasticSearch)
  • 50. 1) Set of partitions (“splits”) 2) List of dependencies on parent RDDs 3) Function to compute a partition given parents 4) Optional preferred locations 5) Optional partitioning info for k/v RDDs (Partitioner) This captures all current Spark operations! * * * * *
  • 51. Partitions = one per HDFS block Dependencies = none Compute (partition) = read corresponding block preferredLocations (part) = HDFS block location Partitioner = none * * * * *
  • 52. Partitions = same as parent RDD Dependencies = “one-to-one” on parent Compute (partition) = compute parent and filter it preferredLocations (part) = none (ask parent) Partitioner = none * * * * *
  • 53. Partitions = One per reduce task Dependencies = “shuffle” on each parent Compute (partition) = read and join shuffled data preferredLocations (part) = none Partitioner = HashPartitioner(numTasks) * * * * *
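  Some of these pieces are visible from the public API; a small PySpark sketch (assuming a SparkContext named sc):
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)
  shuffled = pairs.reduceByKey(lambda x, y: x + y)
  print(shuffled.getNumPartitions())   # 1) the set of partitions
  print(shuffled.toDebugString())      # 2) lineage / dependencies on parent RDDs
  print(shuffled.partitioner)          # 5) partitioning info for k/v RDDs (hash partitioned here)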
  • 54. val cassandraRDD = sc .cassandraTable("ks", "mytable") .select("col-1", "col-3") .where("col-5 = ?", "blue") Keyspace Table {Server side column & row selection
  • 55. Start the Spark shell by passing in a custom cassandra.input.split.size (for dealing with wide rows): ubuntu@ip-10-0-53-24:~$ dse spark -Dspark.cassandra.input.split.size=2000 [Spark shell banner, version 0.9.1] Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51) Type in expressions to have them evaluated. Type :help for more information. Creating SparkContext... Created spark context. Spark context available as sc. scala> "The cassandra.input.split.size parameter defaults to 100,000. This is the approximate number of physical rows in a single Spark partition. If you have really wide rows (thousands of columns), you may need to lower this value. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level."
  • 56. https://github.com/datastax/spark-cassandra-connector Spark Executor Spark-C* Connector C* Java Driver - Open Source - Implemented mostly in Scala - Scala + Java APIs - Does automatic type conversions
  • 58. “Simple things should be simple, complex things should be possible” - Alan Kay
  • 59. DEMO:
  • 61. - Local - Standalone Scheduler - YARN - Mesos Static Partitioning Dynamic Partitioning
  • 62. History: Hadoop MapReduce 1 architecture - a central JobTracker (JT) and NameNode (NN), with TaskTrackers (TT) and DataNodes (DN) co-located on each worker machine running map (M) and reduce (R) slots.
  • 64. JVM: Ex + Driver Disk RDD, P1 Task 3 options: - local - local[N] - local[*] RDD, P2 RDD, P1 RDD, P2 RDD, P3 Task Task Task Task Task CPUs: Task Task Task Task Task Task Internal Threads val conf = new SparkConf() .setMaster("local[12]") .setAppName("MyFirstApp") .set("spark.executor.memory", "3g") val sc = new SparkContext(conf) > ./bin/spark-shell --master local[12] > ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar Worker Machine
  • 66. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Ex RDD, P4 W RDD, P6 RDD, P1 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Ex RDD, P7 W RDD, P8 RDD, P2 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Spark Master Ex RDD, P5 W RDD, P3 RDD, P2 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD T T T T Each Worker can have a different spark-env.sh (e.g. SPARK_WORKER_CORES, SPARK_LOCAL_DIRS) > ./bin/spark-submit --name "SecondApp" --master spark://host1:port1 myApp.jar
  • 67. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Ex RDD, P4 W RDD, P6 RDD, P1 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Ex RDD, P7 W RDD, P8 RDD, P2 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Spark Master Ex RDD, P5 W RDD, P3 RDD, P2 T T T T T T Internal Threads SSD SSDOS Disk SSD SSD Spark Master Spark Master T T T T Each Worker can have a different spark-env.sh (e.g. SPARK_WORKER_CORES, SPARK_LOCAL_DIRS) "I'm HA via ZooKeeper" - more Masters can be added live > ./bin/spark-submit --name "SecondApp" --master spark://host1:port1,host2:port2 myApp.jar
  • 68. W Driver SSDOS Disk W SSDOS Disk W SSDOS Disk Spark Master W SSDOS Disk (multiple apps) Ex Ex Ex Ex Driver ExEx Ex Ex
  • 69. Driver SSDOS Disk SSDOS Disk SSDOS Disk Spark Master W SSDOS Disk (single app) Ex W Ex W Ex W Ex W Ex W Ex W Ex W Ex conf/spark-env.sh: SPARK_WORKER_INSTANCES: [default: 1] # of worker instances to run on each machine SPARK_WORKER_CORES: [default: ALL] # of cores to allow Spark applications to use on the machine SPARK_WORKER_MEMORY: [default: TOTAL RAM – 1 GB] Total memory to allow Spark applications to use on the machine SPARK_DAEMON_MEMORY: [default: 512 MB] Memory to allocate to the Spark master and worker daemons themselves
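  For example, a conf/spark-env.sh along these lines (illustrative values only, not a recommendation) caps what each worker machine offers to Spark applications:
  # conf/spark-env.sh on each worker machine (illustrative values)
  export SPARK_WORKER_INSTANCES=1   # one worker JVM per machine
  export SPARK_WORKER_CORES=8       # cores Spark applications may use on this machine
  export SPARK_WORKER_MEMORY=24g    # total memory Spark applications may use on this machine
  export SPARK_DAEMON_MEMORY=512m   # memory for the master/worker daemons themselves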
  • 70. Standalone settings - Apps submitted will run in FIFO mode by default spark.cores.max: maximum amount of CPU cores to request for the application from across the cluster spark.executor.memory: Memory for each executor
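  A minimal PySpark sketch of setting these per application (the master URL and values are made up):
  from pyspark import SparkConf, SparkContext
  conf = (SparkConf()
          .setMaster("spark://host1:7077")
          .setAppName("StandaloneSettingsDemo")
          .set("spark.cores.max", "24")         # cap total CPU cores across the cluster
          .set("spark.executor.memory", "4g"))  # memory per executor
  sc = SparkContext(conf=conf)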
  • 84. NodeManager Resource Manager NodeManager Container NodeManager App Master Client #1 ContainerApp Master Container Container Client #2 I’m HA via ZooKeeper Scheduler Apps Master
  • 86. NodeManager Resource Manager NodeManager Container NodeManager App Master Client #1 Executor RDD T Container Executor RDD T Driver (cluster mode) Container Executor RDD T - Does not support Spark Shells
  • 87. YARN settings --num-executors: controls how many executors will be allocated --executor-memory: RAM for each executor --executor-cores: CPU cores for each executor spark.dynamicAllocation.enabled spark.dynamicAllocation.minExecutors spark.dynamicAllocation.maxExecutors spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (N) spark.dynamicAllocation.schedulerBacklogTimeout (M) spark.dynamicAllocation.executorIdleTimeout (K) Dynamic Allocation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
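  For example, a static request on YARN might look like the first command below, or you can hand control to dynamic allocation with the properties above (illustrative numbers; dynamic allocation also needs the external shuffle service enabled on the NodeManagers, as noted later):
  # Static sizing (illustrative numbers)
  ./bin/spark-submit --master yarn --deploy-mode cluster \
    --num-executors 10 --executor-cores 4 --executor-memory 8g myApp.jar
  # Dynamic allocation instead of a fixed --num-executors
  ./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.maxExecutors=50 myApp.jar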
  • 88. YARN resource manager UI: http://<ip address>:8088 (No apps running)
  • 89. [ec2-user@ip-10-0-72-36 ~]$ spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar 10
  • 90. App running in client mode
  • 92. [ec2-user@ip-10-0-72-36 ~]$ spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar 10
  • 93. App running in cluster mode
  • 94. App running in cluster mode
  • 95. App running in cluster mode
  • 97. Spark mode | Central Master | Who starts Executors? | Tasks run in
  Local | [none] | Human being | Executor
  Standalone | Standalone Master | Worker JVM | Executor
  YARN | YARN App Master | Node Manager | Executor
  Mesos | Mesos Master | Mesos Slave | Executor
  • 98. spark-submit provides a uniform interface for submitting jobs across all cluster managers bin/spark-submit --master spark://host:7077 --executor-memory 10g my_script.py Source: Learning Spark
  • 100. Ex RDD, P1 RDD, P2 RDD, P1 T T T T T T Internal Threads Recommended to use at most 75% of a machine’s memory for Spark Minimum Executor heap size should be 8 GB Max Executor heap size depends… maybe 40 GB (watch GC) Memory usage is greatly affected by storage level and serialization format
  • 108. RDD.persist(MEMORY_ONLY_2) JVM on Node X deserialized deserialized JVM on Node Y
  • 113. ? - If RDD fits in memory, choose MEMORY_ONLY - If not, use MEMORY_ONLY_SER w/ fast serialization library - Don’t spill to disk unless functions that computed the datasets are very expensive or they filter a large amount of data. (recomputing may be as fast as reading from disk) - Use replicated storage levels sparingly and only if you want fast fault recovery (maybe to serve requests from a web app)
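  A minimal PySpark sketch of these choices (assuming a SparkContext named sc; the path is made up):
  from pyspark import StorageLevel
  rdd = sc.textFile("hdfs:///path/to/data")
  rdd.persist(StorageLevel.MEMORY_ONLY)        # fits in memory: fastest
  # rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill only if recomputing is very expensive
  # rdd.persist(StorageLevel.MEMORY_ONLY_2)    # replicated: use sparingly, for fast fault recovery
  rdd.count()       # the first action materializes the cache
  rdd.unpersist()   # release it when you are done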
  • 114. Intermediate data is automatically persisted during shuffle operations Remember!
  • 115. PySpark: stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized storage level.
  • 116. Default memory allocation in the Executor JVM: 60% Cached RDDs (spark.storage.memoryFraction), 20% Shuffle memory, 20% User programs (remainder)
  • 117. Spark uses memory for: RDD storage: when you call .persist() or .cache(), Spark limits the amount of memory used for caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction. Shuffle and aggregation buffers: when performing shuffle operations, Spark creates intermediate buffers for storing shuffle output data; these buffers also hold intermediate results of aggregations in addition to data that will be directly output as part of the shuffle. User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory; for instance, if an application allocates large arrays or other objects, these will contend for overall memory. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage is allocated.
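  In the Spark 1.x static memory model these fractions are configurable; a sketch with illustrative values:
  from pyspark import SparkConf, SparkContext
  conf = (SparkConf()
          .setAppName("MemoryFractionsDemo")
          .set("spark.storage.memoryFraction", "0.5")   # cached RDDs (default 0.6)
          .set("spark.shuffle.memoryFraction", "0.3"))  # shuffle/aggregation buffers (default 0.2)
  sc = SparkContext(conf=conf)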
  • 118. 1. Create an RDD 2. Put it into cache 3. Look at SparkContext logs on the driver program or Spark UI INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB) logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD
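  For example (assuming a SparkContext named sc):
  rdd = sc.textFile("/path/to/README.md")   # 1) create an RDD
  rdd.cache()                               # 2) mark it for caching
  rdd.count()                               # an action forces it to materialize
  # 3) the driver logs / Storage tab of the Spark UI now report the size of each cached partition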
  • 120. Serialization is used when: transferring data over the network, spilling data to disk, caching to memory in serialized form, and broadcasting variables.
  • 121. Java serialization vs. Kryo serialization • Java: Uses Java’s ObjectOutputStream framework • Works with any class you create that implements java.io.Serializable • You can control the performance of serialization more closely by extending java.io.Externalizable • Flexible, but quite slow • Leads to large serialized formats for many classes • Kryo: Recommended serialization for production apps • Use Kryo version 2 for speedy serialization (10x) and more compactness • Does not support all Serializable types • Requires you to register the classes you’ll use in advance • If set, will be used for serializing shuffle data between nodes and also for serializing RDDs to disk conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  • 122. To register your own custom classes with Kryo, use the registerKryoClasses method: val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) - If your objects are large, you may need to increase the spark.kryoserializer.buffer.mb config property - The default is 2 (MB), but this value needs to be large enough to hold the largest object you will serialize.
  • 124. . . . Ex RDD W RDD RDD High churn Cost of GC is proportional to the # of Java objects (so use an array of Ints instead of a LinkedList) -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps To measure GC impact:
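  These JVM flags can be passed to the executors through spark.executor.extraJavaOptions; a minimal sketch:
  from pyspark import SparkConf, SparkContext
  conf = (SparkConf()
          .setAppName("GCLoggingDemo")
          .set("spark.executor.extraJavaOptions",
               "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"))
  sc = SparkContext(conf=conf)
  # GC activity then shows up in each executor's stdout/stderr logs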
  • 125. - Parallel GC (-XX:+UseParallelGC, -XX:ParallelGCThreads=<#>): uses multiple threads to do young gen GC; will default to Serial on single-core machines; aka the "throughput collector"; good for when a lot of work is needed and long pauses are acceptable; use cases: batch processing
  - Parallel Old GC (-XX:+UseParallelOldGC): uses multiple threads to do both young gen and old gen GC; also a multithreading compacting collector (HotSpot does compaction only in old gen)
  - CMS GC (-XX:+UseConcMarkSweepGC, -XX:ParallelCMSThreads=<#>): Concurrent Mark Sweep, aka the "concurrent low pause collector"; tries to minimize pauses due to GC by doing most of the work concurrently with application threads; uses the same algorithm on young gen as the parallel collector
  - G1 GC (-XX:+UseG1GC): Garbage First is available starting Java 7; designed to be the long-term replacement for CMS; a parallel, concurrent and incrementally compacting low-pause GC
  • 127. Stage 1 Stage 2 Stage 3 Stage 5 . . Job #1 .collect( ) Task #1 Task #2 Task #3 . . Stage 4
  • 128. Task Scheduler Task threads Block manager RDD Objects DAG Scheduler Task Scheduler Executor Rdd1.join(rdd2) .groupBy(…) .filter(…) - Build operator DAG - Split graph into stages of tasks - Submit each stage as ready - Execute tasks - Store and serve blocks DAG TaskSet Task Agnostic to operators Doesn’t know about stages | Stage failed - Launches individual tasks - Retry failed or straggling tasks
  • 130. “One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.” “The most interesting question in designing this interface is how to represent dependencies between RDDs.” “We found it both sufficient and useful to classify dependencies into two types: • narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD • wide dependencies, where multiple child partitions may depend on it.”
  • 131. Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Requires shuffle
  • 132. = cached partition = RDD join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map = lost partition
  • 133. “This distinction is useful for two reasons: 1) Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation. 2) Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.” Dependencies: Narrow vs Wide
  • 134. scala> input.toDebugString res85: String = (2) data.text MappedRDD[292] at textFile at <console>:13 | data.text HadoopRDD[291] at textFile at <console>:13 scala> counts.toDebugString res84: String = (2) ShuffledRDD[296] at reduceByKey at <console>:17 +-(2) MappedRDD[295] at map at <console>:17 | FilteredRDD[294] at filter at <console>:15 | MappedRDD[293] at map at <console>:15 | data.text MappedRDD[292] at textFile at <console>:13 | data.text HadoopRDD[291] at textFile at <console>:13 To display the lineage of an RDD, Spark provides a toDebugString method:
  • 135. How do you know if a shuffle will be called on a Transformation? - repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles - If you declare a numPartitions parameter, it’ll probably shuffle - If a transformation constructs a ShuffledRDD, it’ll probably shuffle - combineByKey calls a shuffle (so do other transformations like groupByKey, which actually end up calling combineByKey) Note that repartition just calls coalesce with shuffle = true (from RDD.scala): def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = { coalesce(numPartitions, shuffle = true) }
  • 136. How do you know if a shuffle will be called on a Transformation? Transformations that use “numPartitions” like distinct will probably shuffle: def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  • 137. - preservesPartitioning: an extra parameter you can pass to a k/v transformation to let Spark know that you will not be messing with the keys at all - All operations that shuffle data over the network will benefit from partitioning - Operations that benefit from partitioning: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, lookup, . . . https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L302
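  A small PySpark sketch of why this matters (assuming a SparkContext named sc): mapValues() cannot change keys, so it keeps the partitioner, while a plain map() drops it and forces later *ByKey operations to re-shuffle.
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).partitionBy(4)
  kept = pairs.mapValues(lambda v: v * 10)             # partitioner preserved
  dropped = pairs.map(lambda kv: (kv[0], kv[1] * 10))  # partitioner lost
  print(kept.partitioner is not None)    # True
  print(dropped.partitioner is None)     # True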
  • 141. Source: Cloudera How many Stages will this DAG require?
  • 142. Source: Cloudera How many Stages will this DAG require?
  • 144. Ex Ex Ex x = 5 T T x = 5 x = 5 x = 5 x = 5 T T
  • 145. • Broadcast variables – Send a large read-only lookup table to all the nodes, or send a large feature vector in a ML algorithm to all nodes • Accumulators – count events that occur during job execution for debugging purposes. Example: How many lines of the input file were blank? Or how many corrupt records were in the input dataset? ++ +
  • 146. Spark supports 2 types of shared variables: • Broadcast variables – allows your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. Like sending a large, read-only lookup table to all the nodes. • Accumulators – allows you to aggregate values from worker nodes back to the driver program. Can be used to count the # of errors seen in an RDD of lines spread across 100s of nodes. Only the driver can access the value of an accumulator, tasks cannot. For tasks, accumulators are write-only. ++ +
  • 147. Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks (for example, to give every node a copy of a large input dataset efficiently). Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
  • 148. val broadcastVar = sc.broadcast(Array(1, 2, 3)) broadcastVar.value broadcastVar = sc.broadcast(list(range(1, 4))) broadcastVar.value Scala: Python:
  • 150. Link
  • 152. Ex Ex Ex 20 MB file Uses bittorrent . . .4 MB 4 MB 4 MB 4 MB
  • 155. Accumulators are variables that can only be “added” to through an associative operation Used to implement counters and sums, efficiently in parallel Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend for new types Only the driver program can read an accumulator’s value, not the tasks ++ +
  • 156. val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) accum.value accum = sc.accumulator(0) rdd = sc.parallelize([1, 2, 3, 4]) def f(x): global accum accum += x rdd.foreach(f) accum.value Scala: Python: ++ +
  • 158. PySpark at a Glance Write Spark jobs in Python Run interactive jobs in the shell Supports C extensions
  • 159. Spark Core Engine (Scala) Standalone Scheduler YARN MesosLocal Java API PySpark 41 files 8,100 loc 6,300 comments
  • 160. Spark Context Controller Spark Context Py4j Socket Local Disk Pipe Driver JVM Executor JVM Executor JVM Pipe Worker MachineDriver Machine F(x) F(x) F(x) F(x) F(x) RDD RDD RDD RDD MLlib, SQL, shuffle MLlib, SQL, shuffle daemon.py daemon.py
  • 161. Data is stored as pickled objects in an RDD[Array[Byte]]: HadoopRDD -> MappedRDD -> PythonRDD (each pickled object is roughly 100 KB – 1 MB)
  • 162. pypy • JIT, so faster • less memory • CFFI support CPython (default python) Choose Your Python Implementation Spark Context Driver Machine Worker Machine $ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/pyspark $ PYSPARK_DRIVER_PYTHON=pypy PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py OR
  • 163. The performance speedup will depend on the workload (from 20% to 3000%). Here are some benchmarks:
  Job | CPython 2.7 | PyPy 2.3.1 | Speedup
  Word Count | 41 s | 15 s | 2.7x
  Sort | 46 s | 44 s | 1.05x
  Stats | 174 s | 3.6 s | 48x
  Here is the code used for the benchmark: rdd = sc.textFile("text") def wordcount(): rdd.flatMap(lambda x: x.split('/')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collectAsMap() def sort(): rdd.sortBy(lambda x: x, 1).count() def stats(): sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats() https://github.com/apache/spark/pull/2144
  • 166. Spark sorted the same data 3X faster using 10X fewer machines than Hadoop MR in 2013. Work by Databricks engineers: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia 100TB Daytona Sort Competition 2014 More info: http://sortbenchmark.org http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html All the sorting took place on disk (HDFS) without using Spark’s in-memory cache!
  • 168. - Stresses “shuffle” which underpins everything from SQL to MLlib - Sorting is challenging b/c there is no reduction in data - Sort 100 TB = 500 TB disk I/O and 200 TB network Engineering Investment in Spark: - Sort-based shuffle (SPARK-2045) - Netty native network transport (SPARK-2468) - External shuffle service (SPARK-3796) Clever Application level Techniques: - GC and cache friendly memory layout - Pipelining
  • 169. Ex RDD W RDD T T EC2: i2.8xlarge (206 workers) - Intel Xeon CPU E5 2670 @ 2.5 GHz w/ 32 cores - 244 GB of RAM - 8 x 800 GB SSD in a RAID 0 setup formatted with ext4 - ~9.5 Gbps (1.1 GBps) bandwidth between 2 random nodes - Each record: 100 bytes (10 byte key & 90 byte value) - OpenJDK 1.7 - HDFS 2.4.1 w/ short circuit local reads enabled - Apache Spark 1.2.0 - Speculative Execution off - Increased Locality Wait to infinite - Compression turned off for input, output & network - Used Unsafe to put all the data off-heap and managed it manually (i.e. never triggered the GC) - 32 slots per machine - 6,592 slots total
  • 171. spark.shuffle.spill=false (Affects reducer side and keeps all the data in memory)
  • 172. - Must turn this on for dynamic allocation in YARN - Worker JVM serves files - Node Manager serves files
  • 173. - Was slow because it had to copy the data 3 times Map output file on local dir Linux kernel buffer Ex NIC buffer
  • 174. - Uses a technique called zero-copy - Is a map-side optimization to serve data very quickly to requesting reducers Map output file on local dir NIC buffer
  • 175. Map() Map() Map() Map() Reduce() Reduce() Reduce() - Entirely bounded by I/O reading from HDFS and writing out locally sorted files - Mostly network bound < 10,000 reducers - Notice that map has to keep 3 file handles open TimSort = 5 blocks
  • 176. Map() Map() Map() Map() (28,000 unique blocks) RF = 2 250,000+ reducers! - Only one file handle open at a time = 3.6 GB
  • 177. Map() Map() Map() Map() - 5 waves of maps - 5 waves of reduces Reduce() Reduce() Reduce() RF = 2 250,000+ reducers! MergeSort! TimSort (28,000 unique blocks) RF = 2
  • 178. - Actual final run - Fully saturated the 10 Gbit link
  • 179. Link
  • 180. UserID | Name | Age | Location | Pet
  28492942 | John Galt | 32 | New York | Sea Horse
  95829324 | Winston Smith | 41 | Oceania | Ant
  92871761 | Tom Sawyer | 17 | Mississippi | Raccoon
  37584932 | Carlos Hinojosa | 33 | Orlando | Cat
  73648274 | Luis Rodriguez | 34 | Orlando | Dogs
  • 182. SchemaRDD - RDD of Row objects, each representing a record - Row objects know the type and column name of each field - Stores data very efficiently by taking advantage of the schema - SchemaRDDs are also regular RDDs, so you can run transformations like map() or filter() - Allows new operations, like running SQL on objects
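  A sketch against the Spark 1.x-era API this deck uses (SQLContext / SchemaRDD); the rows are made up and a SparkContext named sc is assumed:
  from pyspark.sql import SQLContext, Row
  sqlContext = SQLContext(sc)
  rows = sc.parallelize([Row(name="John Galt", age=32), Row(name="Tom Sawyer", age=17)])
  people = sqlContext.inferSchema(rows)    # SchemaRDD: Row objects plus a schema
  people.registerTempTable("people")
  teens = sqlContext.sql("SELECT name FROM people WHERE age < 20")
  print([r.name for r in teens.collect()])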
  • 188. Link
  • 190. Kafka Flume HDFS S3 Kinesis Twitter TCP socket HDFS Cassandra Dashboards Databases - Scalable - High-throughput - Fault-tolerant Complex algorithms can be expressed using: - Spark transformations: map(), reduce(), join(), etc - MLlib + GraphX - SQL
  • 192. Tathagata Das (TD) - Lead developer of Spark Streaming + Committer on Apache Spark core - Helped re-write Spark Core internals in 2012 to make it 10x faster to support Streaming use cases - On leave from UC Berkeley PhD program - Ex: Intern @ Amazon, Intern @ Conviva, Research Assistant @ Microsoft Research India - 1 guy; does not scale - Scales to 100s of nodes - Batch sizes as small as half a second - Processing latency as low as 1 second - Exactly-once semantics no matter what fails
  • 193. Page views Kafka for buffering Spark for processing (live statistics)
  • 194. Smart meter readings Live weather data Join 2 live data sources (Anomaly Detection)
  • 195. Input data stream Batches of processed data Batches every X seconds
  • 196. (Discretized Stream) Block #1 RDD @ T=0 Block #2 Block #3 Batch interval = 5 seconds Block #1 RDD @ T=+5 Block #2 Block #3 T = 0 T = +5 Input DStream One RDD is created every 5 seconds
  • 197. Block #1 Block #2 Block #3 Part. #1 Part. #2 Part. #3 Part. #1 Part. #2 Part. #3 5 sec Materialize! linesDStream wordsRDD flatMap() linesRDD linesDStream wordsDStream
  • 198. from pyspark import SparkContext from pyspark.streaming import StreamingContext # Create a local StreamingContext with two working threads and a batch interval of 5 seconds sc = SparkContext("local[2]", "NetworkWordCount") ssc = StreamingContext(sc, 5) # Create a DStream that will connect to hostname:port, like localhost:9999 linesDStream = ssc.socketTextStream("localhost", 9999) # Split each line into words wordsDStream = linesDStream.flatMap(lambda line: line.split(" ")) # Count each word in each batch pairsDStream = wordsDStream.map(lambda word: (word, 1)) wordCountsDStream = pairsDStream.reduceByKey(lambda x, y: x + y) # Print the first ten elements of each RDD generated in this DStream to the console wordCountsDStream.pprint() ssc.start() # Start the computation ssc.awaitTermination() # Wait for the computation to terminate linesDStream wordsDStream pairsDStream wordCountsDStream
  • 199. Terminal #1 Terminal #2 $ nc -lk 9999 hello world $ ./network_wordcount.py localhost 9999 . . . -------------------------- Time: 2015-04-25 15:25:21 -------------------------- (hello, 2) (world, 1)
  • 200. Ex RDD, P1 W Driver RDD, P2 block, P1 T Internal Threads SSD SSDOS Disk T T T T Ex RDD, P3 W RDD, P4 block, P1 T Internal Threads SSD SSDOS Disk T T T T T Batch interval = 600 ms R
  • 201. Ex RDD, P1 W Driver RDD, P2 block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P3 W RDD, P4 block, P1 T Internal Threads SSD SSDOS Disk T T T T T 200 ms later Ex W block, P2 T Internal Threads SSD SSDOS Disk T T T T T block, P2 Batch interval = 600 ms
  • 202. Ex RDD, P1 W Driver RDD, P2 block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 block, P1 T Internal Threads SSD SSDOS Disk T T T T T 200 ms later Ex W block, P2 T Internal Threads SSD SSDOS Disk T T T T T block, P2 Batch interval = 600 ms block, P3 block, P3
  • 203. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P2 T Internal Threads SSD SSDOS Disk T T T T T RDD, P2 Batch interval = 600 ms RDD, P3 RDD, P3
  • 204. Ex RDD, P1 W Driver RDD, P2 RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex RDD, P1 W RDD, P2 RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P2 T Internal Threads SSD SSDOS Disk T T T T T RDD, P2 Batch interval = 600 ms RDD, P3 RDD, P3
  • 206. Ex W Driver block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T T Batch interval = 600 ms Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T R block, P1 2 input DStreams
  • 207. Ex W Driver block, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W block, P1 T Internal Threads SSD SSDOS Disk T T T T R block, P1 Batch interval = 600 ms block, P2 block, P3 block, P2 block, P3 block, P2 block, P3 block, P2 block, P3
  • 208. Ex W Driver RDD, P1 T R Internal Threads SSD SSDOS Disk T T T T Ex W RDD, P1 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P1 T Internal Threads SSD SSDOS Disk T T T T R RDD, P1 Batch interval = 600 ms RDD, P2 RDD, P3 RDD, P2 RDD, P3 RDD, P2 RDD, P3 RDD, P2 RDD, P3 Materialize!
  • 209. Ex W Driver RDD, P3 T R Internal Threads SSD SSDOS Disk T T T T Ex W RDD, P4 T Internal Threads SSD SSDOS Disk T T T T T Ex W RDD, P3 T Internal Threads SSD SSDOS Disk T T T T R RDD, P6 Batch interval = 600 ms RDD, P4 RDD, P5 RDD, P2 RDD, P2 RDD, P5 RDD, P1 RDD, P1 RDD, P6 Union!
  • 210. Sources directly available in the StreamingContext API: - File systems - Socket Connections - Akka Actors Requires linking against extra dependencies: - Kafka - Flume - Twitter Anywhere else: requires implementing a user-defined receiver
  • 213. map( ) flatMap( ) filter( ) repartition(numPartitions) union(otherStream) count() reduce( ) countByValue() reduceByKey( ,[numTasks]) join(otherStream,[numTasks]) cogroup(otherStream,[numTasks]) transform( ) RDD RDD updateStateByKey( )*
  • 214. updateStateByKey( ) To use: 1) Define the state (an arbitrary data type) 2) Define the state update function (specify with a function how to update the state using the previous state and new values from the input stream) : allows you to maintain arbitrary state while continuously updating it with new information. def updateFunction(newValues, runningCount): if runningCount is None: runningCount = 0 return sum(newValues, runningCount) # add the # new values with the previous running count # to get the new count To maintain a running count of each word seen in a text data stream (here running count is an integer type of state): runningCounts = pairs.updateStateByKey(updateFunction) pairs = (word, 1) (cat, 1) * * Requires a checkpoint directory to be configured
  • 215. For example: - Functionality to join every batch in a data stream with another dataset is not directly exposed in the DStream API. - If you want to do real-time data cleaning by joining the input data stream with pre-computed spam information and then filtering based on it. : can be used to apply any RDD operation that is not exposed in the DStream API. spamInfoRDD = sc.pickleFile(...) # RDD containing spam information # join data stream with spam information to do data cleaning cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...)) transform( ) RDD RDD or MLlib GraphX
  • 216. Original DStream Batch 4 Batch 5 Batch 6 Windowed DStream RDD1 RDD 2 Batch 3 RDD 1 Part. 2 Part. 3 time 1 time 2 time 3 time 4 time 5 time 6 Part. 4 Part. 5 RDD @ 3 RDD @ 5 Window Length: 3 time units Sliding Interval: 2 time units * * *Both of these must be multiples of the batch interval of the source DStream # Reduce last 30 seconds of data, every 10 seconds windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
  • 217. window(windowLength, slideInterval) countByWindow(windowLength, slideInterval) reduceByWindow( , windowLength, slideInterval) reduceByKeyAndWindow( , windowLength, slideInterval,[numTasks]) reduceByKeyAndWindow( , , windowLength, slideInterval,[numTasks]) countByValueAndWindow(windowLength, slideInterval, [numTasks]) - DStream - PairDStreamFunctions - JavaDStream - JavaPairDStream - DStream API Docs