Introduction to Spark (Intern Event Presentation)


Databricks

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.

What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

Community Growth
[Figure: contributors per month to Spark, 2010 to 2015 (y-axis: contributors, 0 to 160)]
Most active open source project in big data

Spark Programming Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
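The three RDD properties above can be illustrated with a toy, single-process model. This is only a sketch of the idea, not Spark's implementation: `ToyRDD`, `drop_cache`, and `load` are hypothetical names invented for the illustration. It shows that transformations only record lineage, that an action triggers the computation, and that a lost cached copy can be rebuilt by replaying that lineage.

```python
# Toy, single-process model of the RDD idea; NOT Spark's implementation.
class ToyRDD:
    def __init__(self, source, parent=None, op=None):
        self._source = source      # zero-argument function yielding base data
        self._parent = parent      # lineage: the dataset this one derives from
        self._op = op              # function applied to the parent's iterator
        self._cached = None        # materialized data, once cached
        self._want_cache = False

    def map(self, f):
        return ToyRDD(None, self, lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyRDD(None, self, lambda it: (x for x in it if pred(x)))

    def cache(self):
        self._want_cache = True
        return self

    def drop_cache(self):
        """Simulate losing the in-memory copy (for example, a worker failure)."""
        self._cached = None

    def _compute(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._parent is None:
            data = self._source()
        else:
            data = self._op(self._parent._compute())
        if self._want_cache:
            self._cached = list(data)
            return iter(self._cached)
        return data

    def collect(self):
        return list(self._compute())


reads = []  # one entry per actual read of the base data

def load():
    reads.append(1)
    return iter(range(10))

base = ToyRDD(load)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

assert reads == []                 # nothing has run yet: transformations are lazy
first = evens_squared.collect()    # action: triggers the computation
second = evens_squared.collect()   # served from the cache, no re-read
reads_after_cache = len(reads)
evens_squared.drop_cache()         # lose the cached copy
rebuilt = evens_squared.collect()  # rebuilt by replaying the lineage
```

In this sketch the base data is read once for the first two actions and a second time only after the cached copy is dropped.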

Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the Driver sends tasks to three Workers and receives results; each Worker holds one data block (Block 1 to 3) and one cache (Cache 1 to 3). Legend: Base RDD, Transformed RDD, Action.]
Result: full-text search of Wikipedia in
0.5 sec (vs 20s for on-disk data)
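To make the field handling in the log-mining example concrete, here is the same filter, map, and count logic on a small in-memory list in plain Python. The log layout (tab-separated level, timestamp, message) and the sample lines are assumptions made for illustration; the slides do not show the actual log format.

```python
# Hypothetical sample data: tab-separated "level<TAB>timestamp<TAB>message".
log_lines = [
    "ERROR\t2015-06-01 10:00:01\tMySQL connection refused",
    "INFO\t2015-06-01 10:00:02\tRequest served",
    "ERROR\t2015-06-01 10:00:03\tRedis timeout",
    "ERROR\t2015-06-01 10:00:04\tMySQL deadlock detected",
]

# Same steps as the slide, applied to a local list instead of an RDD.
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]  # third field = message text

# Each query reuses the already-extracted messages, as the cached RDD would.
mysql_count = sum(1 for m in messages if "MySQL" in m)
redis_count = sum(1 for m in messages if "Redis" in m)
print(mysql_count, redis_count)
```

Under the assumed layout, index 2 after splitting on the tab character selects the message text, which is why the separator must be `'\t'` rather than the letter `t`.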

Example: Logistic Regression
Iterative algorithm used in machine learning
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
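A minimal single-machine sketch of why this workload benefits from keeping data in memory: every gradient-descent iteration scans the same dataset again. The dataset, learning rate, and function names below are hypothetical, and the code is plain Python rather than Spark.

```python
import math

# Hypothetical toy dataset: (features, label) with labels in {0, 1}.
points = [
    ((1.0, 2.0), 1), ((2.0, 3.0), 1), ((3.0, 3.5), 1),
    ((-1.0, -2.0), 0), ((-2.0, -1.5), 0), ((-3.0, -3.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, x, y):
    """Gradient of the logistic loss for one example."""
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(pred - y) * xi for xi in x]

def train(data, iterations=30, learning_rate=0.5):
    w = [0.0, 0.0]
    for _ in range(iterations):
        total = [0.0, 0.0]
        for x, y in data:  # every iteration re-reads the whole dataset
            g = gradient(w, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - learning_rate * t / len(data) for wi, t in zip(w, total)]
    return w

def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

weights = train(points)
predictions = [predict(weights, x) for x, _ in points]
```

With the figures on the slide, 30 iterations at 110 s each take about 30 × 110 = 3300 s, whereas an 80 s first iteration followed by 29 iterations of 1 s takes about 80 + 29 = 109 s.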

On-Disk Performance
Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
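The two records can be compared with a little arithmetic. The sketch below takes the slide's numbers at face value, including the 100 TB size for both runs; "machine-minutes" is a rough resource measure introduced here for illustration and is not part of the benchmark.

```python
DATA_TB = 100  # dataset size as stated on the slide

records = {
    "Hadoop (2013)": {"machines": 2100, "minutes": 72},
    "Spark (2014)": {"machines": 207, "minutes": 23},
}

for name, r in records.items():
    r["tb_per_minute"] = DATA_TB / r["minutes"]
    r["machine_minutes"] = r["machines"] * r["minutes"]
    print(f"{name}: {r['tb_per_minute']:.2f} TB/min, "
          f"{r['machine_minutes']} machine-minutes")

hadoop, spark = records["Hadoop (2013)"], records["Spark (2014)"]
speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]
resource_ratio = hadoop["machine_minutes"] / spark["machine_minutes"]
print(f"{speedup:.1f}x faster on {machine_ratio:.1f}x fewer machines "
      f"({resource_ratio:.1f}x fewer machine-minutes)")
```

By these numbers the 2014 run finished in roughly a third of the time while using about a tenth of the machines.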

Higher-Level Libraries
Built on the Spark core engine:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

Higher-Level Libraries
# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
(sc.twitterStream(...)
   .map(lambda t: (model.predict(t.location), 1))
   .reduceByWindow("5s", lambda a, b: a + b))
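The last step of the snippet above assigns each streamed point to a cluster and aggregates counts over a 5-second window. The plain-Python sketch below illustrates that idea under several assumptions: the cluster centers stand in for a trained model, the event list stands in for a stream, the windows are non-overlapping, and the intent is taken to be one count per cluster per window. None of these names or values come from the slides.

```python
from collections import Counter

# Hypothetical cluster centers standing in for a trained k-means model.
centers = [(37.8, -122.4), (40.7, -74.0)]

def predict(location):
    """Return the index of the nearest center (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (location[0] - centers[i][0]) ** 2
                      + (location[1] - centers[i][1]) ** 2,
    )

# Hypothetical stream: (timestamp in seconds, (latitude, longitude)).
events = [
    (0.5, (37.7, -122.5)),
    (1.2, (40.8, -73.9)),
    (4.9, (37.9, -122.3)),
    (5.1, (40.6, -74.1)),
    (9.0, (40.7, -74.0)),
]

WINDOW_SECONDS = 5

def counts_per_window(stream):
    """Group events into 5-second windows and count hits per cluster."""
    windows = {}
    for timestamp, location in stream:
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        windows.setdefault(window_start, Counter())[predict(location)] += 1
    return windows

result = counts_per_window(events)
```

Here `result` maps each window's start time to a per-cluster count, which mirrors the `(cluster, 1)` pairs being summed in the slide's snippet.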

Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org


Ongoing Work
Speeding up Spark through code generation and
binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)

Similar to Introduction to Spark (Intern Event Presentation)

Unified Big Data Processing with Apache Spark

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF. Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com. Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.

Unified Big Data Processing with Apache Spark (QCON 2014)

Databricks

This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.

End-to-end Data Pipeline with Apache Spark

Databricks

This document discusses Apache Spark, a fast and general cluster computing system. It summarizes Spark's capabilities for machine learning workflows, including feature preparation, model training, evaluation, and production use. It also outlines new high-level APIs for data science in Spark, including DataFrames, machine learning pipelines, and an R interface, with the goal of making Spark more similar to single-machine libraries like SciKit-Learn. These new APIs are designed to make Spark easier to use for machine learning and interactive data analysis.

20170126 big data processing

Vienna Data Science Group

Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed: • What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them? • When to use batch and when stream processing? • What is a Lambda-Architecture and a Kappa Architecture? • What are the best practices for your project?

Simplifying Big Data Analytics with Apache Spark

Databricks

Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.

Artigo 81 - spark_tutorial.pdf

WalmirCouto3

This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.

Apache Spark Overview @ ferret

Andrii Gakhov

Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 and is now supported by Databricks. Spark provides APIs in Java, Scala, Python and can run on Hadoop, Mesos, standalone or in the cloud. It provides high-level APIs like Spark SQL, MLlib, GraphX and Spark Streaming for structured data processing, machine learning, graph analytics and stream processing.

Paris Data Geek - Spark Streaming

Djamel Zouaoui

This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.

Jump Start with Apache Spark 2.0 on Databricks

Databricks

Apache Spark 2.0 has laid the foundation for many new features and functionality. Its three main themes (easier, faster, and smarter) are pervasive in its unified and simplified high-level APIs for structured data. In this introductory part-lecture, part-hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
• What’s new in Spark 2.0
• SparkSessions vs SparkContexts
• Datasets/DataFrames and Spark SQL
• Introduction to Structured Streaming concepts and APIs

20130912 YTC_Reynold Xin_Spark and Shark

YahooTechConference

In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications. These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data-intensive applications such as ETL, interactive SQL, and machine learning.

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

DataStax Academy

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

An introduction To Apache Spark

Amir Sedighi

Apache Spark RDDs

Dean Chen

Spark is a general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs), which support in-memory caching and fault tolerance and act like familiar Scala collections for distributed computation across clusters. RDDs provide a programming model with transformations such as map and reduce, and actions to compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.

Apache Spark & Hadoop

MapR Technologies

http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It has disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the open source components pushed Hadoop to where it is now. That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. Keys Botzum is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large-scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.

Apache spark-melbourne-april-2015-meetup

Ned Shawa

This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.

In Memory Analytics with Apache Spark

Venkata Naga Ravi

Spark Study Notes

Richard Kuo

This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.

Azure Databricks is Easier Than You Think

Ike Ellis

Spark is a fast and general engine for large-scale data processing. It supports Scala, Python, Java, SQL, R and more. Spark applications can access data from many sources and perform tasks like ETL, machine learning, and SQL queries. Azure Databricks provides a managed Spark service on Azure that makes it easier to set up clusters and share notebooks across teams for data analysis. Databricks also integrates with many Azure services for storage and data integration.

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

IT Event

In this talk we’ll explore Apache Spark, the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore the Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general. This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience, as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...

Inhacking

More from Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized Platform

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.

Why APM Is Not the Same As ML Monitoring

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling builds on Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster DataFrame access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing on Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
• Understanding key traits of Apache Spark on Kubernetes
• Things to know when running Apache Spark on Kubernetes, such as autoscaling
• Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions together to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Sawtooth Windows for Feature Aggregations

Databricks

In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

We want to present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
• Why? Custom queries on top of a table; we load the data once and query N times
• Why not Structured Streaming
• Working Solution using Redis
Niche 2: Distributed Counters
• Problems with Spark Accumulators
• Utilize Redis Hashes as distributed counters
• Precautions for retries and speculative execution
• Pipelining to improve performance
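One way to read the "precautions for retries and speculative execution" point is that a counter update should be idempotent per task. The sketch below is an assumption about the general technique, not the approach presented in the talk. `FakeRedisHash` is a hypothetical in-memory stand-in whose methods mirror the Redis `HSET`, `HINCRBY`, and `HVALS` commands, and the partitions and attempt order are made-up data.

```python
# Hypothetical in-memory stand-in for a Redis hash, so the sketch runs without a server.
# Method names mirror the Redis HSET / HINCRBY / HVALS commands.
class FakeRedisHash:
    def __init__(self):
        self._fields = {}

    def hset(self, field, value):
        self._fields[field] = value

    def hincrby(self, field, amount):
        self._fields[field] = self._fields.get(field, 0) + amount

    def hvals(self):
        return list(self._fields.values())


def count_errors(partition):
    # Per-partition partial count, as a task would compute it.
    return sum(1 for record in partition if "ERROR" in record)


# Made-up input: two partitions holding three ERROR records in total.
partitions = {
    0: ["ERROR a", "INFO b"],
    1: ["ERROR c", "ERROR d"],
}
# Partition 1 runs twice, as it might after a retry or a speculative duplicate task.
attempts = [0, 1, 1]

naive = FakeRedisHash()
idempotent = FakeRedisHash()
for pid in attempts:
    n = count_errors(partitions[pid])
    naive.hincrby("errors", n)                 # every attempt adds again: double counting
    idempotent.hset(f"partition:{pid}", n)     # a repeated attempt overwrites the same field

naive_total = naive.hvals()[0]
safe_total = sum(idempotent.hvals())
```

Here the increment-based counter reports 5 while the per-partition fields sum to the true value of 3. The trade-off is one hash field per partition and a final sum on read.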

Re-imagine Data Monitoring with whylogs and Spark

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large-scale data profiling, and we will show how users can apply this integration to existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is the complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
• What are we storing?
• Multi Source – Multi Channel Problem
• Data Representation and Nested Schema Evolution
• Performance Trade Offs with Various formats
• Go over anti-patterns used (String FTW)
• Data Manipulation using UDFs
• Writer Worries and How to Wipe them Away
• Staging Tables FTW
• Datalake Replication Lag Tracking
• Performance Time!


@Call @Girls in Kolkata 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Kolkata Avaulable

DiyaSharma6551

Software development... for all? (keynote at ICSOFT'2024)

miso_uam

Our world runs on software. It governs all major aspects of our life. It is an enabler for research and innovation, and is critical for business competitivity. Traditional software engineering techniques have achieved high effectiveness, but still may fall short on delivering software at the accelerated pace and with the increasing quality that future scenarios will require. To attack this issue, some software paradigms raise the automation of software development via higher levels of abstraction through domain-specific languages (e.g., in model-driven engineering) and empowering non-professional developers with the possibility to build their own software (e.g., in low-code development approaches). In a software-demanding world, this is an attractive possibility, and perhaps -- paraphrasing Andy Warhol -- "in the future, everyone will be a developer for 15 minutes". However, to make this possible, methods are required to tweak languages to their context of use (crucial given the diversity of backgrounds and purposes), and the assistance to developers throughout the development process (especially critical for non-professionals). In this keynote talk at ICSOFT'2024 I presented enabling techniques for this vision, supporting the creation of families of domain-specific languages, their adaptation to the usage context; and the augmentation of low-code environments with assistants and recommender systems to guide developers (professional or not) in the development process.

dachnug51 - HCL Domino Roadmap .pdf

DNUG e.V.

@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...

JoyaBansal

ANSYS Mechanical APDL Introductory Tutorials.pdf

sachin chaurasia

Ansys Mechanical enables you to solve complex structural engineering problems and make better, faster design decisions. With the finite element analysis (FEA) solvers available in the suite, you can customize and automate solutions for your structural mechanics problems and parameterize them to analyze multiple design scenarios. Ansys Mechanical is a dynamic tool that has a complete range of analysis tools.

Migrate your Infrastructure to the AWS Cloud

Ortus Solutions, Corp

Are you wondering how to migrate to the Cloud? At the ITB session, we addressed the challenge of managing multiple ColdFusion licenses and AWS EC2 instances. Discover how you can consolidate with just one EC2 instance capable of running over 50 apps using CommandBox ColdFusion. This solution supports both ColdFusion flavors and includes cb-websites, a GoLang binary for managing CommandBox websites.

Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet

lovelykumarilk789

ENISA Threat Landscape 2023 documentation

sofiafernandezon

@Call @Girls in Ahmedabad 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Ahmedabad Ava...

DiyaSharma6551

@Call @Girls in Saharanpur 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas...

AlinaDevecerski

@Call @Girls in Solapur 🤷‍♂️ XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class S...

Mona Rathore

₹Call ₹Girls Andheri West 09967584737 Deshi Chori Near You

shristi verma

一比一原版英国牛津大学毕业证（oxon毕业证书）如何办理

avufu

原版一模一样【微信：741003700 】【英国牛津大学毕业证（oxon毕业证书）成绩单】【微信：741003700 】学位证，留信学历认证（真实可查，永久存档）原件一模一样纸张工艺/offer、在读证明、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才办理英国牛津大学毕业证（oxon毕业证书）【微信：741003700 】外观非常简单，由纸质材料制成，上面印有校徽、校名、毕业生姓名、专业等信息。办理英国牛津大学毕业证（oxon毕业证书）【微信：741003700 】格式相对统一，各专业都有相应的模板。通常包括以下部分：校徽：象征着学校的荣誉和传承。校名:学校英文全称授予学位：本部分将注明获得的具体学位名称。毕业生姓名：这是最重要的信息之一，标志着该证书是由特定人员获得的。颁发日期：这是毕业正式生效的时间，也代表着毕业生学业的结束。其他信息：根据不同的专业和学位，可能会有一些特定的信息或章节。办理英国牛津大学毕业证（oxon毕业证书）【微信：741003700 】价值很高，需要妥善保管。一般来说，应放置在安全、干燥、防潮的地方，避免长时间暴露在阳光下。如需使用，最好使用复印件而不是原件，以免丢失。综上所述，办理英国牛津大学毕业证（oxon毕业证书）【微信：741003700 】是证明身份和学历的高价值文件。外观简单庄重，格式统一，包括重要的个人信息和发布日期。对持有人来说，妥善保管是非常重要的。

Kolkata @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet

lovelykumarilk789

Ghatkopar @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available

aviva54

Top 10 Tips To Get Google AdSense For Your Website

e-Definers Technology

Folding Cheat Sheet #7 - seventh in a series

Philip Schwarz

Recently uploaded (20)

Dombivli @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available

Abortion pills in Fujairah *((+971588192166*)☎️)¥) **Effective Abortion Pills...

Seamless PostgreSQL to Snowflake Data Transfer in 8 Simple Steps

@Call @Girls in Kolkata 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Kolkata Avaulable

Software development... for all? (keynote at ICSOFT'2024)

dachnug51 - HCL Domino Roadmap .pdf

@Call @Girls in Tirunelveli 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Cla...

ANSYS Mechanical APDL Introductory Tutorials.pdf

Migrate your Infrastructure to the AWS Cloud

Chennai @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet

ENISA Threat Landscape 2023 documentation

@Call @Girls in Ahmedabad 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Best High Class Ahmedabad Ava...

@Call @Girls in Saharanpur 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Tanisha Sharma Best High Clas...

@Call @Girls in Solapur 🤷‍♂️ XXXXXXXX 🤷‍♂️ Tanisha Sharma Best High Class S...

₹Call ₹Girls Andheri West 09967584737 Deshi Chori Near You

一比一原版英国牛津大学毕业证（oxon毕业证书）如何办理

Kolkata @Call @Girls 🐱‍🐉 XXXXXXXXXX 🐱‍🐉 Genuine WhatsApp Number for Real Meet

Ghatkopar @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai Available

Top 10 Tips To Get Google AdSense For Your Website

Folding Cheat Sheet #7 - seventh in a series

Introduction to Spark (Intern Event Presentation)

1. Introduction to Spark Matei Zaharia Databricks Intern Event, August 2015

2. What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

3. About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

4. Community Growth
[Chart: contributors per month to Spark, 2010 to 2015]
Most active open source project in big data

5. Spark Programming Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
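The three RDD properties on this slide can be illustrated with a toy, single-machine sketch. Everything below (the `ToyRDD` class, its methods, and the sample data) is hypothetical and is not Spark's implementation or API; it only shows how a dataset that remembers its parent and its transformation can recompute a lost partition.

```python
# Toy illustration only: this is NOT Spark's implementation or API.
class ToyRDD:
    """A dataset described by its lineage (parent + transformation), computed lazily."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # list of partitions, set only for a base dataset
        self._parent = parent        # the dataset this one was derived from
        self._transform = transform  # function applied to each parent partition
        self._cache = {}             # partition index -> computed list

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._transform(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a failure that drops a cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_partition(1)    # simulated failure: partition 1 is gone
rebuilt = evens_squared.collect()  # partition 1 is recomputed from its lineage
print(first, rebuilt)
```

Nothing is computed until `collect()` is called, and losing a cached partition costs only a recomputation of that partition, not of the whole dataset.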

6. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "MySQL" in s).count()
    messages.filter(lambda s: "Redis" in s).count()
    . . .

[Diagram: Driver sending tasks to and receiving results from three Workers, each with a data block (Block 1 to 3) and a cache (Cache 1 to 3); labels: Base RDD, Transformed RDD, Action]
Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
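To see what each step of the log-mining pipeline produces, here is a plain-Python analogue that runs without a cluster. The sample log lines and their layout (level, timestamp, message, separated by tabs) are assumptions made for illustration; list comprehensions stand in for the Spark `filter` and `map` calls, and a list held in memory stands in for `cache()`.

```python
# Hypothetical sample log lines: level, timestamp and message separated by tabs.
lines = [
    "ERROR\t2015-08-01 10:00:00\tMySQL connection refused",
    "INFO\t2015-08-01 10:00:01\tRequest served",
    "ERROR\t2015-08-01 10:00:02\tRedis timeout",
    "ERROR\t2015-08-01 10:00:03\tMySQL deadlock detected",
]

# Keep only error lines (the role of lines.filter(...) on the slide).
errors = [s for s in lines if s.startswith("ERROR")]

# Extract the third tab-separated field (the role of errors.map(...)).
# Keeping the result as a list in memory plays the role of messages.cache().
messages = [s.split("\t")[2] for s in errors]

# Repeated interactive queries reuse the in-memory messages.
mysql_count = sum(1 for s in messages if "MySQL" in s)
redis_count = sum(1 for s in messages if "Redis" in s)
print(mysql_count, redis_count)
```

The point of caching is visible even here: the filtering and parsing happen once, and each later query scans only the already-extracted messages.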

7. Example: Logistic Regression
Iterative algorithm used in machine learning
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
• Hadoop: 110 s / iteration
• Spark: first iteration 80 s, further iterations 1 s
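The per-iteration figures on this slide can be turned into totals with a little arithmetic. The sketch below assumes a simple linear cost model read off the slide labels (a fixed 110 s per Hadoop iteration; 80 s for the first Spark iteration and 1 s for each later one); the function names are hypothetical and the measured chart points may differ from this model.

```python
def hadoop_seconds(iterations):
    """Total time assuming every iteration costs 110 s."""
    return 110 * iterations


def spark_seconds(iterations):
    """Total time assuming 80 s for the first iteration, then 1 s each."""
    if iterations <= 0:
        return 0
    return 80 + (iterations - 1) * 1


# Iteration counts shown on the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, hadoop_seconds(n), spark_seconds(n))
```

Under these assumptions, 30 iterations take 3300 s on Hadoop and 109 s on Spark; the gap grows with the iteration count because only the first Spark iteration pays the cost of loading the data.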

8. On-Disk Performance: Time to sort 100 TB
• 2013 Record: Hadoop, 2100 machines, 72 minutes
• 2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
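A quick way to compare the two sort records is to look at elapsed time, machine count, and their product. The sketch below uses only the numbers on the slide; "machine-minutes" is an illustrative measure chosen here, not one defined by the benchmark, and it ignores hardware differences between the two clusters.

```python
# Figures from the slide (Daytona GraySort, 100 TB).
hadoop = {"machines": 2100, "minutes": 72}
spark = {"machines": 207, "minutes": 23}

time_ratio = hadoop["minutes"] / spark["minutes"]       # how much faster
machine_ratio = hadoop["machines"] / spark["machines"]  # how many fewer machines

# Machines multiplied by elapsed minutes, as a rough resource measure.
hadoop_machine_minutes = hadoop["machines"] * hadoop["minutes"]
spark_machine_minutes = spark["machines"] * spark["minutes"]

print(round(time_ratio, 1), round(machine_ratio, 1))
print(hadoop_machine_minutes, spark_machine_minutes)
```

So the 2014 run was roughly 3 times faster on roughly a tenth of the machines.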

9. Higher-Level Libraries
[Diagram: Spark with its libraries]
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

10. Higher-Level Libraries

    # Load data using SQL
    points = ctx.sql("select latitude, longitude from tweets")

    # Train a machine learning model
    model = KMeans.train(points, 10)

    # Apply it to a stream
    sc.twitterStream(...) \
      .map(lambda t: (model.predict(t.location), 1)) \
      .reduceByWindow("5s", lambda a, b: a + b)
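The slide code is schematic, so here is a plain-Python sketch of one reading of the same idea: assign each incoming record to its nearest cluster, then count assignments per cluster in 5-second windows. The cluster centers, the sample records, and the helper names `predict` and `counts_per_window` are all hypothetical; fixed centers stand in for a trained model, and non-overlapping windows are an assumption about what the slide intends.

```python
from collections import Counter

# Hypothetical cluster centers (latitude, longitude), standing in for a trained model.
centers = [(37.77, -122.42), (40.71, -74.01)]


def predict(location):
    """Return the index of the nearest center by squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: (location[0] - centers[i][0]) ** 2
        + (location[1] - centers[i][1]) ** 2,
    )


# Hypothetical stream of (timestamp in seconds, (latitude, longitude)) records.
tweets = [
    (0.5, (37.80, -122.40)),
    (1.2, (40.70, -74.00)),
    (4.9, (37.70, -122.50)),
    (5.1, (40.75, -73.99)),
    (9.0, (40.60, -74.10)),
]


def counts_per_window(records, window_seconds=5):
    """Count records per predicted cluster in non-overlapping time windows."""
    windows = {}
    for ts, location in records:
        window_start = int(ts // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[predict(location)] += 1
    return windows


result = counts_per_window(tweets)
print(result)
```

The first window (0 to 5 s) holds two records for cluster 0 and one for cluster 1; the second window (5 to 10 s) holds two records for cluster 1.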

11. Demo

12. Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org

14. Ongoing Work
• Speeding up Spark through code generation and binary processing (Project Tungsten)
• R interface to Spark (SparkR)
• Real-time machine learning library
• Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)

15. Thank you. We’re hiring!

Editor's Notes

Add “variables” to the “functions” in functional programming
100 GB of data on 50 m1.xlarge EC2 machines
Alibaba, Tencent
At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics, called the Berkeley Data Analytics Stack. The centerpiece of this stack is Spark. Spark has seen significant adoption, with hundreds of companies using it, around sixteen of which have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.
