DataFrames for Large-scale Data Science
Reynold Xin @rxin
Feb 17, 2015 (Spark User Meetup)
2
Year of the lamb, goat, sheep, and ram …?
A slide from 2013 …
3
From MapReduce to Spark
// Hadoop MapReduce word count (imports from java.util, java.io,
// org.apache.hadoop.io and org.apache.hadoop.mapred elided on the slide)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// The same word count in Spark (Scala):
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Spark’s Growth
5
Google Trends for “Apache Spark”
Beyond Hadoop Users
6
Early adopters: understand MapReduce & functional APIs
New users: data scientists, statisticians, R users, PyData users, …
• Most data is structured (JSON, CSV, Avro, Parquet, Hive …)
–  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)
• Functional transformations (e.g. map/reduce) are not as intuitive
7
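To make the tuple problem concrete, here is a minimal PySpark sketch (mine, not from the deck; it assumes an existing SparkContext sc and SQLContext sqlContext): the same per-department average age, first with positional RDD tuples, then with named DataFrame columns.

# Hypothetical data: (dept, age) pairs.
rdd = sc.parallelize([("eng", 30), ("eng", 40), ("sales", 50)])

# RDD version: positional tuples all the way down.
avg_by_dept = (rdd
    .map(lambda kv: (kv[0], (kv[1], 1)))                   # (dept, (age, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum ages and counts
    .mapValues(lambda s: s[0] / float(s[1])))              # divide, still by position

# DataFrame version: named columns and a declarative aggregation.
df = sqlContext.createDataFrame(rdd, ["dept", "age"])
df.groupBy("dept").avg("age").show()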
8

DataFrames in Spark
• Distributed collection of data grouped into named columns (i.e. RDD with schema)
• Domain-specific functions designed for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
• Available in Python, Scala, Java, and R (via SparkR)
9
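A hedged sketch of those bullets in PySpark (the JSON path, column names, and sqlContext are illustrative; jsonFile is the Spark 1.3-era reader):

df = sqlContext.jsonFile("/path/to/people.json")  # creation; schema inferred

df.printSchema()                        # metadata: column names and types
df.columns                              # e.g. ['age', 'name']
df.sample(False, 0.1)                   # sampling: 10% without replacement
df.select("name").filter(df.age > 21)   # project + filter
df.groupBy("age").count()               # aggregation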
10
[Bar chart: runtime of aggregating 10 million int pairs (secs), comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF on a 0–10 s scale]
Agenda
• Introduction
• Learn by demo
• Design & internals
–  API design
–  Plan optimization
–  Integration with data sources
11
Learn by Demo (in a Databricks Cloud Notebook)
• Creation
• Project
• Filter
• Aggregations
• Join
• SQL
• UDFs
• Pandas
12
For the purpose of distributing the slides online, I’m attaching screenshots of the notebooks (a sketch of these operations follows below).
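Since only screenshots survive online, here is a hedged reconstruction of the operations the demo walked through, in Spark 1.3-era PySpark (the file paths, column names, and sqlContext are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

users  = sqlContext.jsonFile("/demo/users.json")    # creation (schema inferred)
events = sqlContext.jsonFile("/demo/events.json")

users.select("name", "age")                         # project
users.filter(users.age > 21)                        # filter
events.groupBy("uid").count()                       # aggregation
users.join(events, users.id == events.uid)          # join

users.registerTempTable("users")                    # SQL over the same data
sqlContext.sql("SELECT name FROM users WHERE age > 21")

shout = udf(lambda s: s.upper(), StringType())      # UDF
users.select(shout(users.name))

pdf = users.toPandas()                              # hand off to Pandas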

[Slides 13–26: notebook demo screenshots]



Machine Learning Integration
27
# Spark ML pipeline (Python); imports elided on the original slide:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = context.load("/path/to/data")   # context: the demo's SQLContext
model = pipeline.fit(df)
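A hedged usage note: fit returns a fitted PipelineModel, and transform runs the whole chain (tokenize → hash → predict) on new data; test_df below is illustrative.

predictions = model.transform(test_df)   # adds words, features, prediction columns
predictions.select("text", "prediction").show()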
Design Philosophy
Simple tasks easy
- DSL for common operations
- Infer schema automatically (CSV, Parquet, JSON, …)
- MLlib pipeline integration
Performance
- Catalyst optimizer
- Code generation
Complex tasks possible
- RDD API
- Full expression library
Interoperability
- Various data sources and formats
- Pandas, R, Hive …
28

DataFrame Internals
• Represented internally as a “logical plan”
• Execution is lazy, allowing it to be optimized by Catalyst
29
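A small illustration of that laziness, assuming some DataFrame df with an age column: transformations only build up the logical plan, and nothing executes until an action such as collect().

plan = df.filter(df.age > 21).select("name")   # builds a logical plan; no job runs yet
result = plan.collect()                        # optimization and execution happen here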
Plan Optimization & Execution
30
SQL AST / DataFrame
  → Unresolved Logical Plan   (Analysis, using the Catalog)
  → Logical Plan              (Logical Optimization)
  → Optimized Logical Plan    (Physical Planning)
  → Physical Plans            (Cost Model selects one)
  → Selected Physical Plan    (Code Generation)
  → RDDs
DataFrames and SQL share the same optimization/execution pipeline
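One hedged way to see that sharing (the table name people is illustrative): the SQL and DataFrame forms of the same query go through the pipeline above and should show equivalent optimized plans in their explain output.

df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").explain(True)
df.filter(df.age > 21).select("name").explain(True)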
31–32
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

logical plan:
  filter
    join
      scan (users)
      scan (events)

physical plan:
  join
    scan (users)
    filter
      scan (events)

this join is expensive → in the logical plan the filter runs after the join, so the optimizer pushes the filter on events below it
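To observe that rewrite yourself, a hedged sketch: with users and events as above, the optimized plan in the explain output should show the date filter applied beneath the join.

joined = users.join(events, users.id == events.uid)
joined.filter(events.date >= "2015-01-01").explain(True)   # filter appears below the join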

Data Sources supported by DataFrames
33
[Logo grid of built-in and external data sources: { JSON }, JDBC, and more …]
More Than Naïve Scans
• Data Sources API can automatically prune columns and push filters to the source
  – Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
  – JDBC: rewrite queries to push predicates down
34
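A hedged example of both optimizations at once, assuming a Parquet dataset at an illustrative path (parquetFile is the Spark 1.3-era reader; exactly what gets pushed down depends on the source):

df = sqlContext.parquetFile("/data/events.parquet")
# Column pruning: only 'uid' is read from disk.
# Filter pushdown: the date predicate is evaluated inside the Parquet scan.
df.filter(df.date >= "2015-01-01").select("uid").explain(True)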
35
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

logical plan:
  filter
    join
      scan (users)
      scan (events)

optimized plan:
  join
    scan (users)
    filter
      scan (events)

optimized plan with intelligent data sources:
  join
    scan (users)
    scan (events, filter pushed into the source)
DataFrames in Spark
• APIs in Python, Java, Scala, and R (via SparkR)
• For new users: make it easier to program Big Data
• For existing users: make Spark programs simpler & easier to understand, while improving performance
• Experimental API in Spark 1.3 (early March)
36

Our Vision
37
Thank you! Questions?
More Information
Blog post introducing DataFrames:
http://tinyurl.com/spark-dataframes
Build from source:
http://github.com/apache/spark (branch-1.3)


  • 1. DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup)
  • 2. 2 Year of the lamb, goat, sheep, and ram …?
  • 3. A slide from 2013 … 3
  • 4. From MapReduce to Spark public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 5. Spark’s Growth 5 Google Trends for “Apache Spark”
  • 6. Beyond Hadoop Users 6 Early adopters Data Scientists Statisticians R users … PyData Users Understands MapReduce & functional APIs
  • 7. RDD API • Most data is structured (JSON, CSV, Avro, Parquet, Hive …) –  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …) • Functional transformations (e.g. map/reduce) are not as intuitive 7
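To make the tuple problem concrete, here is a hedged PySpark sketch (illustrative data and names, not from the deck): the RDD version addresses fields by anonymous tuple position, while the DataFrame version names its columns.

    rdd = sc.parallelize([("alice", 1), ("bob", 2), ("alice", 3)])
    # RDD style: fields are positional, so readers must remember what each slot means
    totals = rdd.reduceByKey(lambda a, b: a + b)

    # DataFrame style: the same data, but with named columns
    df = sqlContext.createDataFrame(rdd, ["name", "amount"])
    df.groupBy("name").sum("amount")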
  • 8. [Image-only slide]
  • 9. DataFrames in Spark • Distributed collection of data grouped into named columns (i.e. RDD with schema) • Domain-specific functions designed for common tasks –  Metadata –  Sampling –  Project, filter, aggregation, join, … –  UDFs • Available in Python, Scala, Java, and R (via SparkR) 9
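A minimal sketch of the Python API described above, assuming a SQLContext named sqlContext and a hypothetical JSON file (the schema is inferred from the data):

    df = sqlContext.jsonFile("examples/people.json")   # create a DataFrame from JSON
    df.printSchema()                                   # metadata: named columns and types
    young = df.filter(df.age < 21)                     # filter on a column expression
    young.select(young.name, young.age + 1)            # project and derive columns
    young.groupBy("gender").count()                    # aggregation on named columns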
  • 10. [Bar chart] Runtime performance of aggregating 10 million int pairs (secs), comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF
  • 11. Agenda • Introduction • Learn by demo • Design & internals –  API design –  Plan optimization –  Integration with data sources 11
  • 12. Learn by Demo (in a Databricks Cloud Notebook) • Creation • Project • Filter • Aggregations • Join • SQL • UDFs • Pandas 12 For the purpose of distributing the slides online, I’m attaching screenshots of the notebooks.
  • 13–26. [Notebook screenshots: DataFrame creation, projection, filtering, aggregation, joins, SQL, UDFs, and Pandas interoperability]
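Since the screenshots do not reproduce well in text, here is a hedged sketch of the remaining demo topics (Spark 1.3-era Python API; table and column names are illustrative only):

    df.registerTempTable("people")                     # expose the DataFrame to SQL
    sqlContext.sql("SELECT name FROM people WHERE age > 30")

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType
    name_len = udf(lambda s: len(s), IntegerType())    # a Python UDF
    df.select(name_len(df.name).alias("name_len"))

    pandas_df = df.toPandas()                          # collect to a local Pandas DataFrame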
  • 27. Machine Learning Integration 27 tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol="words", outputCol="features") lr = LogisticRegression(maxIter=10, regParam=0.01) pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) df = context.load("/path/to/data") model = pipeline.fit(df)
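A fitted pipeline is itself a transformer, so scoring new data is one more call; a hedged usage sketch (the test-data path is hypothetical):

    test_df = context.load("/path/to/test/data")
    predictions = model.transform(test_df)             # appends a prediction column
    predictions.select("text", "prediction").show()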
  • 28. Design Philosophy Simple tasks easy -  DSL for common operations -  Infer schema automatically (CSV, Parquet, JSON, …) -  MLlib pipeline integration Performance -  Catalyst optimizer -  Code generation Complex tasks possible -  RDD API -  Full expression library Interoperability -  Various data sources and formats -  Pandas, R, Hive … 28
  • 29. DataFrame Internals • Represented internally as a “logical plan” • Execution is lazy, allowing it to be optimized by Catalyst 29
  • 30. Plan Optimization & Execution 30 [Pipeline diagram] SQL AST / DataFrame → Unresolved Logical Plan → (Analysis, consulting the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs. DataFrames and SQL share the same optimization/execution pipeline
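Laziness is observable from the API: building a DataFrame only constructs a logical plan, and explain() prints what Catalyst produced; a minimal sketch (column names are illustrative):

    result = df.select("name", "age").filter(df.age > 21)   # no job runs yet
    result.explain(True)    # show the logical and physical plans
    result.count()          # an action: the selected physical plan executes now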
  • 31. [Image-only slide]
  • 32. 32 joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date >= "2015-01-01") [Plan diagrams] logical plan: scan (users) and scan (events) feed a join, with the filter applied afterwards (this join is expensive); physical plan: the filter is pushed below the join, onto the events scan
  • 33. Data Sources supported by DataFrames 33 { JSON } built-in external JDBC and more …
  • 34. More Than Naïve Scans • Data Sources API can automatically prune columns and push filters to the source –  Parquet: skip irrelevant columns and blocks of data; turn string comparison into integer comparisons for dictionary encoded data –  JDBC: Rewrite queries to push predicates down 34
  • 35. 35 joined = users.join(events, users.id == events.uid) filtered = joined.filter(events.date > "2015-01-01") [Plan diagrams] logical plan: filter above a join of scan (users) and scan (events); optimized plan: filter pushed below the join, above scan (events); optimized plan with intelligent data sources: filter pushed all the way into the events scan itself
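A hedged sketch of what this means in code, using the built-in Parquet source (hypothetical path): only the selected columns are read, and the date predicate can be evaluated against Parquet metadata to skip whole blocks.

    events = sqlContext.parquetFile("/data/events.parquet")
    recent = events.select("uid", "date").filter(events.date > "2015-01-01")
    recent.explain()   # inspect the plan; the filter lands near the scan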
  • 36. DataFrames in Spark • APIs in Python, Java, Scala, and R (via SparkR) • For new users: make it easier to program Big Data • For existing users: make Spark programs simpler & easier to understand, while improving performance • Experimental API in Spark 1.3 (early March) 36
  • 39. More Information Blog post introducing DataFrames: http://tinyurl.com/spark-dataframes Build from source: http://github.com/apache/spark (branch-1.3)