Apache Spark Tutorial
Apache Spark is a data analytics engine. This series of Spark tutorials covers Apache Spark basics and its
libraries: Spark MLlib, GraphX, Streaming and SQL, with detailed explanations and examples.
Following is an overview of the concepts and examples that we shall go through in these Apache Spark
tutorials.
Spark Core
Spark Core is the base framework of Apache Spark. It contains the distributed task dispatcher, the job
scheduler and the basic I/O functionality, and it exposes these components through APIs available in
Java, Python, Scala and R.
Install Spark on Mac OS – Tutorial to install Apache Spark on a computer running Mac OS.
Setup Java Project with Apache Spark – Apache Spark Tutorial to set up a Java project in Eclipse with the Apache Spark
libraries and get started.
Spark Shell is an interactive shell through which we can access Spark’s API. Spark provides the shell in two programming
languages: Scala and Python.
Scala Spark Shell – Tutorial to understand the usage of Scala Spark Shell with Word Count Example.
Python Spark Shell – Tutorial to understand the usage of Python Spark Shell with Word Count Example.
Setup Apache Spark to run in Standalone cluster mode
Example Spark Application using Python to get started with programming Spark Applications (a minimal sketch follows this list).
Configure Apache Spark Ecosystem
Configure Spark Application – Apache Spark Tutorial to learn how to configure a Spark application, such as the number of
driver cores, the Spark master URL, the deployment mode, etc.
Configuring Spark Environment
Configure Logger
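As a taste of what these tutorials cover, here is a minimal PySpark word count application with an explicit configuration. The application name, the master URL "local[2]" and the input file "input.txt" are illustrative assumptions for local experimentation, not values prescribed by the tutorials.

# Minimal PySpark word count with explicit configuration.
# The master URL and input file below are assumptions; adjust for your cluster.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("WordCountExample")   # application name shown in the Web UI
        .setMaster("local[2]"))           # run locally with 2 cores
sc = SparkContext(conf=conf)

counts = (sc.textFile("input.txt")                  # read lines into an RDD
            .flatMap(lambda line: line.split())     # split each line into words
            .map(lambda word: (word, 1))            # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

for word, count in counts.collect():                # action: bring results to the driver
    print(word, count)

sc.stop()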
Spark RDD
Spark is built on the RDD (Resilient Distributed Dataset) abstraction, which gives Spark the ability to
perform parallel data processing on a cluster. We shall go through the following RDD transformations and actions; a short sketch follows the list.
About Spark RDD
Create Spark RDD
Print RDD Elements
Read text file to Spark RDD
Spark – Read multiple text files to a single RDD
Spark – RDD with custom class objects
Spark RDD Map
Spark RDD Reduce
Spark RDD FlatMap
Spark RDD Filter
Spark RDD Distinct
Spark RDD foreach – iterate over each element of the distributed dataset.
Read JSON File to RDD
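The following is a short sketch of the transformations and actions listed above, run on a small in-memory RDD. The sample values are invented purely for illustration.

# A quick tour of common RDD transformations and actions.
from pyspark import SparkContext

sc = SparkContext("local[2]", "RDDExamples")

rdd = sc.parallelize([1, 2, 2, 3, 4, 5])       # create an RDD from a Python list

squared = rdd.map(lambda x: x * x)             # transformation: square each element
evens   = rdd.filter(lambda x: x % 2 == 0)     # transformation: keep even elements
unique  = rdd.distinct()                       # transformation: drop duplicates
pairs   = rdd.flatMap(lambda x: [x, -x])       # transformation: one element -> many
total   = rdd.reduce(lambda a, b: a + b)       # action: aggregate to a single value

print(squared.collect())   # [1, 4, 4, 9, 16, 25]
print(evens.collect())     # [2, 2, 4]
print(unique.collect())    # order may vary
print(pairs.collect())     # [1, -1, 2, -2, ...]
print(total)               # 17
unique.foreach(print)      # action: runs on the executors; output goes to executor logs

sc.stop()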
Spark Dataset
Read JSON File to Spark DataSet
Write Spark DataSet to JSON File
Add new column to Spark DataSet
Concatenate Spark Datasets
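In PySpark, the Dataset API surfaces as the DataFrame (a Dataset of rows), so a minimal sketch of the operations listed above looks as follows. The file name "people.json", the "age" column and the output path are assumptions for illustration.

# Read, transform, concatenate and write a Dataset (DataFrame in PySpark).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DatasetExamples").getOrCreate()

# Read a JSON file (one JSON object per line) into a DataFrame.
df = spark.read.json("people.json")

# Add a new column derived from an existing one ("age" is an assumed column).
df2 = df.withColumn("age_next_year", col("age") + 1)

# Concatenate two datasets with the same schema.
combined = df2.union(df2)

# Write the result back out as JSON.
combined.write.mode("overwrite").json("people_out")

spark.stop()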
Spark MLlib
Classification using Logistic Regression – Apache Spark Tutorial to understand the usage of Logistic Regression in Spark
MLlib.
Classification using Naive Bayes – Apache Spark Tutorial to understand the usage of Naive Bayes Classifier in Spark
MLlib.
Generalized Regression
Survival Regression
Decision Trees – Apache Spark Tutorial to understand the usage of Decision Trees Algorithm in Spark MLlib.
Random Forests – Apache Spark Tutorial to understand the usage of Random Forest algorithm in Spark MLlib.
Gradient Boosted Trees
Recommendation using Alternating Least Squares (ALS)
Clustering using KMeans – Apache Spark Tutorial to understand the usage of the KMeans algorithm in Spark MLlib for
clustering.
Clustering using Gaussian Mixtures
Topic Modelling in Spark using Latent Dirichlet Allocation (LDA)
Frequent Itemsets
Association Rules
Sequential Pattern Mining
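To give a flavour of the MLlib tutorials listed above, here is a minimal logistic regression sketch on a tiny hand-made dataset. The feature values, labels and hyperparameters are invented for illustration only.

# Train and apply a logistic regression model with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()

# A tiny, made-up training set of (label, features) rows.
train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.5, 1.3)),
    (1.0, Vectors.dense(2.2, 0.8)),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)   # basic hyperparameters
model = lr.fit(train)                                # train the model

# Score the training data and show the predictions.
model.transform(train).select("features", "prediction").show()

spark.stop()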
When the Apache Software Foundation started Hadoop, it was built around two important ideas:
MapReduce and a scale-out storage system. With institutional data, sensor data (IoT), social networking data,
etc., growing exponentially, there was a need to store vast amounts of data at very low cost. The answer
was HDFS (Hadoop Distributed File System). To process and analyse these huge amounts of
information from HDFS efficiently, Hadoop needed a new engine, and that engine was MapReduce.
MapReduce soon became the only way of processing and analysing data in the Hadoop ecosystem.
MapReduce being the only option soon led to the evolution of new engines for processing and analysing such huge
information stores, and Apache Spark became one of the most interesting of the engines that evolved.
Spark was originally designed and developed at Berkeley's AMPLab. To benefit from the
wide open source community at Apache, and to bring Spark to everyone interested in data analytics, the developers
donated the codebase to the Apache Software Foundation, and Apache Spark was born. Hence, Apache Spark is an
open source project from the Apache Software Foundation.
Hadoop vs Spark
Data Processing
Hadoop's MapReduce engine is only capable of batch processing.
Apache Spark’s flexible in-memory framework enables it to work with both batch and real-time streaming data.
This makes it suitable for big data analytics as well as real-time processing. Hence Apache Spark made continuous
processing of streaming data, rescoring of models and delivery of results in real time possible in the big data
ecosystem. A minimal streaming sketch follows.
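The sketch below counts words arriving on a socket with Structured Streaming; the host and port are assumptions (for local experiments, the stream can be fed with "nc -lk 9999").

# Structured Streaming word count over a socket source.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a stream of lines from a socket; host/port are assumptions.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split lines into words and maintain a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console after every micro-batch.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()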
Job Handling
In Hadoop, one has to break the whole job into smaller jobs and chain them together to fit the
MapReduce model. The APIs are also complex to understand, which makes building long MapReduce processing
pipelines difficult.
In Spark, the APIs are designed by developers for developers, and they do a great job of staying
simple. Spark lets you describe the entire job and then executes it efficiently in parallel.
In addition to distributed file systems, Spark also supports popular databases such as
MySQL and PostgreSQL through its SQL library, as sketched below.
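A hedged sketch of reading a MySQL table through Spark SQL's JDBC data source follows. The URL, database, table name and credentials are placeholders, and the MySQL JDBC driver (Connector/J) is assumed to be on Spark's classpath.

# Query a MySQL table through Spark SQL's JDBC data source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcExample").getOrCreate()

# All connection details below are placeholders; the MySQL Connector/J
# jar must be available on Spark's classpath.
df = (spark.read
           .format("jdbc")
           .option("url", "jdbc:mysql://localhost:3306/shop")
           .option("dbtable", "orders")
           .option("user", "spark_user")
           .option("password", "secret")
           .load())

df.createOrReplaceTempView("orders")                  # expose the table to SQL
spark.sql("SELECT COUNT(*) AS n FROM orders").show()  # query it with Spark SQL

spark.stop()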
The Apache Spark engine is fast for large-scale data processing and has the following notable features:
High Speed
Spark runs programs faster than Hadoop MapReduce: up to 100 times faster in memory and up to 10 times faster
on disk.
Ease of Use
Spark provides more than 80 high-level operations that make it easy to build parallel apps.
Ease of Programming
Spark programs can be developed in various programming languages, such as Java, Scala, Python and R.
Stack of Libraries
Spark combines SQL, Streaming, graph computation and MLlib (machine learning) to bring
generality to applications.
Running Environments
Spark can run in standalone cluster mode, on Hadoop, on Apache Mesos, or in the cloud.
Apache Spark works on a master-slave architecture. When a client submits Spark application code to the Spark
driver, the driver implicitly converts the transformations and actions into a Directed Acyclic Graph (DAG) and
submits it to the DAG scheduler (during this conversion it also performs optimizations such as pipelining
transformations). The DAG scheduler then converts the logical graph (the DAG) into a physical execution plan
containing stages of tasks, and these tasks are bundled to be sent to the cluster.
The cluster manager keeps track of the available resources in the cluster. Once the driver has created and bundled
the tasks, it negotiates with the cluster manager for worker nodes. After the negotiation (which results in the
allocation of resources for executing the Spark application), the cluster manager launches executors on the worker
nodes and lets the driver know about them. Based on the placement of the executors and their proximity to
data, the driver distributes the tasks to them. Once the executors are ready to start on the tasks, they register
themselves with the driver, so that the driver has a complete view of all executors and can monitor them during task
execution. Some tasks depend on the output data of other tasks. In such scenarios, the driver is
responsible for scheduling these future tasks in appropriate locations based on where data may be
cached or persisted.
While the Spark application is running, the driver exposes information to the user through a Web UI. Once the
SparkContext is stopped, the executors are terminated. The sketch below illustrates this lifecycle.
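The driver-side view of this process can be observed from any small application: transformations only extend the DAG, and the first action hands it to the scheduler. A minimal sketch, assuming a local master:

# Lazy transformations build the DAG; an action triggers scheduling.
from pyspark import SparkContext

sc = SparkContext("local[2]", "DagExample")

rdd = sc.parallelize(range(1, 1000))
mapped = rdd.map(lambda x: x * 2)            # lazy: only extends the DAG
filtered = mapped.filter(lambda x: x > 10)   # still lazy, pipelined with the map

# The action below triggers the DAG scheduler: the pipelined map/filter
# stage is turned into tasks and shipped to the executors.
print(filtered.count())

# While the application runs, the Web UI (http://localhost:4040 by default)
# shows jobs, stages and executors. Stopping the context terminates executors.
sc.stop()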
Apache Spark Use Cases
1. Financial Services
Identifying fraudulent transactions, adapting to new fraud techniques and updating models in real time are
required. Spark is also used to identify customers' stock-buying patterns and to make predictions for stock sales.
2. Online Retail Market
Online retail giants like Alibaba, Amazon and eBay use Spark for customer analytics, such as suggesting products
based on buying and browsing history, transaction logging, etc.
3. Expense Analytics
Concur uses Spark for personalization and for travel and expense analytics.
A huge number of companies and organisations use Apache Spark.
Summary
This article provides a good introduction to what Apache Spark is, the features of Apache Spark, its
differences from Apache Hadoop, the modules present in Apache Spark, the different operations available in
those modules, and finally some real-world use cases.