Tutorial: Scalable Data Analytics
using Apache Spark
Dr. Ahmet Bulut
@kral
http://www.linkedin.com/in/ahmetbulut
Intro to Spark
Cluster Computing
• Apache Spark is a cluster computing platform designed
to be fast and general-purpose.
• Cluster computing means running computational tasks across
many worker machines, which together form a computing cluster.
Unified Computing
• In Spark, you can write one application that uses
machine learning to classify data in real time as it is
ingested from streaming sources.
• Simultaneously, analysts can query the resulting data,
also in real time, via SQL (e.g., to join the data with
unstructured log-files).
• More sophisticated data engineers and data scientists
can access the same data via the Python shell for ad
hoc analysis.

Spark Stack
Spark Core
• Spark Core: the “computational engine” responsible
for scheduling, distributing, and monitoring applications
consisting of many computational tasks on a computing
cluster.
Spark Stack
• Spark Core: the basic functionality of Spark, including
components for task scheduling, memory management,
fault recovery, interacting with storage systems, and
more.
• Spark SQL: Spark’s package for working with
structured data.
• Spark Streaming: Spark component that enables
processing of live streams of data.
Spark Stack
• MLlib: library containing common machine learning
(ML) functionality including classification, regression,
clustering, and collaborative filtering, as well as
supporting functionality such as model evaluation and
data import.
• GraphX: library for manipulating graphs (e.g., a social
network’s friend graph) and performing graph-parallel
computations.
• Cluster Managers: the Standalone Scheduler, Apache
Mesos, and Hadoop YARN.

“Data Scientist: a person who is better at statistics than a
computer engineer, and better at computer engineering than
a statistician.”
I do not believe in this new job role.
Data Science is embracing all stakeholders.
Data Scientists of Spark age
• Data scientists use their skills to analyze data with the
goal of answering a question or discovering insights.
• Data science workflow involves ad hoc analysis.
• Data scientists use interactive shells (vs. building
complex applications) for seeing the results of their
queries and for writing snippets of code quickly.
Data Scientists of Spark age
• Spark’s speed and simple APIs shine for data science, and
its built-in libraries mean that many useful algorithms
are available out of the box.
Storage Layer
• Spark can create distributed datasets from any file
stored in the Hadoop distributed filesystem (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local filesystem, Amazon S3, Cassandra,
Hive, HBase, etc.).
• Spark does not require Hadoop; it simply has support
for storage systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro, Parquet,
and any other Hadoop InputFormat.
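To make this concrete, here is a minimal sketch; the paths and URIs below are hypothetical and only illustrate that the same sc.textFile() call works across storage systems:

>>> local_rdd = sc.textFile("README.md")                          # local filesystem
>>> hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/logs.txt")  # HDFS (hypothetical URI)
>>> s3_rdd = sc.textFile("s3n://my-bucket/data/logs.txt")         # Amazon S3 (hypothetical bucket)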

Downloading Spark
• The first step to using Spark is to download and unpack
it.
• For a recent precompiled release of Spark, visit
http://spark.apache.org/downloads.html.
• Select the package type of “Pre-built for Hadoop 2.4 and
later,” and click “Direct Download.”
• This will download a compressed TAR file, or tarball,
called spark-1.2.0-bin-hadoop2.4.tgz.
Directory structure
• README.md

Contains short instructions for getting started with Spark.
• bin 

Contains executable files that can be used to interact with
Spark in various ways.
Directory structure
• core, streaming, python, ... 

Contains the source code of major components of the Spark
project.
• examples 

Contains some helpful Spark standalone jobs that you can
look at and run to learn about the Spark API.
PySpark
• The first step is to open up one of Spark’s shells. To
open the Python version of the Spark shell, which we
also refer to as the PySpark shell, go into your Spark
directory and type:

  $ bin/pyspark

Logging verbosity
• To control the verbosity of the logging, create a file
in the conf directory called log4j.properties.
• To make the logging less verbose, make a copy of conf/
log4j.properties.template called conf/log4j.properties and
find the following line:

  log4j.rootCategory=INFO, console

  Then lower the log level to

  log4j.rootCategory=WARN, console

IPython
• IPython is an enhanced Python shell that offers features
such as tab completion. Instructions for installing it are at

http://ipython.org.
• You can use IPython with Spark by setting the
IPYTHON environment variable to 1: 



IPYTHON=1 ./bin/pyspark
IPython
• To use the IPython Notebook, which is a web-browser-
based version of IPython, use
IPYTHON_OPTS="notebook" ./bin/pyspark
• On Windows, set the variable and run the shell as
follows: 

set IPYTHON=1

bin\pyspark
Script #1
•# Create an RDD

>>> lines = sc.textFile("README.md")
•# Count the number of items in the RDD

>>> lines.count()
•# Show the first item in the RDD

>>> lines.first()

Resilient Distributed Dataset
• The variable lines is an RDD: Resilient Distributed
Dataset.
• On RDDs, you can run parallel operations.
Intro to
Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
• Spark Shell is a driver program itself.
• Driver programs access Spark through a SparkContext
object, which represents a connection to a computing
cluster.
• In the Spark shell, the context is automatically created
as the variable sc.
Architecture
Intro to
Core Spark Concepts
• Driver programs manage a number of nodes called
executors.
• For example, running count() on a cluster would
translate into different nodes counting different
ranges of the input file.

Script #2
•>>> lines = sc.textFile("README.md")
•>>> pythonLines = lines.filter(lambda line: "Python" in line)
•>>> pythonLines.first()
Standalone applications
• Apart from running interactively, Spark can be linked
into standalone applications in either Python, Scala, or
Java.
• The main difference is that you need to initialize your
own SparkContext.
• How to do it in Python:

  Write your applications as Python scripts as you
normally do, but to run them with cluster-aware logic,
use the spark-submit script.
Standalone applications
•$ bin/spark-submit my_script.py
• The spark-submit script sets up the environment for
Spark’s Python API to function by including Spark
dependencies.
Initializing Spark in Python
• # Excerpt from your driver program



from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")

sc = SparkContext(conf=conf)
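Putting the pieces together, a minimal standalone script could look like the sketch below (the filename my_script.py and the README.md path are illustrative only); it would then be launched with bin/spark-submit my_script.py:

# my_script.py -- a minimal standalone sketch
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

lines = sc.textFile("README.md")   # build an RDD from a text file
print(lines.count())               # action: triggers the actual computation

sc.stop()                          # release the connection to the cluster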

Operations
Operations on RDDs
• Transformations and Actions.
• Transformations construct a new RDD from a previous
one.
• “Filtering data that matches a predicate” is an example
transformation.
Transformations
• Let’s create an RDD that holds strings containing the
word Python.
•>>> pythonLines = lines.filter(lambda line: "Python" in line)
Actions
• Actions compute a result based on an RDD.
• They can return the result to the driver, or to an
external storage system (e.g., HDFS).
•>>> pythonLines.first()

Transformations & Actions
• You can create RDDs at any time using transformations.
• But Spark will only materialize them once they are used in an
action.
• This is a lazy approach to RDD creation.
Lazy …
• Assume that you want to work with a Big Data file.
• But you are only interested in the lines that contain
Python.
• Were Spark to load and save all the lines in the file as
soon as sc.textFile(…) is called, it would waste storage
space.
• Therefore, Spark waits to see the whole chain of transformations
first, and only computes the result when an action is called.
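A small sketch of this laziness in the PySpark shell (nothing is read from disk until the final action runs):

>>> lines = sc.textFile("README.md")                          # no data is read yet
>>> pythonLines = lines.filter(lambda line: "Python" in line) # still nothing is read
>>> pythonLines.first()                                       # action: Spark now reads only as much of the file as it needs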
Persistence of RDDs
• RDDs are re-computed each time you run an action on
them.
• In order to re-use an RDD in multiple actions, you can
ask Spark to persist it using RDD.persist().
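For example, a sketch of reusing the pythonLines RDD from earlier across two actions:

>>> pythonLines.persist()   # ask Spark to keep this RDD around after it is first computed
>>> pythonLines.count()     # first action: computes the RDD and caches it
>>> pythonLines.first()     # second action: reuses the cached copy instead of recomputing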
Resilience of RDDs
• Once computed, an RDD can be materialized in memory.
• Persistence to disk is also possible.
• Persistence is optional, and not the default behavior. The
reason is that if you are not going to re-use an RDD,
there is no point in wasting storage space by persisting
it.
• The ability to re-compute is what makes RDDs resilient
to node failures.

Pair RDDs
Working with Key/Value Pairs
• Most often you ETL your data into a key/value format.
• Key/value RDDs let you 

count up reviews for each product,

group together data with the same key,

group together two different RDDs.
Pair RDD
• RDDs containing key/value pairs are called pair RDDs.
• Pair RDDs are a useful building block in many programs
as they expose operations that allow you to act on each
key in parallel or regroup data across the network.
• For example, pair RDDs have a reduceByKey() method
that can aggregate data separately for each key.
• The join() method merges two RDDs together by grouping
elements with the same key.
Creating Pair RDDs
• Use a map() function that returns key/value pairs.
•pairs = lines.map(lambda x: (x.split(" ")[0], x))

Transformations on Pair
RDDs
• Let the rdd be [(1,2),(3,4),(3,6)]
• reduceByKey(func) combines values with the same key.
•>>> rdd.reduceByKey(lambda x,y: x+y) —> [(1,2),(3,10)]
•groupByKey() groups values with the same key.
•>>> rdd.groupByKey() —> [(1,[2]),(3,[4,6])]
Transformations on Pair
RDDs
• mapValues(func) applies a function to each value of a
pair RDD without changing the key.
•>>> rdd.mapValues(lambda x: x+1)
•keys() returns an rdd of just the keys.
•>>> rdd.keys()
•values() returns an rdd of just the values.
•>>> rdd.values()
Transformations on Pair
RDDs
• sortByKey() returns an rdd, which has the same contents
as the original rdd, but sorted by its keys.
•>>> rdd.sortByKey()
Transformations on Pair
RDDs
•join() performs an inner join between two RDDs.
• Let rdd1 be [(1,2),(3,4),(3,6)] and rdd2 be [(3,9)].
•>>> rdd1.join(rdd2) —> [(3,(4,9)),(3,(6,9))]

Pair RDDs are still RDDs
You can also filter by value! Try it.
Pair RDDs are still RDDs
• Given that pairs is an RDD with the key being an
integer:
•>>> filteredRDD = pairs.filter(lambda x: x[0]>5)
Let's do a word count
•>>> rdd = sc.textFile("README.md")
•>>> words = rdd.flatMap(lambda x: x.split(" "))
•>>> result = words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
Let's identify the top words
•>>> sc.textFile("README.md")
       .flatMap(lambda x: x.split(" "))
       .map(lambda x: (x.lower(),1))
       .reduceByKey(lambda x,y: x+y)
       .map(lambda x: (x[1],x[0]))
       .sortByKey(ascending=False)
       .take(5)

Per key aggregation
•>>> aggregateRDD = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
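Since aggregateRDD now holds (key, (sum, count)) pairs, one more mapValues() turns them into per-key averages; a minimal sketch:

>>> averages = aggregateRDD.mapValues(lambda x: x[0] / float(x[1]))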
Grouping data
• On an RDD consisting of keys of type K and values of
type V, we get back an RDD of type [K, Iterable[V]].
• >>> rdd.groupByKey()
• We can group data from multiple RDDs using cogroup().
• Given two RDDs sharing the same key type K, with the
respective value types as V and W, the resulting RDD is
of type [K, (Iterable[V], Iterable[W])].
• >>> rdd1.cogroup(rdd2)
Joins
• There are two types of joins: inner joins and outer
joins.
• Inner joins require a key to be present in both RDDs.
There is a join() call.
• Outer joins do not require a key to be present in both
RDDs. There are leftOuterJoin() and rightOuterJoin() calls.
None is used as the value for the RDD that is missing the
key.

Joins
•>>> rdd1, rdd2 = [('A',1),('B',2),('C',1)], [('A',3),('C',2),('D',4)]
•>>> rdd1, rdd2 = sc.parallelize(rdd1), sc.parallelize(rdd2)
•>>> rdd1.leftOuterJoin(rdd2).collect()
  [('A', (1, 3)), ('C', (1, 2)), ('B', (2, None))]
•>>> rdd1.rightOuterJoin(rdd2).collect()
  [('A', (1, 3)), ('C', (1, 2)), ('D', (None, 4))]
Sorting data
• We can sort an RDD with Key/Value pairs provided that
there is an ordering defined on the key.
• Once we have sorted our data, subsequent calls, e.g., collect(),
return ordered data.
•>>> rdd.sortByKey(ascending=True,
numPartitions=None, keyfunc=lambda x: str(x))
Actions on pair RDDs
•>>> rdd1 = sc.parallelize([('A',1),('B',2),('C',1)])
•>>> rdd1.collectAsMap()
  {'A': 1, 'B': 2, 'C': 1}
•>>> rdd1.countByKey()['A']
  1
Advanced Concepts

Accumulators
• Accumulators are shared variables.
• They are used to aggregate values from worker nodes
back to the driver program.
• One of the most common uses of accumulators is to
count events that occur during job execution for
debugging purposes.
Accumulators
•>>> inputfile = sc.textFile(inputFile)
• # Let's create an Accumulator[Int] initialized to 0
•>>> blankLines = sc.accumulator(0)
Accumulators
•>>> def parseOutAndCount(line):
...       # Make the global variable accessible
...       global blankLines
...       if (line == ""): blankLines += 1
...       return line.split(" ")
•>>> rdd = inputfile.flatMap(parseOutAndCount)
• Do an action so that the workers do real work!
•>>> rdd.saveAsTextFile(outputDir + "/xyz")
•>>> blankLines.value
Accumulators & Fault Tolerance
• Spark automatically deals with failed or slow machines
by re-executing failed or slow tasks.
• For example, if the node running a partition of a map()
operation crashes, Spark will rerun it on another node.
• If the node does not crash but is simply much slower
than other nodes, Spark can preemptively launch a
“speculative” copy of the task on another node, and
take its result instead if that finishes earlier.

Accumulators & Fault Tolerance
• Even if no nodes fail, Spark may have to rerun a task to
rebuild a cached value that falls out of memory. 





“The net result is therefore that the same function may
run multiple times on the same data depending on
what happens on the cluster.”
Accumulators & Fault Tolerance
• For accumulators used in actions, Spark applies each
task’s update to each accumulator only once.
• For accumulators used in RDD transformations
instead of actions, this guarantee does not exist.
• Bottom line: use accumulators only in actions.
Broadcast Variables
• Spark’s second type of shared variable, broadcast
variables, allows the program to efficiently send a large,
read-only value to all the worker nodes for use in one
or more Spark operations.
• Use it if your application needs to send a large, read-
only lookup table or a large feature vector in a
machine learning algorithm to all the nodes.
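A minimal sketch (the lookup table and its contents are hypothetical): the table is shipped once to each worker and read via .value:

>>> country_names = sc.broadcast({"TR": "Turkey", "US": "United States"})
>>> codes = sc.parallelize(["TR", "US", "TR"])
>>> codes.map(lambda code: country_names.value[code]).collect()
['Turkey', 'United States', 'Turkey']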
Yahoo SEM Click Data
• Dataset: Yahoo’s Search Marketing Advertiser Bid-
Impression-Click data, version 1.0
• 77,850,272 rows, 8.1GB in total.
• Data fields:

0 day

1 anonymized account_id

2 rank

3 anonymized keyphrase (list of anonymized keywords)

4 avg bid

5 impressions

6 clicks

Sample data rows
1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0
76 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
48 08bade48-1081-488f-b459-6c75d75312ae 3 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
97 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
123 08bade48-1081-488f-b459-6c75d75312ae 5 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
119 08bade48-1081-488f-b459-6c75d75312ae 3 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
73 08bade48-1081-488f-b459-6c75d75312ae 1 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
• Primary key: date, account_id, rank and keyphrase.
• Average bid, impressions and clicks information is 

aggregated over the primary key.
Feeling clicky?
keyphrase                          impressions   clicks
iphone 6 plus for cheap            100           2
new samsung tablet                 10            1
iphone 5 refurbished               2             0
learn how to program for iphone    200           0
Getting Clicks = Popularity
• Click Through Rate (CTR) = (# of clicks) / (# of impressions)
• If CTR > 0, it is a popular keyphrase.
• If CTR == 0, it is an unpopular keyphrase.
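As a plain-Python sketch of this labeling rule (the function name is illustrative):

>>> def is_popular(clicks, impressions):
...     return 1 if clicks / float(impressions) > 0 else 0
>>> is_popular(2, 100)
1
>>> is_popular(0, 200)
0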
Keyphrase = {terms}
• Given the keyphrase “iphone 6 plus for cheap”, its terms are:

  iphone
  6
  plus
  for
  cheap

Contingency table
Keyphrases          got clicks   no clicks       Total
term t present      s            n-s             n
term t absent       S-s          (N-S)-(n-s)     N-n
Total               S            N-S             N
Clickiness of a term
• For the term-presence vs. click-reception contingency
table shown previously, we can compute a given term t’s
clickiness value ct as follows:

  ct = log [ ((s+0.5)/(S-s+0.5)) / ((n-s+0.5)/(N-n-S+s+0.5)) ]
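A direct transcription of this formula into Python, as a sketch (the parameter names follow the contingency table above; math.log is the natural logarithm, matching the Spark code later in this tutorial):

>>> import math
>>> def clickiness(s, S, n, N):
...     numerator = (s + 0.5) / (S - s + 0.5)
...     denominator = (n - s + 0.5) / (N - n - S + s + 0.5)
...     return math.log(numerator / denominator)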
Clickiness of a keyphrase
• Given a keyphrase K that consists of terms t1 t2 … tn, 

its clickiness can be computed by summing up the
clickiness of the terms present in it.
• That is, cK = ct1 + ct2 + … + ctn
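Given a plain dict of per-term clickiness values, the keyphrase score is just a sum over its terms; a sketch with made-up values:

>>> cts = {"iphone": 0.75, "cheap": 0.5, "for": 0.25}   # hypothetical clickiness values
>>> keyphrase = "iphone for cheap"
>>> sum(cts.get(t, 0.0) for t in keyphrase.split())
1.5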
Feeling clicky?
keyphrase                          impressions   clicks   clickiness
iphone 6 plus for cheap            100           2        1
new samsung tablet                 10            1        1
iphone 5 refurbished               2             0        0
learn how to program for iphone    200           0        0

Clickiness of iphone
Keyphrases              got clicks   no clicks   Total
term iphone present     1            2           3
term iphone absent      1            0           1
Total                   2            2           4
Clickiness of iphone
ciphone = log [ ((1+0.5)/(1+0.5)) / ((2+0.5)/(0+0.5)) ]
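Using math.log (natural log), as in the Spark code later in this tutorial, this evaluates to:

>>> import math
>>> math.log(((1 + 0.5) / (1 + 0.5)) / ((2 + 0.5) / (0 + 0.5)))
-1.6094379124341003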
• Given keyphrases and their clickiness:

  k1 = t12 t23 … t99      1
  k2 = t19 t201 … t1      0
  k3 = t1 t2 … t101       1
  …
  kn = t1 t2 … t101       1
Mapping Yahoo’s click data
•>>> import math
•>>> rdd = sc.textFile("yahoo_keywords_bids_clicks")
         .map(lambda line: (line.split("\t")[3],
                            (float(line.split("\t")[-2]), float(line.split("\t")[-1]))))
•>>> rdd = rdd.reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1]))
              .mapValues(lambda x: 1 if (x[1]/x[0]) > 0 else 0)

Recommended for you

Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem

This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark

Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming. Presented at the Desert Code Camp: http://oct2016.desertcodecamp.com/sessions/all

big dataspark sqlapache spark
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark

Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming. Presented at the Desert Code Camp: http://oct2016.desertcodecamp.com/sessions/all

sparkspark streamingbig data
• Given keyphrases and their clickiness:

  k1 = t12 t23 … t99      1
  k2 = t19 t201 … t1      0
  k3 = t1 t2 … t101       1
  …
  kn = t1 t2 … t101       1
flatMapping
  flatMap each keyphrase to (term, clickiness) pairs:
  k2 —> (t19, 0), (t201, 0), …, (t1, 0)
  k3 —> (t1, 1), (t2, 1), …, (t101, 1)
flatMapping
•>>> keyphrases0 = rdd.filter(lambda x: x[1]==0)
•>>> keyphrases1 = rdd.filter(lambda x: x[1]==1)
•>>> rdd0 = keyphrases0.flatMap(lambda x: [(e,1) for e in x[0].split()])
•>>> rdd1 = keyphrases1.flatMap(lambda x: [(e,1) for e in x[0].split()])
•>>> iR = keyphrases0.count()
•>>> R = keyphrases1.count()
Reducing
  rdd0: (t1, 19), (t12, 19), (t101, 19), …
  rdd1: (t1, 200), (t12, 11), (t101, 1), …

Reducing by Key and Mapping Values
•>>> t_rdd0 = rdd0.reduceByKey(lambda x,y: x+y).mapValues(lambda x: (x+0.5)/(iR-x+0.5))
•>>> t_rdd1 = rdd1.reduceByKey(lambda x,y: x+y).mapValues(lambda x: (x+0.5)/(R-x+0.5))
Mapping Values
  t_rdd0: (t1, some float value), (t12, some float value), (t101, some float value), …
  t_rdd1: (t1, some float value), (t12, some float value), (t101, some float value), …
Joining to compute ct
•>>> ct_rdd = t_rdd0.join(t_rdd1).mapValues(lambda x: math.log(x[1]/x[0]))

Recommended for you

Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science

This document discusses the need for data science skills and proposes a curriculum to address the skills gap. It notes that the web has evolved from static HTML to user-generated content and now machines understanding information. Current jobs require data analysis, idea generation, and hypothesis testing skills. A study found enterprises have major skills gaps in mobile, cloud, social and analytics technologies. The proposed curriculum aims to directly teach needed skills while keeping students engaged. Core classes focus on algorithms, systems, architecture, and machine intelligence. The curriculum is designed to bridge undergraduate and graduate programs and use Python to keep students engaged with hands-on projects. A future data science graduate program is outlined focusing on data engineering, networks, visualization, scalable systems, big data

higher educationdata science
Agile Software Development
Agile Software DevelopmentAgile Software Development
Agile Software Development

This document provides an overview of the CS 361 Software Engineering course. It outlines attendance rules, instructors, required coursebooks, and key topics that will be covered including Agile development methodologies, Waterfall methodology, the Agile Manifesto, enabling technologies for Agile development, pair programming, user stories, system metaphors, on-site customers, and more. The document aims to introduce students to the structure and content of the course.

pair programmingpythonagile manifesto
What is open source?
What is open source?What is open source?
What is open source?

Open source refers to the process by which software is created, not the software itself. The open source process involves voluntary participation where anyone can contribute code freely and choose what tasks to work on. It relies on collaboration between many developers worldwide who are motivated to scratch an itch, avoid reinventing the wheel, solve problems in parallel, and leverage the law of large numbers through continuous beta testing. Documentation and frequent releases are also important aspects of open source development.

linuxopen sourcesource code
Broadcasting the look-up table ct to all workers
•>>> cts = sc.broadcast(dict(ct_rdd.collect()))
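As an aside, each worker reads the broadcast value through cts.value; a tiny illustration (t42 is a made-up term):

# cts.value is the plain Python dict that Spark ships once to every worker.
weight = cts.value.get("t42", 0.0)   # hypothetical term look-up with a default of 0.0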
Measuring the accuracy of clickiness prediction
•>>> def accuracy(rdd, cts, threshold):
        csv_rdd = rdd.map(lambda x: (x[0], x[1], sum([cts.value[t] for t in x[0].split() if t in cts.value])))
        results = csv_rdd.map(lambda x: (x[1] == (1 if x[2] > threshold else 0), 1)).reduceByKey(lambda x, y: x + y).collect()
        print float(results[1][1]) / (results[0][1] + results[1][1])
•>>> accuracy(rdd,cts,10)
•>>> accuracy(rdd,cts,-10)
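One caveat: indexing results[0] and results[1] assumes a particular ordering of the (False, count) and (True, count) pairs returned by collect(), which Spark does not guarantee. A sketch of an order-independent variant, reusing the same assumed rdd and broadcast cts:

def accuracy_safe(rdd, cts, threshold):
    # Score each record by summing the broadcast weights of its terms.
    scored = rdd.map(lambda x: (x[0], x[1],
                                sum(cts.value[t] for t in x[0].split() if t in cts.value)))
    # Key each record by whether the thresholded prediction matches its label.
    counts = dict(scored.map(lambda x: (x[1] == (1 if x[2] > threshold else 0), 1))
                        .reduceByKey(lambda a, b: a + b)
                        .collect())
    correct = counts.get(True, 0)
    total = correct + counts.get(False, 0)
    return float(correct) / total if total else 0.0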
Spark SQL
Spark SQL
• Spark’s interface for working with structured and semistructured data.
• Structured data is any data that has a schema, i.e., a known set of fields for each record.

Spark SQL
• Spark SQL can load data from a variety of structured
sources (e.g., JSON, Hive and Parquet).
• Spark SQL lets you query the data using SQL both
inside a Spark program and from external tools that
connect to Spark SQL through standard database
connectors (JDBC/ODBC), such as business intelligence
tools like Tableau.
• You can join RDDs and SQL Tables using Spark SQL.
Spark SQL
• Spark SQL provides a special type of RDD called
SchemaRDD.
• A SchemaRDD is an RDD of Row objects, each
representing a record.
• A SchemaRDD knows the schema of its rows.
• You can run SQL queries on SchemaRDDs.
• You can create SchemaRDD from external data sources,
from the result of queries, or from regular RDDs.
Spark SQL
Spark SQL
• Spark SQL can be used via SQLContext or HiveContext.
• SQLContext supports a subset of Spark SQL functionality and excludes Hive support.
• Prefer HiveContext; it provides a superset of SQLContext’s functionality and does not require an existing Hive installation.
• If you have an existing Hive installation, you need to
copy your hive-site.xml to Spark’s configuration
directory.

Spark SQL
• Spark will create its own Hive metastore (a metadata DB) called metastore_db in your program’s working directory.
• The tables you create will be placed underneath /user/hive/warehouse on your default file system:
  - the local FS, or
  - HDFS, if you have hdfs-site.xml on your classpath.
Creating a HiveContext
• >>> ## Assuming that sc is our SparkContext
•>>> from pyspark.sql import HiveContext, Row
•>>> hiveCtx = HiveContext(sc)
Basic Query Example
• ## Assume that we have an input JSON file.
•>>> rdd = hiveCtx.jsonFile("reviews_Books.json")
•>>> rdd.registerTempTable("reviews")
•>>> topterms = hiveCtx.sql("SELECT * FROM reviews LIMIT 10").collect()
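As a further illustration, an aggregate query over the same temporary table (a sketch; overall is an assumed rating field in reviews_Books.json):

# Count reviews per star rating over the registered "reviews" table.
by_rating = hiveCtx.sql(
    "SELECT overall, COUNT(*) AS n FROM reviews GROUP BY overall ORDER BY overall")
for row in by_rating.collect():
    print(row.overall, row.n)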
SchemaRDD
• Both loading data and executing queries return a
SchemaRDD.
• A SchemaRDD is an RDD composed of Row objects
with additional schema information of the types in each
column.
• Row objects are wrappers around arrays of basic types
(e.g., integers and strings).
• In most recent Spark versions, SchemaRDD is renamed
to DataFrame.

SchemaRDD
• A SchemaRDD is also an RDD, and you can run regular RDD transformations (e.g., map() and filter()) on it as well.
• You can register any SchemaRDD as a temporary table and query it via hiveCtx.sql.
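For instance, a short sketch reusing the reviews SchemaRDD loaded earlier (overall is an assumed field name):

# Regular RDD transformations work directly on a SchemaRDD of Row objects.
five_star = rdd.filter(lambda row: row.overall == 5.0)
print(five_star.count())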
Working with Row objects
• In Python, you access the ith field of a row using row[i], or by column name as row.column_name.
•>>> topterms.map(lambda row: row.Keyword)
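A quick illustration of both access styles (a sketch against the reviews table registered earlier; reviewerName and overall are assumed field names):

rows = hiveCtx.sql("SELECT reviewerName, overall FROM reviews LIMIT 5")
# Access fields by name or by position on each Row object.
pairs = rows.map(lambda row: (row.reviewerName, row[1]))
print(pairs.collect())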
Caching
• If you expect to run multiple tasks or queries against the same data, you can cache it.
•>>> hiveCtx.cacheTable("mysearchterms")
• When caching a table, Spark SQL represents the data in an in-memory columnar format.
• The cached table will be destroyed once the driver exits.
Printing schema
•>>> rdd = hiveCtx.jsonFile("reviews_Books.json")
•>>> rdd.printSchema()

Converting an RDD to a SchemaRDD
• First create an RDD of Row objects and then call inferSchema() on it.
•>>> rdd = sc.parallelize([Row(name="hero", favouritecoffee="industrial blend")])
•>>> srdd = hiveCtx.inferSchema(rdd)
•>>> srdd.registerTempTable("myschemardd")
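Once registered, the temporary table can be queried like any other (a minimal sketch continuing the example above):

heroes = hiveCtx.sql("SELECT name FROM myschemardd WHERE favouritecoffee LIKE '%blend%'")
print(heroes.collect())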
Working with nested data
•>>> import json
•>>> a = [{'name': 'mickey'}, {'name': 'pluto', 'knows': {'friends': ['mickey', 'donald']}}]
•>>> rdd = sc.parallelize(a)
•>>> rdd.map(lambda x: json.dumps(x)).saveAsTextFile("test")
•>>> srdd = sqlContext.jsonFile("test")
Working with nested data
• >>> srdd.printSchema()
root
 |-- knows: struct (nullable = true)
 |    |-- friends: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)
Working with nested data
•>>> srdd.registerTempTable("test")
• >>> sqlContext.sql("SELECT knows.friends FROM
test").collect()

MLlib
MLlib
• Spark’s library of machine learning functions.
• The design philosophy is simple:

- Invoke ML algorithms on RDDs.
Learning in a nutshell
Text Classification
• Step 1. Start with an RDD of strings representing your
messages.
• Step 2. Run one of MLlib’s feature extraction algorithms to convert the text into numerical features suitable for learning algorithms. The result is an RDD of vectors.
• Step 3. Call a classification algorithm (e.g., logistic regression) on the RDD of vectors. The result is a model.
Text Classification
• Step 4. You can evaluate the model on a test set.
• Step 5. You can use the model for prediction: given a new data sample, classify it with the model.
System requirements
• MLlib requires the gfortran runtime library for your OS.
• MLlib needs NumPy.

Spam Classification
•>>> from pyspark.mllib.regression import LabeledPoint
•>>> from pyspark.mllib.feature import HashingTF
•>>> from pyspark.mllib.classification import LogisticRegressionWithSGD
•>>> spamRows = sc.textFile("spam.txt")
•>>> hamRows = sc.textFile("ham.txt")
Spam Classification
• ### for mapping emails to vectors of 10000 features.
•>>> tf = HashingTF(numFeatures=10000)
Spam Classification
• ## Feature extraction: email —> word features
•>>> spamFeatures = spamRows.map(lambda email: tf.transform(email.split(" ")))
•>>> hamFeatures = hamRows.map(lambda email: tf.transform(email.split(" ")))
Spam Classification
• ### Label feature vectors
•>>> spamExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
•>>> hamExamples = hamFeatures.map(lambda features: LabeledPoint(0, features))

Spam Classification
•>>> trainingData = spamExamples.union(hamExamples)
• ### Since learning via Logistic Regression is iterative
•>>> trainingData.cache()
Spam Classification
•>>> model = LogisticRegressionWithSGD.train(trainingData)
Spam Classification
• ### Let's test it!
•>>> posTest = tf.transform("O M G GET cheap stuff".split(" "))
•>>> negTest = tf.transform("Enjoy Spark on Machine Learning".split(" "))
•>>> print model.predict(posTest)
•>>> print model.predict(negTest)
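Putting the pieces together, a self-contained sketch of the same pipeline (assuming spam.txt and ham.txt each contain one email per line, and that sc is the SparkContext from the shell):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Hash each email's words into a fixed-size feature vector.
tf = HashingTF(numFeatures=10000)

spamRows = sc.textFile("spam.txt")   # assumed: one spam email per line
hamRows = sc.textFile("ham.txt")     # assumed: one non-spam email per line

spamExamples = spamRows.map(
    lambda email: LabeledPoint(1, tf.transform(email.split(" "))))
hamExamples = hamRows.map(
    lambda email: LabeledPoint(0, tf.transform(email.split(" "))))

# Cache: logistic regression makes several passes over the data.
trainingData = spamExamples.union(hamExamples).cache()
model = LogisticRegressionWithSGD.train(trainingData)

# Score two new messages with the trained model (1 = spam, 0 = ham).
print(model.predict(tf.transform("O M G GET cheap stuff".split(" "))))
print(model.predict(tf.transform("Enjoy Spark on Machine Learning".split(" "))))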
Data Types
• MLlib contains a few specific data types located in pyspark.mllib.
• Vector: a mathematical vector (sparse or dense).
• LabeledPoint: a pair of a feature vector and its label.
• Rating: a rating of a product by a user.
• Various Model classes: the result of training. Each model has a predict() function for ad hoc querying.
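A brief illustration of these types (a sketch; the dense and sparse vectors shown are arbitrary examples):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

dense = Vectors.dense([1.0, 0.0, 3.0])          # dense vector
sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})    # the same vector in sparse form
point = LabeledPoint(1.0, sparse)               # label 1.0 paired with its features
print(point.label, point.features)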

Spark it!

Apache Spark Tutorial

  • 1. Tutorial: Scalable Data Analytics using Apache Spark Dr.Ahmet Bulut @kral http://www.linkedin.com/in/ahmetbulut
  • 3. Cluster Computing • Apache Spark is a cluster computing platform designed to be fast and general-purpose. • Running computational tasks across many worker machines, or a computing cluster.
  • 4. Unified Computing • In Spark, you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. • Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured log-files). • More sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis.
  • 6. Spark Core • Spark core:“computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks on a computing cluster.
  • 7. Spark Stack • Spark Core: the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. • Spark SQL: Spark’s package for working with structured data. • Spark Streaming: Spark component that enables processing of live streams of data.
  • 8. Spark Stack • MLlib: library containing common machine learning (ML) functionality including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. • GraphX: library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations. • Cluster Managers: Standalone Scheduler,Apache Mesos, HadoopYARN.
  • 9. “Data Scientist: a person, who is better at statistics than a computer engineer, 
 and better at computer engineering 
 than a statistician.” I do not believe in this new job role.
 Data Science is embracing all stakeholders.
  • 10. Data Scientists of Spark age • Data scientists use their skills to analyze data with the goal of answering a question or discovering insights. • Data science workflow involves ad hoc analysis. • Data scientists use interactive shells (vs. building complex applications) for seeing the results to their queries and for writing snippets of code quickly.
  • 11. Data Scientists of Spark age • Spark’s speed and simple APIs shine for data science, and its built-in libraries mean that many useful algorithms are available out of the box.
  • 12. Storage Layer • Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem,Amazon S3, Cassandra, Hive, HBase, etc.). • Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. • Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
  • 13. Downloading Spark • The first step to using Spark is to download and unpack it. • For a recent precompiled released version of Spark. • Visit http://spark.apache.org/downloads.html • Select the package type of “Pre-built for Hadoop 2.4 and later,” and click “Direct Download.” • This will download a compressed TAR file, or tarball, called spark-1.2.0-bin-hadoop2.4.tgz.
  • 14. Directory structure • README.md
 Contains short instructions for getting started with Spark. • bin 
 Contains executable files that can be used to interact with Spark in various ways.
  • 15. Directory structure • core, streaming, python, ... 
 Contains the source code of major components of the Spark project. • examples 
 Contains some helpful Spark standalone jobs that you can look at and run to learn about the Spark API.
  • 16. PySpark • The first step is to open up one of Spark’s shells.To open the Python version of the Spark shell, which we also refer to as the PySpark Shell, go into your Spark directory and type: 
 
 $ bin/pyspark
  • 17. Logging verbosity • You can control the verbosity of the logging, create a file in the conf directory called log4j.properties. • To make the logging less verbose, make a copy of conf/ log4j.properties.template called conf/log4j.properties and find the following line: 
 log4j.rootCategory=INFO, console
 
 Then lower the log level to
 log4j.rootCategory=WARN, console

  • 18. IPython • IPython is an enhanced Python shell that offers features such as tab completion. Instructions for installing it are at 
 http://ipython.org. • You can use IPython with Spark by setting the IPYTHON environment variable to 1: 
 
 IPYTHON=1 ./bin/pyspark
  • 19. IPython • To use the IPython Notebook, which is a web-browser-based version of IPython, use IPYTHON_OPTS="notebook" ./bin/pyspark • On Windows, set the variable and run the shell as follows: 
 set IPYTHON=1 
 bin\pyspark
  • 20. Script #1 •# Create an RDD
 >>> lines = sc.textFile("README.md") •# Count the number of items in the RDD
 >>> lines.count() •# Show the first item in the RDD
 >>> lines.first()
  • 21. Resilient Distributed 
 Dataset • The variable lines is an RDD: a Resilient Distributed Dataset. • On RDDs, you can run parallel operations.
  • 22. Intro to Core Spark Concepts • Every Spark application consists of a driver program that launches various parallel operations on a cluster. • The Spark shell is a driver program itself. • Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster. • In the Spark shell, the context is automatically created as the variable sc.
  • 24. Intro to Core Spark Concepts • Driver programs manage a number of nodes called executors. • For example, running count() on a cluster translates into different nodes counting different ranges of the input file.
  • 25. Script #2 •>>> lines = sc.textFile("README.md") •>>> pythonLines = lines.filter(lambda line: "Python" in line) •>>> pythonLines.first()
  • 26. Standalone applications • Apart from running interactively, Spark can be linked into standalone applications in either Python, Scala, or Java. • The main difference is that you need to initialize your own SparkContext. • How to do it in Python: 
 write your applications as Python scripts as you normally do, but to run them with cluster-aware logic, submit them with the spark-submit script.
  • 27. Standalone applications •$ bin/spark-submit my_script.py • The spark-submit script sets up the environment for Spark’s Python API to function by including Spark dependencies.
  • 28. Initializing Spark in Python • # Excerpt from your driver program
 
 from pyspark import SparkConf, SparkContext
 conf = SparkConf().setMaster("local").setAppName("My App")
 sc = SparkContext(conf=conf)
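Putting the two previous slides together, a minimal standalone script might look like the sketch below (the file name my_script.py and the README.md path are just illustrative):

 # my_script.py -- a minimal standalone sketch
 from pyspark import SparkConf, SparkContext

 conf = SparkConf().setMaster("local").setAppName("My App")
 sc = SparkContext(conf=conf)

 # count the lines that mention Python, as in the interactive example
 lines = sc.textFile("README.md")
 print lines.filter(lambda line: "Python" in line).count()

 sc.stop()  # shut the context down when the job is done

It would then be submitted with bin/spark-submit my_script.py, as shown above.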
  • 30. Operations on RDDs • Transformations and Actions. • Transformations construct a new RDD from a previous one. • “Filtering data that matches a predicate” is an example transformation.
  • 31. Transformations • Let’s create an RDD that holds strings containing the word Python. •>>> pythonLines = lines.filter(lambda line: "Python" in line)
  • 32. Actions • Actions compute a result based on an RDD. • They can return the result to the driver, or to an external storage system (e.g., HDFS). •>>> pythonLines.first()
  • 33. Transformations & Actions • You can create RDDs at any time using transformations. • But Spark will only materialize them once they are used in an action. • This lazy approach to RDD creation is known as lazy evaluation.
  • 34. Lazy … • Assume that you want to work with a Big Data file. • But you are only interested in the lines that contain Python. • If Spark were to load and store all the lines in the file as soon as sc.textFile(…) is called, it would waste storage space. • Therefore, Spark first records all the transformations, and only computes the result when an action is called.
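A quick way to see this laziness in the shell: the first two lines below return immediately without touching the file, and only the action on the last line triggers any reading.

 >>> lines = sc.textFile("README.md")                           # nothing is read yet
 >>> pythonLines = lines.filter(lambda line: "Python" in line)  # still no work done
 >>> pythonLines.first()                                        # the action triggers the computation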
  • 35. Persistence of RDDs • RDDs are re-computed each time you run an action on them. • In order to re-use an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().
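A minimal sketch: persist the filtered RDD once, then reuse it across two actions without recomputing it.

 >>> pythonLines.persist()   # keep the RDD around after it is first computed
 >>> pythonLines.count()     # first action: computes and caches the RDD
 >>> pythonLines.first()     # second action: reuses the cached data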
  • 36. Resilience of RDDs • Once computed, an RDD is materialized in memory. • Persistence to disk is also possible. • Persistence is optional, and not a default behavior. The reason is that if you are not going to re-use an RDD, there is no point in wasting storage space by persisting it. • The ability to re-compute is what makes RDDs resilient to node failures.
  • 38. Working with Key/Value 
 Pairs • Most often you ETL your data into a key/value format. • Key/value RDDs let you 
 count up reviews for each product,
 group together data with the same key,
 group together two different RDDs.
  • 39. Pair RDD • RDDs containing key/value pairs are called pair RDDs. • Pair RDDs are a useful building block in many programs as they expose operations that allow you to act on each key in parallel or regroup data across the network. • For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key. • join() method merges two RDDs together by grouping elements with the same key.
  • 40. Creating Pair RDDs • Use a map() function that returns key/value pairs. •pairs = lines.map(lambda x: (x.split(" ")[0], x))
  • 41. Transformations on Pair RDDs • Let the rdd be [(1,2),(3,4),(3,6)] • reduceByKey(func) combines values with the same key. •>>> rdd.reduceByKey(lambda x,y: x+y) —> [(1,2),(3,10)] •groupByKey() groups values with the same key. •>>> rdd.groupByKey() —> [(1,[2]),(3,[4,6])]
  • 42. Transformations on Pair RDDs • mapValues(func) applies a function to each value of a pair RDD without changing the key. •>>> rdd.mapValues(lambda x: x+1) •keys() returns an rdd of just the keys. •>>> rdd.keys() •values() returns an rdd of just the values. •>>> rdd.values()
  • 43. Transformations on Pair RDDs • sortByKey() returns an rdd, which has the same contents as the original rdd, but sorted by its keys. •>>> rdd.sortByKey()
  • 44. Transformations on Pair RDDs •join() performs an inner join between two RDDs. •let rdd1 be [(1,2),(3,4),(3,6)] and rdd2 be [(3,9)]. •>>> rdd1.join(rdd2) —> [(3,(4,9)),(3,(6,9))]
  • 45. Pair RDDs are still RDDs • You can also filter them, by key or by value. Try it!
  • 46. Pair RDDs are still RDDs • Given that pairs is an RDD with the key being an integer: •>>> filteredRDD = pairs.filter(lambda x: x[0]>5)
  • 47. Let’s do a word count •>>> rdd = sc.textFile("README.md") •>>> words = rdd.flatMap(lambda x: x.split(" ")) •>>> result = 
 words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
  • 48. Let’s identify the top words •>>> sc.textFile("README.md")
 .flatMap(lambda x: x.split(" "))
 .map(lambda x: (x.lower(),1))
 .reduceByKey(lambda x,y: x+y)
 .map(lambda x: (x[1],x[0]))
 .sortByKey(ascending=False)
 .take(5)
  • 50. Per key aggregation •>>> aggregateRDD = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
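A common use of this (sum, count) pattern is a per-key average; continuing with aggregateRDD from above:

 >>> averages = aggregateRDD.mapValues(lambda x: x[0] / float(x[1]))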
  • 51. Grouping data • On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]]. • >>> rdd.groupByKey() • We can group data from multiple RDDs using cogroup(). • Given two RDDs sharing the same key type K, with the respective value types as V and W, the resulting RDD is of type [K, (Iterable[V], Iterable[W])]. • >>> rdd1.cogroup(rdd2)
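A small illustrative example (the two RDDs below are made up; key order in the output may vary, and the grouped values come back as iterables that we convert to lists):

 >>> x = sc.parallelize([("A", 1), ("B", 2)])
 >>> y = sc.parallelize([("A", 3), ("C", 4)])
 >>> [(k, (list(v), list(w))) for k, (v, w) in x.cogroup(y).collect()]
 [('A', ([1], [3])), ('B', ([2], [])), ('C', ([], [4]))]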
  • 52. Joins • There are two types of joins: inner joins and outer joins. • Inner joins require a key to be present in both RDDs. There is a join() call. • Outer joins do not require a key to be present in both RDDs. There is a leftOuterJoin() and a rightOuterJoin(). None is used as the value for the RDD that is missing the key.
  • 53. Joins •>>> rdd1,rdd2=[('A',1),('B',2),('C',1)],[('A',3),('C',2),('D',4)] •>>> rdd1,rdd2=sc.parallelize(rdd1),sc.parallelize(rdd2) •>>> rdd1.leftOuterJoin(rdd2).collect()
 [('A', (1, 3)), ('C', (1, 2)), ('B', (2, None))] •>>> rdd1.rightOuterJoin(rdd2).collect()
 [('A', (1, 3)), ('C', (1, 2)), ('D', (None, 4))]
  • 54. Sorting data • We can sort an RDD with Key/Value pairs provided that there is an ordering defined on the key. • Once we sorted our data, subsequent calls, e.g., collect(), return ordered data. •>>> rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
  • 55. Actions on pair RDDs •>>> rdd1 = sc.parallelize([('A',1),('B',2),('C',1)]) •>>> rdd1.collectAsMap()
 {'A': 1, 'B': 2, 'C': 1} •>>> rdd1.countByKey()['A']
 1
  • 57. Accumulators • Accumulators are shared variables. • They are used to aggregate values from worker nodes back to the driver program. • One of the most common uses of accumulators is to count events that occur during job execution for debugging purposes.
  • 58. Accumulators •>>> inputfile = sc.textFile(inputFile) • ## Let’s create an Accumulator[Int] initialized to 0 •>>> blankLines = sc.accumulator(0)
  • 59. Accumulators •>>> def parseOutAndCount(line):
 # Make the global variable accessible
 global blankLines
 if (line == ""): blankLines += 1 
 return line.split(" ") •>>> rdd = inputfile.flatMap(parseOutAndCount) • Do an action so that the workers do real work! •>>> rdd.saveAsTextFile(outputDir + "/xyz") •>>> blankLines.value
  • 60. Accumulators & 
 Fault Tolerance • Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. • For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node. • If the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and take its result instead if that finishes earlier.
  • 61. Accumulators & 
 Fault Tolerance • Even if no nodes fail, Spark may have to rerun a task to rebuild a cached value that falls out of memory. 
 
 
 “The net result is therefore that the same function may run multiple times on the same data depending on what happens on the cluster.”
  • 62. Accumulators & 
 Fault Tolerance • For accumulators used in actions, Spark applies each task’s update to each accumulator only once. • For accumulators used in RDD transformations instead of actions, this guarantee does not exist. • Bottom line: use accumulators only in actions.
  • 63. Broadcast Variables • Spark’s second type of shared variable, broadcast variables, allows the program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. • Use it if your application needs to send a large, read-only lookup table or a large feature vector in a machine learning algorithm to all the nodes.
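A minimal sketch with a made-up lookup table: broadcast a small dict once, then reference it inside a transformation through its .value attribute.

 >>> countryNames = sc.broadcast({"TR": "Turkey", "DE": "Germany"})
 >>> sales = sc.parallelize([("TR", 10), ("DE", 5)])
 >>> sales.map(lambda x: (countryNames.value[x[0]], x[1])).collect()
 [('Turkey', 10), ('Germany', 5)]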
  • 64. Yahoo SEM Click Data • Dataset:Yahoo’s Search Marketing Advertiser Bid- Impression-Click data, version 1.0 • 77,850,272 rows, 8.1GB in total. • Data fields:
 0 day
 1 anonymized account_id
 2 rank
 3 anonymized keyphrase (list of anonymized keywords)
 4 avg bid
 5 impressions
 6 clicks
  • 65. Sample data rows
 1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
 29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
 29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
 11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0
 76 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
 48 08bade48-1081-488f-b459-6c75d75312ae 3 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
 97 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
 123 08bade48-1081-488f-b459-6c75d75312ae 5 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
 119 08bade48-1081-488f-b459-6c75d75312ae 3 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
 73 08bade48-1081-488f-b459-6c75d75312ae 1 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 1.0 0.0
  • Primary key: date, account_id, rank and keyphrase. • Average bid, impressions and clicks information is aggregated over the primary key.
  • 66. Feeling clicky?
 keyphrase                        | impressions | clicks
 iphone 6 plus for cheap          | 100         | 2
 new samsung tablet               | 10          | 1
 iphone 5 refurbished             | 2           | 0
 learn how to program for iphone  | 200         | 0
  • 67. Getting Clicks = Popularity • Click Through Rate (CTR) = (# of clicks) / (# of impressions) • If CTR > 0, it is a popular keyphrase. • If CTR == 0, it is an unpopular keyphrase.
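For example, using the table above: “iphone 6 plus for cheap” has CTR = 2/100 = 0.02 > 0, so it counts as popular, while “learn how to program for iphone” has CTR = 0/200 = 0, so it counts as unpopular.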
  • 68. Keyphrase = {terms} • Given keyphrase “iphone 6 plus for cheap”, its terms are: 
 
 iphone
 6
 plus
 for
 cheap
  • 69. Contingency table
 Keyphrases      | got clicks | no clicks    | Total
 term t present  | s          | n-s          | n
 term t absent   | S-s        | (N-S)-(n-s)  | N-n
 Total           | S          | N-S          | N
  • 70. Clickiness of a term • For the term presence to click reception contingency table shown previously, we can compute a given term t’s clickiness value ct as follows: 
 ct = log [ ((s+0.5)/(S-s+0.5)) / ((n-s+0.5)/(N-n-S+s+0.5)) ]
  • 71. Clickiness of a keyphrase • Given a keyphrase K that consists of terms t1 t2 … tn, 
 its clickiness can be computed by summing up the clickiness of the terms present in it. • That is, cK = ct1 + ct2 + … + ctn
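As a sketch, assuming cts is a plain dict mapping terms to their clickiness values (terms not seen in training are simply skipped), the summation looks like this; the same pattern appears later inside the accuracy function.

 >>> def keyphrase_clickiness(keyphrase, cts):
         # sum the clickiness of the terms we have a value for
         return sum(cts[t] for t in keyphrase.split() if t in cts)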
  • 72. Feeling clicky?
 keyphrase                        | impressions | clicks | clickiness
 iphone 6 plus for cheap          | 100         | 2      | 1
 new samsung tablet               | 10          | 1      | 1
 iphone 5 refurbished             | 2           | 0      | 0
 learn how to program for iphone  | 200         | 0      | 0
  • 73. Clickiness of iphone
 Keyphrases           | got clicks | no clicks | Total
 term iphone present  | 1          | 2         | 3
 term iphone absent   | 1          | 0         | 1
 Total                | 2          | 2         | 4
  • 74. Clickiness of iphone 
 ciphone = log [ ((1+0.5)/(1+0.5)) / ((2+0.5)/(0+0.5)) ]
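Plugging the numbers in (using the natural log, as math.log does later in the pipeline), iphone comes out with a negative clickiness in this toy table:

 >>> import math
 >>> math.log(((1+0.5)/(1+0.5)) / ((2+0.5)/(0+0.5)))
 -1.6094379124341003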
  • 75. • Given keyphrases and their clickiness
 
 k1 = t12 t23 … t99 1 
 k2 = t19 t201 … t1 0
 k3 = t1 t2 … t101 1
 …
 …
 kn = t1 t2 … t101 1 Mapping
  • 76. Mapping Yahoo’s click data •>>> import math •>>> rdd = sc.textFile("yahoo_keywords_bids_clicks")
 .map(lambda line: (line.split("\t")[3], 
 (float(line.split("\t")[-2]), float(line.split("\t")[-1])))) •>>> rdd = 
 rdd.reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1]))
 .mapValues(lambda x: 1 if (x[1]/x[0])>0 else 0)
  • 77. • Given keyphrases and their clickiness
 
 k1 = t12 t23 … t99 1 
 k2 = t19 t201 … t1 0
 k3 = t1 t2 … t101 1
 …
 …
 kn = t1 t2 … t101 1 flatMapping (t19, 0), (t201, 0),…, (t1, 0) flatMap it to
  • 78. • Given keyphrases and their clickiness
 
 k1 = t12 t23 … t99 1 
 k2 = t19 t201 … t1 0
 k3 = t1 t2 … t101 1
 …
 …
 kn = t1 t2 … t101 1 flatMapping (t19, 0), (t201, 0),…, (t1, 0) flatMap it to (t1, 1), (t2, 1),…, (t101, 1) flatMap it to
  • 79. flatMapping •>>> keyphrases0 = rdd.filter(lambda x: x[1]==0) •>>> keyphrases1 = rdd.filter(lambda x: x[1]==1) •>>> rdd0 = 
 keyphrases0.flatMap(lambda x: [(e,1) for e in x[0].split()]) •>>> rdd1 = 
 keyphrases1.flatMap(lambda x: [(e,1) for e in x[0].split()]) •>>> iR = keyphrases0.count() •>>> R = keyphrases1.count()
  • 80. Reducing (t1, 19) (t12, 19) (t101, 19) … … (t1, 200) (t12, 11) (t101, 1) … … rdd0 rdd1
  • 81. Reducing by Key and 
 Mapping Values •>>> t_rdd0 = rdd0.reduceByKey(lambda x,y: x+y).mapValues(lambda x: (x+0.5)/(iR-x+0.5)) •>>> t_rdd1 = rdd1.reduceByKey(lambda x,y: x+y).mapValues(lambda x: (x+0.5)/(R-x+0.5))
  • 82. MappingValues (t1, some float value) (t12, some float value) (t101, some float value) … … (t1, some float value) (t12, some float value) (t101, some float value) … … t_rdd0 t_rdd1
  • 83. Joining to compute ct (t1, some float value) (t12, some float value) (t101, some float value) … … (t1, some float value) (t12, some float value) (t101, some float value) … … t_rdd0 t_rdd1
  • 84. Joining to compute ct •>>> ct_rdd = t_rdd0.join(t_rdd1).mapValues(lambda x: math.log(x[1]/x[0]))
  • 85. Broadcasting to all workers the look-up table ct •>>> cts = sc.broadcast(dict(ct_rdd.collect()))
  • 86. Measuring the accuracy of clickiness prediction •>>> def accuracy(rdd, cts, threshold):
     csv_rdd = rdd.map(lambda x: (x[0], x[1],
         sum([cts.value[t] for t in x[0].split() if t in cts.value])))
     results = (csv_rdd.map(lambda x: (x[1] == (1 if x[2] > threshold else 0), 1))
                .reduceByKey(lambda x, y: x+y).collect())
     print float(results[1][1]) / (results[0][1] + results[1][1]) •>>> accuracy(rdd,cts,10) •>>> accuracy(rdd,cts,-10)
  • 88. Spark SQL • Spark’s interface to work with structured and semi-structured data. • Structured data is any data that has a schema, i.e., a known set of fields for each record.
  • 89. Spark SQL • Spark SQL can load data from a variety of structured sources (e.g., JSON, Hive and Parquet). • Spark SQL lets you query the data using SQL both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau. • You can join RDDs and SQL Tables using Spark SQL.
  • 90. Spark SQL • Spark SQL provides a special type of RDD called SchemaRDD. • A SchemaRDD is an RDD of Row objects, each representing a record. • A SchemaRDD knows the schema of its rows. • You can run SQL queries on SchemaRDDs. • You can create SchemaRDD from external data sources, from the result of queries, or from regular RDDs.
  • 92. Spark SQL • Spark SQL can be used via SQLContext or HiveContext. • SQLContext supports a subset of Spark SQL functionality excluding Hive support. • Use HiveContext. • If you have an existing Hive installation, you need to copy your hive-site.xml to Spark’s configuration directory.
  • 93. Spark SQL • Spark will create its own Hive metastore (metadata DB) called metastore_db in your program’s work directory. • The tables you create will be placed underneath 
 /user/hive/warehouse on your default file system:
 
 - local FS, or
 
 - HDFS if you have hdfs-site.xml on your classpath.
  • 94. Creating a HiveContext • >>> ## Assuming that sc is our SparkContext •>>> from pyspark.sql import HiveContext, Row •>>> hiveCtx = HiveContext(sc)
  • 95. Basic Query Example • ## Assume that we have an input JSON file. •>>> rdd=hiveCtx.jsonFile("reviews_Books.json") •>>> rdd.registerTempTable("reviews") •>>> topterms = hiveCtx.sql("SELECT * FROM reviews LIMIT 10").collect()
  • 96. SchemaRDD • Both loading data and executing queries return a SchemaRDD. • A SchemaRDD is an RDD composed of Row objects with additional schema information of the types in each column. • Row objects are wrappers around arrays of basic types (e.g., integers and strings). • In most recent Spark versions, SchemaRDD is renamed to DataFrame.
  • 97. SchemaRDD • A SchemaRDD is also an RDD, and you can run regular RDD transformations (e.g., map() and filter()) on them as well. • You can register any SchemaRDD as a temporary table to query it via hiveCtx.sql.
  • 98. Working with Row objects • In Python, you access the ith row element using row[i] or using the column name as row.column_name. •>>> topterms.map(lambda row: row.Keyword)
  • 99. Caching • If you expect to run multiple tasks or queries against the same data, you can cache it. •>>> hiveCtx.cacheTable("mysearchterms") • When caching a table, Spark SQL represents the data in an in-memory columnar format. • The cached table will be destroyed once the driver exits.
  • 101. Converting an RDD to a SchemaRDD • First create an RDD of Row objects and then call inferSchema() on it. •>>> rdd = sc.parallelize([Row(name="hero", favouritecoffee="industrial blend")]) •>>> srdd = hiveCtx.inferSchema(rdd) •>>> srdd.registerTempTable("myschemardd")
  • 102. Working with nested data •>>> import json •>>> a = [{'name': 'mickey'}, {'name': 'pluto', 'knows': {'friends': ['mickey', 'donald']}}] •>>> rdd = sc.parallelize(a) •>>> rdd.map(lambda x: json.dumps(x)).saveAsTextFile("test") •>>> srdd = sqlContext.jsonFile("test")
  • 103. Working with nested data • >>> srdd.printSchema() 
 root
 |-- knows: struct (nullable = true)
 | |-- friends: array (nullable = true)
 | | |-- element: string (containsNull = true)
 |-- name: string (nullable = true)
  • 104. Working with nested data •>>> srdd.registerTempTable("test") • >>> sqlContext.sql("SELECT knows.friends FROM test").collect()
  • 105. MLlib
  • 106. MLlib • Spark’s library of machine learning functions. • The design philosophy is simple:
 - Invoke ML algorithms on RDDs.
  • 107.–109. Learning in a nutshell (figure-only slides)
  • 110. Text Classification • Step 1. Start with an RDD of strings representing your messages. • Step 2. Run one of MLlib’s feature extraction algorithms to convert text into numerical features (suitable for learning algorithms). The result is an RDD of vectors. • Step 3. Call a classification algorithm (e.g., logistic regression) on the RDD of vectors. The result is a model.
  • 111. Text Classification • Step 4. You can evaluate the model on a test set. • Step 5. You can use the model to make predictions: given a new data sample, you can classify it using the model.
  • 112. System requirements • MLlib requires gfortran runtime library for your OS. • MLlib needs NumPy.
  • 113. Spam Classification •>>> from pyspark.mllib.regression import LabeledPoint •>>> from pyspark.mllib.feature import HashingTF •>>> from pyspark.mllib.classification import LogisticRegressionWithSGD •>>> spamRows = sc.textFile("spam.txt") •>>> hamRows = sc.textFile("ham.txt")
  • 114. Spam Classification • ### for mapping emails to vectors of 10000 features. •>>> tf = HashingTF(numFeatures=10000)
  • 115. Spam Classification • ## Feature Extraction, email —> word features •>>> spamFeatures = spamRows.map(lambda email: tf.transform(email.split(" "))) •>>> hamFeatures = hamRows.map(lambda email: tf.transform(email.split(" ")))
  • 116. Spam Classification • ### Label feature vectors •>>> spamExamples = spamFeatures.map(lambda features: LabeledPoint(1, features)) •>>> hamExamples = hamFeatures.map(lambda features: LabeledPoint(0, features))
  • 117. Spam Classification •>>> trainingData = spamExamples.union(hamExamples) • ### Since learning via Logistic Regression is iterative •>>> trainingData.cache()
  • 118. Spam Classification •>>> model = LogisticRegressionWithSGD.train(trainingData)
  • 119. Spam Classification • ### Let’s test it! •>>> posTest = tf.transform("O M G GET cheap stuff".split(" ")) •>>> negTest = tf.transform("Enjoy Spark on Machine Learning".split(" ")) •>>> print model.predict(posTest) •>>> print model.predict(negTest)
  • 120. Data Types • MLlib contains a few specific data types located in pyspark.mllib. •Vector : a mathematical vector (sparse or dense). •LabeledPoint : a pair of feature vector and its label. •Rating : a rating of a product by a user. • Various Model classes : the resulting model from training. It has a predict() function for ad-hoc querying.
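For example, a brief sketch of constructing these types (the numbers are arbitrary):

 >>> from pyspark.mllib.linalg import Vectors
 >>> from pyspark.mllib.regression import LabeledPoint
 >>> dense = Vectors.dense([1.0, 0.0, 3.0])          # dense vector
 >>> sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})    # same vector in sparse form
 >>> point = LabeledPoint(1.0, sparse)               # feature vector paired with label 1.0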