Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
This slide introduces Hadoop Spark.
Just to help you construct an idea of Spark regarding its architecture, data flow, job scheduling, and programming.
Not all technical details are included.
Unified Big Data Processing with Apache Spark (QCON 2014)
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changes
Processing Large Data with Apache Spark -- HasGeek
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
Apache Spark: The Next Gen toolset for Big Data Processing
The Spark project from Apache(spark.apache.org), is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing for orders of magnitude improvement in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch mode Big Data processor and depends on disk based files. Spark improves on this and supports real time and interactive processing, in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLib: Machine Learning
4. Performance characteristics of Spark
This document provides an overview of Apache Spark, including its goal of providing a fast and general engine for large-scale data processing. It discusses Spark's programming model, components like RDDs and DAGs, and how to initialize and deploy Spark on a cluster. Key aspects covered include RDDs as the fundamental data structure in Spark, transformations and actions, and storage levels for caching data in memory or disk.
This session covers how to work with PySpark interface to develop Spark applications. From loading, ingesting, and applying transformation on the data. The session covers how to work with different data sources of data, apply transformation, python best practices in developing Spark Apps. The demo covers integrating Apache Spark apps, In memory processing capabilities, working with notebooks, and integrating analytics tools into Spark Applications.
Spark Summit East 2015 Advanced Devops Student Slides
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
Introduction to real time big data with Apache Spark
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture, top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. Also it covers real life use cases related to one of ours commercial projects and recall roadmap how we’ve integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
This document provides an overview of SparkContext and Resilient Distributed Datasets (RDDs) in Apache Spark. It discusses how to create RDDs using SparkContext functions like parallelize(), range(), and textFile(). It also covers DataFrames and converting between RDDs and DataFrames. The document discusses partitions and the level of parallelism in Spark, as well as the execution environment involving DAGScheduler, TaskScheduler, and SchedulerBackend. It provides examples of RDD lineage and describes Spark clusters like Spark Standalone and the Spark web UI.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
The document outlines an agenda for a conference on Apache Spark and data science, including sessions on Spark's capabilities and direction, using DataFrames in PySpark, linear regression, text analysis, classification, clustering, and recommendation engines using Spark MLlib. Breakout sessions are scheduled between many of the technical sessions to allow for hands-on work and discussion.
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
Data Science with Spark - Training at SparkSummit (East)
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted at Youtube https://www.youtube.com/watch?v=oTOgaMZkBKQ
This document outlines steps for developing analytic applications using Apache Spark and Python. It covers prerequisites for accessing flight and weather data, deploying a simple data pipe tool to build training, test, and blind datasets, and using an IPython notebook to train predictive models on flight delay data. The agenda includes accessing necessary services on Bluemix, preparing the data, training models in the notebook, evaluating model accuracy, and deploying models.
This document provides an introduction to Apache Spark, including its core components, architecture, and programming model. Some key points:
- Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure, which are immutable distributed collections that allow in-memory computing across a cluster.
- RDDs support transformations like map, filter, reduce, and actions like collect that return results. Transformations are lazy while actions trigger computation.
- Spark's execution model involves a driver program that coordinates tasks on worker nodes using an optimized scheduler.
- Spark SQL, MLlib, GraphX, and Spark Streaming extend the core Spark API for structured data, machine learning, graph processing, and stream processing
Python and Bigdata - An Introduction to Spark (PySpark)
This document provides an introduction to Spark and PySpark for processing big data. It discusses what Spark is, how it differs from MapReduce by using in-memory caching for iterative queries. Spark operations on Resilient Distributed Datasets (RDDs) include transformations like map, filter, and actions that trigger computation. Spark can be used for streaming, machine learning using MLlib, and processing large datasets faster than MapReduce. The document provides examples of using PySpark on network logs and detecting good vs bad tweets in real-time.
Apache Spark's Tutorial talk, In this talk i explained how to start working with Apache spark, feature of apache spark and how to compose data platform with spark. This talk also explains about reactive platform, tools and framework like Play, akka.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
- The document discusses a presentation given by Jongwook Woo on introducing Spark and its uses for big data analysis. It includes information on Woo's background and experience with big data, an overview of Spark and its components like RDDs and task scheduling, and examples of using Spark for different types of data analysis and use cases.
This document provides an overview of Near Real Time Analysis of Web Scale Social Data. It discusses how SpotRight collects and organizes publicly available user-generated social data at web scale, including connections between users, actions, events, profiles, demographics, and more from sources like Twitter, Pinterest, blogs and articles. It describes SpotRight's goals, architecture, algorithms, and tools used to perform real-time analysis and deliver timely insights to clients, including graph building, profile creation, and delivery of results. Key aspects involve collecting petabytes of social data, performing distributed graph algorithms at scale, and querying and delivering insights from the organized data.
This document provides an overview of a machine learning workshop including tutorials on decision tree classification for flight delays, clustering news articles with k-means clustering, and collaborative filtering for movie recommendations using Spark. The tutorials demonstrate loading and preparing data, training models, evaluating performance, and making predictions or recommendations. They use Spark MLlib and are run in Apache Zeppelin notebooks.
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLib provides scalable machine learning. Spark Streaming processes live data streams.
This document provides an overview of Scala and compares it to Java. It discusses Scala's object-oriented and functional capabilities, how it compiles to JVM bytecode, and benefits like less boilerplate code and support for functional programming. Examples are given of implementing a simple Property class in both Java and Scala to illustrate concepts like case classes, immutable fields, and less lines of code in Scala. The document also touches on Java interoperability, learning Scala gradually, XML processing capabilities, testing frameworks, and tool/library support.
This document provides an introduction to the Scala programming language. It discusses how Scala runs on the Java Virtual Machine, supports both object-oriented and functional programming paradigms, and provides features like pattern matching, immutable data structures, lazy evaluation, and parallel collections. Scala aims to be concise, expressive, and extensible.
Manchester Hadoop Meetup: Spark Cassandra Integration
This document discusses using Apache Spark and the Spark Cassandra connector to perform analytics on data stored in Apache Cassandra. It provides an overview of Spark and its components, describes how the Spark Cassandra connector allows reading and writing Cassandra data as Spark RDDs, and gives examples of migrating data from a relational database to Cassandra, performing aggregations with Spark SQL, and using Spark Streaming to process streaming Cassandra data.
In July 2016, we conducted our Apache Spark Survey to identify insights on how organizations are using Spark and highlight growth trends since our last Spark Survey 2015. The 2016 survey results reflect answers from 900 distinct organizations and 1615 respondents, who were predominantly Apache Spark users. The results show that the Spark community is...
https://databricks.com/blog/2016/09/27/spark-survey-2016-released.html
Here are the answers to your questions:
1. The main differences between a Trait and Abstract Class in Scala are:
- Traits can be mixed in to classes using with, while Abstract Classes can only be extended.
- Traits allow for multiple inheritance as they can be mixed in, while Abstract Classes only allow single inheritance.
- Abstract Classes can have fields and constructor parameters while Traits cannot.
- Abstract Classes can extend other classes, while Traits can only extend other Traits.
2. abstract class Animal {
def isMammal: Boolean
def isFriendly: Boolean = true
def summarize: Unit = {
println("Characteristics of animal:")
}
This document provides information about running Spark on YARN including:
- Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs).
- When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine.
- Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.
spark example spark example spark examplespark examplespark examplespark example
Spark is a fast general-purpose engine for large-scale data processing. It has advantages over MapReduce like speed, ease of use, and running everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing. It can run on Scala, Java, Python. Spark applications have drivers, executors, tasks and run RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
Introduction to Apache Spark :: Lagos Scala Meetup session 2
This is an introductory tutorial to Apache Spark at the Lagos Scala Meetup II. We discussed the basics of processing engine, Spark, how it relates to Hadoop MapReduce. Little handson at the end of the session.
Real time Analytics with Apache Kafka and Apache Spark
A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Spark is a fast and general engine for large-scale data processing. It was designed to be fast, easy to use and supports machine learning. Spark achieves high performance by keeping data in-memory as much as possible using its Resilient Distributed Datasets (RDDs) abstraction. RDDs allow data to be partitioned across nodes and operations are performed in parallel. The Spark architecture uses a master-slave model with a driver program coordinating execution across worker nodes. Transformations operate on RDDs to produce new RDDs while actions trigger job execution and return results.
Spark - The Ultimate Scala Collections by Martin Odersky
Spark is a domain-specific language for working with collections that is implemented in Scala and runs on a cluster. While similar to Scala collections, Spark differs in that it is lazy and supports additional functionality for paired data. Scala can learn from Spark by adding views to make laziness clearer, caching for persistence, and pairwise operations. Types are important for Spark as they prevent logic errors and help with programming complex functional operations across a cluster.
An Engine to process big data in faster(than MR), easy and extremely scalable way. An Open Source, parallel, in-memory processing, cluster computing framework. Solution for loading, processing and end to end analyzing large scale data. Iterative and Interactive : Scala, Java, Python, R and with Command line interface.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
Abstract –
Spark 2 is here, while Spark has been the leading cluster computation framework for severl years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics, it uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
This document discusses Apache Spark machine learning (ML) workflows for recommendation systems using collaborative filtering. It describes loading rating data from users on items into a DataFrame and splitting it into training and test sets. An ALS model is fit on the training data and used to make predictions on the test set. The root mean squared error is calculated to evaluate prediction accuracy.
Trading Privacy for Value
In the start-up culture of the 21st century, we live by the motto “move fast and break things.” What if what gets broken is society*?
how can we build data products and services that use data ethically & responsibly?
how do companies take a data (science) project from lab to production successfully?
Systems that can explain their decisions.
how can we interconnect the web of data, its agents, and their decisions to enlarge the pie?
Slides are from my welcome speech to the 2014-2015 Freshmen at Computer Science Department of Istanbul Sehir University. I emphasize the command of English, building trust, and being self-organized as three key takeaways.
This document discusses the need for data science skills and proposes a curriculum to address the skills gap. It notes that the web has evolved from static HTML to user-generated content and now machines understanding information. Current jobs require data analysis, idea generation, and hypothesis testing skills. A study found enterprises have major skills gaps in mobile, cloud, social and analytics technologies. The proposed curriculum aims to directly teach needed skills while keeping students engaged. Core classes focus on algorithms, systems, architecture, and machine intelligence. The curriculum is designed to bridge undergraduate and graduate programs and use Python to keep students engaged with hands-on projects. A future data science graduate program is outlined focusing on data engineering, networks, visualization, scalable systems, big data
This document provides an overview of the CS 361 Software Engineering course. It outlines attendance rules, instructors, required coursebooks, and key topics that will be covered including Agile development methodologies, Waterfall methodology, the Agile Manifesto, enabling technologies for Agile development, pair programming, user stories, system metaphors, on-site customers, and more. The document aims to introduce students to the structure and content of the course.
Open source refers to the process by which software is created, not the software itself. The open source process involves voluntary participation where anyone can contribute code freely and choose what tasks to work on. It relies on collaboration between many developers worldwide who are motivated to scratch an itch, avoid reinventing the wheel, solve problems in parallel, and leverage the law of large numbers through continuous beta testing. Documentation and frequent releases are also important aspects of open source development.
This document summarizes Week 3 of a Python programming course. It discusses introspection, which allows code to examine and manipulate other code as objects. It covers optional and named function arguments, built-in functions like type and str, and filtering lists with comprehensions. It also explains lambda functions and how and and or work in Python.
This document provides a summary of Week 2 of a Python programming course. It discusses dictionaries, including defining, modifying, and deleting dictionary items. It also covers lists, such as defining and slicing lists, as well as adding, searching, and deleting list elements. Finally, it introduces tuples as immutable lists and discusses variable declaration and string formatting in Python.
In this presentation, we provide the details of an ecosystem to foster scholarly work at an educational institution. Various research and funding processes are outlined to set up and execute a successful operational model.
This presentation outlines two main startup/business development models: product development model, customer development model. The right methodology is to use both at the same time with constant feedback and learning.
The document discusses the potential of group buying deals and collective discounts, noting that people are more likely to purchase items if they feel they are getting a good deal as part of a group. It proposes that a company can leverage their user base and merchant relationships to create dedicated group deal pages and use marketing techniques like emails and pop-ups to promote the deals in order to benefit both consumers and merchants through a commission-based sales model.
VMware ESX Server provides a virtualization platform for mission-critical environments. It utilizes hardware virtualization to present virtual machines with direct access to resources, allowing multiple guest operating systems to run in isolation on a single physical server. ESX Server offers a bare-metal architecture for high performance, as well as granular resource management and hardware support from major vendors to maximize utilization and flexibility.
Applications of Data Science in Various Industries
The wide-ranging applications of data science across industries.
From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights.
Learn how data science enhances decision-making, boosts productivity, and fosters new advancements in technology and business. Explore real-world examples of data science applications today.
LLM powered contract compliance application which uses Advanced RAG method Self-RAG and Knowledge Graph together for the first time.
It provides highest accuracy for contract compliance recorded so far for Oil and Gas Industry.
Introduction to Apache Spark Developer TrainingCloudera, Inc.
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn What Apache Spark is and how it compares to Hadoop MapReduce, How to filter, map, reduce, and save Resilient Distributed Datasets (RDDs), Who is best suited to attend the course and what prior knowledge you should have, and the benefits of building Spark applications as part of an enterprise data hub.
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
This slide introduces Hadoop Spark.
Just to help you construct an idea of Spark regarding its architecture, data flow, job scheduling, and programming.
Not all technical details are included.
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
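To see what each stage of the Scala pipeline above computes, here is a plain-Python stand-in (an illustration of the semantics, not the Spark API; the sample lines are made up):

```python
from collections import Counter

# Made-up stand-ins for the README.md and CHANGES.txt contents.
readme_lines = ["Apache Spark", "a fast engine"]
changes_lines = ["Spark 1.2 released", "misc bug fixes"]

def word_count(lines):
    # filter(_.contains("Spark")), flatMap(_.split(" ")),
    # map((_, 1)), reduceByKey(_ + _)
    return Counter(word for line in lines if "Spark" in line
                   for word in line.split(" "))

readme_counts = word_count(readme_lines)
changes_counts = word_count(changes_lines)

# join keeps only keys present in both datasets, pairing the two counts
joined = {w: (readme_counts[w], changes_counts[w])
          for w in readme_counts.keys() & changes_counts.keys()}
print(joined)  # {'Spark': (1, 1)}
```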
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
In this one-day workshop, we will introduce Spark in a high-level context. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
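The transformation/action split listed in the topics above can be previewed with a plain-Python analogy: like RDD transformations, Python generators describe work lazily, and only an "action" forces evaluation (a sketch of the idea, not Spark code):

```python
data = [1, 2, 3, 4, 5]

# "Transformations": build a recipe without computing anything yet,
# the way rdd.map() and rdd.filter() do.
doubled = (x * 2 for x in data)          # lazy, like map()
evens = (x for x in doubled if x > 4)    # lazy, like filter()

# "Action": forces the whole pipeline to run, like collect() or count().
result = list(evens)
print(result)  # [6, 8, 10]
```

Just as with RDDs, nothing runs until the final step asks for concrete results.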
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
Apache Spark: The Next Gen toolset for Big Data Processing – prajods
The Spark project from Apache (spark.apache.org) is the next generation of big data processing systems. It uses a new architecture and in-memory processing for orders-of-magnitude improvement in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode big data processor and depends on disk-based files. Spark improves on this and supports real-time and interactive processing, in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine Learning
4. Performance characteristics of Spark
This document provides an overview of Apache Spark, including its goal of providing a fast and general engine for large-scale data processing. It discusses Spark's programming model, components like RDDs and DAGs, and how to initialize and deploy Spark on a cluster. Key aspects covered include RDDs as the fundamental data structure in Spark, transformations and actions, and storage levels for caching data in memory or disk.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. The session covers how to work with different data sources, apply transformations, and follow Python best practices in developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Spark Summit East 2015 Advanced Devops Student Slides – Databricks
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Beneath RDD in Apache Spark by Jacek Laskowski – Spark Summit
This document provides an overview of SparkContext and Resilient Distributed Datasets (RDDs) in Apache Spark. It discusses how to create RDDs using SparkContext functions like parallelize(), range(), and textFile(). It also covers DataFrames and converting between RDDs and DataFrames. The document discusses partitions and the level of parallelism in Spark, as well as the execution environment involving DAGScheduler, TaskScheduler, and SchedulerBackend. It provides examples of RDD lineage and describes Spark clusters like Spark Standalone and the Spark web UI.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
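The narrow/wide distinction above can be sketched in plain Python (not the Spark API): a narrow transformation like map runs on each partition independently, while a wide one like reduceByKey must first bring all values for a key together across partitions.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) records.
partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]

# Narrow: map runs per partition with no data movement.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide: reduceByKey needs a shuffle so all values for a key
# meet in one place before they can be summed.
shuffled = defaultdict(list)
for part in mapped:
    for k, v in part:
        shuffled[k].append(v)
reduced = {k: sum(vs) for k, vs in shuffled.items()}
print(reduced)  # {'a': 20, 'b': 10, 'c': 10}
```

The shuffle step is the expensive part, which is why the scheduler can pipeline narrow transformations but must materialize data at wide ones.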
The document outlines an agenda for a conference on Apache Spark and data science, including sessions on Spark's capabilities and direction, using DataFrames in PySpark, linear regression, text analysis, classification, clustering, and recommendation engines using Spark MLlib. Breakout sessions are scheduled between many of the technical sessions to allow for hands-on work and discussion.
Advanced Data Science on Spark (Reza Zadeh, Stanford) – Spark Summit
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
Data Science with Spark - Training at SparkSummit (East) – Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
Video is posted at Youtube https://www.youtube.com/watch?v=oTOgaMZkBKQ
This document outlines steps for developing analytic applications using Apache Spark and Python. It covers prerequisites for accessing flight and weather data, deploying a simple data pipe tool to build training, test, and blind datasets, and using an IPython notebook to train predictive models on flight delay data. The agenda includes accessing necessary services on Bluemix, preparing the data, training models in the notebook, evaluating model accuracy, and deploying models.
This document provides an introduction to Apache Spark, including its core components, architecture, and programming model. Some key points:
- Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure, which are immutable distributed collections that allow in-memory computing across a cluster.
- RDDs support transformations like map, filter, reduce, and actions like collect that return results. Transformations are lazy while actions trigger computation.
- Spark's execution model involves a driver program that coordinates tasks on worker nodes using an optimized scheduler.
- Spark SQL, MLlib, GraphX, and Spark Streaming extend the core Spark API for structured data, machine learning, graph processing, and stream processing.
Python and Bigdata - An Introduction to Spark (PySpark) – hiteshnd
This document provides an introduction to Spark and PySpark for processing big data. It discusses what Spark is, how it differs from MapReduce by using in-memory caching for iterative queries. Spark operations on Resilient Distributed Datasets (RDDs) include transformations like map, filter, and actions that trigger computation. Spark can be used for streaming, machine learning using MLlib, and processing large datasets faster than MapReduce. The document provides examples of using PySpark on network logs and detecting good vs bad tweets in real-time.
Reactive dashboards using Apache Spark – Rahul Kumar
Apache Spark tutorial talk. In this talk I explained how to start working with Apache Spark, the features of Apache Spark, and how to compose a data platform with Spark. The talk also covers the reactive platform and tools and frameworks like Play and Akka.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark – Evan Chan
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
- The document discusses a presentation given by Jongwook Woo on introducing Spark and its uses for big data analysis. It includes information on Woo's background and experience with big data, an overview of Spark and its components like RDDs and task scheduling, and examples of using Spark for different types of data analysis and use cases.
This document provides an overview of Near Real Time Analysis of Web Scale Social Data. It discusses how SpotRight collects and organizes publicly available user-generated social data at web scale, including connections between users, actions, events, profiles, demographics, and more from sources like Twitter, Pinterest, blogs and articles. It describes SpotRight's goals, architecture, algorithms, and tools used to perform real-time analysis and deliver timely insights to clients, including graph building, profile creation, and delivery of results. Key aspects involve collecting petabytes of social data, performing distributed graph algorithms at scale, and querying and delivering insights from the organized data.
This document provides an overview of a machine learning workshop including tutorials on decision tree classification for flight delays, clustering news articles with k-means clustering, and collaborative filtering for movie recommendations using Spark. The tutorials demonstrate loading and preparing data, training models, evaluating performance, and making predictions or recommendations. They use Spark MLlib and are run in Apache Zeppelin notebooks.
An Introduction to Spark - Atlanta Spark Meetup – jlacefie
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLlib provides scalable machine learning. Spark Streaming processes live data streams.
This document provides an overview of Scala and compares it to Java. It discusses Scala's object-oriented and functional capabilities, how it compiles to JVM bytecode, and benefits like less boilerplate code and support for functional programming. Examples are given of implementing a simple Property class in both Java and Scala to illustrate concepts like case classes, immutable fields, and less lines of code in Scala. The document also touches on Java interoperability, learning Scala gradually, XML processing capabilities, testing frameworks, and tool/library support.
Scala presentation by Aleksandar Prokopec – Loïc Descotte
This document provides an introduction to the Scala programming language. It discusses how Scala runs on the Java Virtual Machine, supports both object-oriented and functional programming paradigms, and provides features like pattern matching, immutable data structures, lazy evaluation, and parallel collections. Scala aims to be concise, expressive, and extensible.
Manchester Hadoop Meetup: Spark Cassandra Integration – Christopher Batey
This document discusses using Apache Spark and the Spark Cassandra connector to perform analytics on data stored in Apache Cassandra. It provides an overview of Spark and its components, describes how the Spark Cassandra connector allows reading and writing Cassandra data as Spark RDDs, and gives examples of migrating data from a relational database to Cassandra, performing aggregations with Spark SQL, and using Spark Streaming to process streaming Cassandra data.
In July 2016, we conducted our Apache Spark Survey to identify insights on how organizations are using Spark and highlight growth trends since our last Spark Survey 2015. The 2016 survey results reflect answers from 900 distinct organizations and 1615 respondents, who were predominantly Apache Spark users. The results show that the Spark community is...
https://databricks.com/blog/2016/09/27/spark-survey-2016-released.html
Here are the answers to your questions:
1. The main differences between a trait and an abstract class in Scala are:
- Traits are mixed into classes using with, while an abstract class is extended with extends.
- A class can mix in multiple traits, but can extend only one abstract class.
- Abstract classes can take constructor parameters, while traits cannot (prior to Scala 3).
- Abstract classes map directly onto Java classes, which makes them easier to use from Java code.
2. abstract class Animal {
  def isMammal: Boolean
  def isFriendly: Boolean = true
  def summarize: Unit = {
    println("Characteristics of animal:")
    println(s"Mammal: $isMammal, Friendly: $isFriendly")
  }
}
This document provides information about running Spark on YARN including:
- Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs).
- When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine.
- Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.
Spark example – ShidrokhGoudarzi1
Spark is a fast, general-purpose engine for large-scale data processing. It has advantages over MapReduce in speed, ease of use, and running everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing. It can be programmed in Scala, Java, and Python. Spark applications have drivers, executors, and tasks, and use RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLlib, Shark, GraphX and with a few examples.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
This is an introductory tutorial to Apache Spark at the Lagos Scala Meetup II. We discussed the basics of the processing engine, Spark, and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Real time Analytics with Apache Kafka and Apache Spark – Rahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Spark is a fast and general engine for large-scale data processing. It was designed to be fast, easy to use and supports machine learning. Spark achieves high performance by keeping data in-memory as much as possible using its Resilient Distributed Datasets (RDDs) abstraction. RDDs allow data to be partitioned across nodes and operations are performed in parallel. The Spark architecture uses a master-slave model with a driver program coordinating execution across worker nodes. Transformations operate on RDDs to produce new RDDs while actions trigger job execution and return results.
Spark - The Ultimate Scala Collections by Martin Odersky – Spark Summit
Spark is a domain-specific language for working with collections that is implemented in Scala and runs on a cluster. While similar to Scala collections, Spark differs in that it is lazy and supports additional functionality for paired data. Scala can learn from Spark by adding views to make laziness clearer, caching for persistence, and pairwise operations. Types are important for Spark as they prevent logic errors and help with programming complex functional operations across a cluster.
An engine to process big data in a faster (than MapReduce), easy, and extremely scalable way. An open-source, parallel, in-memory processing, cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computing framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Spark real world use cases and optimizations – Gal Marder
This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.
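One of the use cases mentioned above, finding top results, follows the same shape in plain Python (aggregate by key, then take the largest); this is an illustration with made-up click data, not the actual RDD API:

```python
import heapq
from collections import Counter

# Hypothetical page-click events.
clicks = ["home", "search", "home", "cart", "home", "search"]

# Aggregate per key (like reduceByKey), then take the top 2
# without sorting everything (like takeOrdered).
counts = Counter(clicks)
top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top2)  # [('home', 3), ('search', 2)]
```

In Spark the analogous choice (takeOrdered over a full sortBy) avoids shuffling and sorting the entire dataset just to keep a handful of rows.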
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
This document discusses Apache Spark machine learning (ML) workflows for recommendation systems using collaborative filtering. It describes loading rating data from users on items into a DataFrame and splitting it into training and test sets. An ALS model is fit on the training data and used to make predictions on the test set. The root mean squared error is calculated to evaluate prediction accuracy.
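The evaluation step described above boils down to computing root mean squared error (RMSE) between predicted and held-out ratings. A plain-Python sketch with made-up numbers (the ALS model itself is not reproduced here):

```python
import math

# Hypothetical (user, item) -> rating pairs on a test split,
# and the model's predictions for the same pairs.
actual = {(1, 10): 4.0, (1, 11): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 11): 3.0, (2, 10): 4.0}

# RMSE: square the per-pair errors, average, take the root.
squared_errors = [(predicted[k] - actual[k]) ** 2 for k in actual]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 3))  # 0.645
```

A lower RMSE means the recommender's predicted ratings sit closer to what users actually gave.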
Data Economy: Lessons learned and the Road ahead! – Ahmet Bulut
Trading Privacy for Value
In the start-up culture of the 21st century, we live by the motto “move fast and break things.” What if what gets broken is society*?
how can we build data products and services that use data ethically & responsibly?
how do companies take a data (science) project from lab to production successfully?
Systems that can explain their decisions.
how can we interconnect the web of data, its agents, and their decisions to enlarge the pie?
Slides are from my welcome speech to the 2014-2015 Freshmen at Computer Science Department of Istanbul Sehir University. I emphasize the command of English, building trust, and being self-organized as three key takeaways.
This document discusses the need for data science skills and proposes a curriculum to address the skills gap. It notes that the web has evolved from static HTML to user-generated content and now machines understanding information. Current jobs require data analysis, idea generation, and hypothesis testing skills. A study found enterprises have major skills gaps in mobile, cloud, social and analytics technologies. The proposed curriculum aims to directly teach needed skills while keeping students engaged. Core classes focus on algorithms, systems, architecture, and machine intelligence. The curriculum is designed to bridge undergraduate and graduate programs and use Python to keep students engaged with hands-on projects. A future data science graduate program is outlined focusing on data engineering, networks, visualization, scalable systems, big data
This document provides an overview of the CS 361 Software Engineering course. It outlines attendance rules, instructors, required coursebooks, and key topics that will be covered including Agile development methodologies, Waterfall methodology, the Agile Manifesto, enabling technologies for Agile development, pair programming, user stories, system metaphors, on-site customers, and more. The document aims to introduce students to the structure and content of the course.
Open source refers to the process by which software is created, not the software itself. The open source process involves voluntary participation where anyone can contribute code freely and choose what tasks to work on. It relies on collaboration between many developers worldwide who are motivated to scratch an itch, avoid reinventing the wheel, solve problems in parallel, and leverage the law of large numbers through continuous beta testing. Documentation and frequent releases are also important aspects of open source development.
This document summarizes Week 3 of a Python programming course. It discusses introspection, which allows code to examine and manipulate other code as objects. It covers optional and named function arguments, built-in functions like type and str, and filtering lists with comprehensions. It also explains lambda functions and how "and" and "or" work in Python.
This document provides a summary of Week 2 of a Python programming course. It discusses dictionaries, including defining, modifying, and deleting dictionary items. It also covers lists, such as defining and slicing lists, as well as adding, searching, and deleting list elements. Finally, it introduces tuples as immutable lists and discusses variable declaration and string formatting in Python.
In this presentation, we provide the details of an ecosystem to foster scholarly work at an educational institution. Various research and funding processes are outlined to set up and execute a successful operational model.
This presentation outlines two main startup/business development models: product development model, customer development model. The right methodology is to use both at the same time with constant feedback and learning.
The document discusses the potential of group buying deals and collective discounts, noting that people are more likely to purchase items if they feel they are getting a good deal as part of a group. It proposes that a company can leverage their user base and merchant relationships to create dedicated group deal pages and use marketing techniques like emails and pop-ups to promote the deals in order to benefit both consumers and merchants through a commission-based sales model.
VMware ESX Server provides a virtualization platform for mission-critical environments. It utilizes hardware virtualization to present virtual machines with direct access to resources, allowing multiple guest operating systems to run in isolation on a single physical server. ESX Server offers a bare-metal architecture for high performance, as well as granular resource management and hardware support from major vendors to maximize utilization and flexibility.
Applications of Data Science in Various Industries – IABAC
The wide-ranging applications of data science across industries.
From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights.
Learn how data science enhances decision-making, boosts productivity, and fosters new advancements in technology and business. Explore real-world examples of data science applications today.
An LLM-powered contract compliance application which uses the advanced RAG method Self-RAG and a knowledge graph together for the first time.
It provides the highest accuracy for contract compliance recorded so far for the oil and gas industry.
Airline Satisfaction Project using Azure
This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.
3. Cluster Computing
• Apache Spark is a cluster computing platform designed to be fast and general-purpose.
• It runs computational tasks across many worker machines, or a computing cluster.
4. Unified Computing
• In Spark, you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources.
• Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured log files).
• More sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis.
6. Spark Core
• Spark Core: the "computational engine" responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks on a computing cluster.
7. Spark Stack
• Spark Core: the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
• Spark SQL: Spark's package for working with structured data.
• Spark Streaming: Spark component that enables processing of live streams of data.
8. Spark Stack
• MLlib: library containing common machine learning (ML) functionality including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
• GraphX: library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
• Cluster Managers: Standalone Scheduler, Apache Mesos, Hadoop YARN.
9. "Data Scientist: a person who is better at statistics than a computer engineer, and better at computer engineering than a statistician."
I do not believe in this new job role.
Data science is embracing all stakeholders.
10. Data Scientists of the Spark age
• Data scientists use their skills to analyze data with the goal of answering a question or discovering insights.
• The data science workflow involves ad hoc analysis.
• Data scientists use interactive shells (vs. building complex applications) for seeing the results of their queries and for writing snippets of code quickly.
11. Data Scientists of the Spark age
• Spark's speed and simple APIs shine for data science, and its built-in libraries mean that many useful algorithms are available out of the box.
12. Storage Layer
• Spark can create distributed datasets from any file
stored in the Hadoop distributed filesystem (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local filesystem, Amazon S3, Cassandra,
Hive, HBase, etc.).
• Spark does not require Hadoop; it simply has support
for storage systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro, Parquet,
and any other Hadoop InputFormat.
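As a sketch with hypothetical paths, the same textFile() call accepts any URI scheme the Hadoop APIs support:

```python
# The same textFile() call works for any storage system supported by the
# Hadoop APIs (all paths below are hypothetical):
#
#   sc.textFile("file:///home/me/README.md")         # local filesystem
#   sc.textFile("hdfs://namenode:8020/data/logs")    # HDFS
#   sc.textFile("s3n://my-bucket/data/logs")         # Amazon S3 (s3n in Spark 1.x)
```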
13. Downloading Spark
• The first step to using Spark is to download and unpack it.
• To get a recent precompiled release of Spark:
• Visit http://spark.apache.org/downloads.html
• Select the package type of “Pre-built for Hadoop 2.4 and
later,” and click “Direct Download.”
• This will download a compressed TAR file, or tarball,
called spark-1.2.0-bin-hadoop2.4.tgz.
14. Directory structure
• README.md
Contains short instructions for getting started with Spark.
• bin
Contains executable files that can be used to interact with
Spark in various ways.
15. Directory structure
• core, streaming, python, ...
Contains the source code of major components of the Spark
project.
• examples
Contains some helpful Spark standalone jobs that you can
look at and run to learn about the Spark API.
16. PySpark
• The first step is to open up one of Spark’s shells. To
open the Python version of the Spark shell, which we
also refer to as the PySpark Shell, go into your Spark
directory and type:
$ bin/pyspark
17. Logging verbosity
• To control the verbosity of the logging, create a file
in the conf directory called log4j.properties.
• To make the logging less verbose, make a copy of conf/
log4j.properties.template called conf/log4j.properties and
find the following line:
log4j.rootCategory=INFO, console
Then lower the log level to
log4j.rootCategory=WARN, console
18. IPython
• IPython is an enhanced Python shell that offers features
such as tab completion. Instructions for installing it are at
http://ipython.org.
• You can use IPython with Spark by setting the
IPYTHON environment variable to 1:
IPYTHON=1 ./bin/pyspark
19. IPython
• To use the IPython Notebook, which is a web-browser-
based version of IPython, use
IPYTHON_OPTS="notebook" ./bin/pyspark
• On Windows, set the variable and run the shell as
follows:
set IPYTHON=1
bin\pyspark
20. Script #1
•# Create an RDD
>>> lines = sc.textFile("README.md")
•# Count the number of items in the RDD
>>> lines.count()
•# Show the first item in the RDD
>>> lines.first()
21. Resilient Distributed
Dataset
• The variable lines is an RDD: Resilient Distributed
Dataset.
• On RDDs, you can run parallel operations.
22. Intro to
Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
• Spark Shell is a driver program itself.
• Driver programs access Spark through a SparkContext
object, which represents a connection to a computing
cluster.
• In the Spark shell, the context is automatically created
as the variable sc.
24. Intro to
Core Spark Concepts
• Driver programs manage a number of nodes called
executors.
• For example, running the count() on a cluster would
translate into different nodes counting the different
ranges of the input file.
26. Standalone applications
• Apart from running interactively, Spark can be linked
into standalone applications in either Python, Scala, or
Java.
• The main difference is that you need to initialize your
own SparkContext.
• How to py it:
Write your applications as Python scripts as you
normally do, but run them with the cluster-aware
spark-submit script.
27. Standalone applications
•$ bin/spark-submit my_script.py
• The spark-submit script sets up the environment for
Spark’s Python API to function by including Spark
dependencies.
28. Initializing Spark in Python
• # Excerpt from your driver program
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
30. Operations on RDDs
• Transformations and Actions.
• Transformations construct a new RDD from a previous
one.
• “Filtering data that matches a predicate” is an example
transformation.
31. Transformations
• Let’s create an RDD that holds strings containing the
word Python.
•>>> pythonLines = lines.filter(lambda line: "Python" in
line)
32. Actions
• Actions compute a result based on an RDD.
• They can return the result to the driver, or to an
external storage system (e.g., HDFS).
•>>> pythonLines.first()
33. Transformations & Actions
• You can create RDDs at any time using transformations.
• But Spark materializes them only once they are used in an
action.
• This is a lazy approach to RDD evaluation.
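The same lazy pattern can be mimicked in plain Python with generators, which also record how to compute a result and do the work only on demand (the file contents below are made up):

```python
# A plain-Python analogue of Spark's laziness: generators record *how* to
# compute and do the work only when a result is demanded.
def read_lines():
    # In Spark this would be sc.textFile(...); here we fake the file.
    for line in ["Spark is fast", "Python rules", "hello world"]:
        yield line

lines = read_lines()                                  # "transformation": no work yet
python_lines = (l for l in lines if "Python" in l)    # still no work done
print(next(python_lines))                             # "action": prints Python rules
```

Like Spark's first(), next() here stops as soon as one matching line is found, without scanning the rest of the input.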
34. Lazy …
• Assume that you want to work with a Big Data file.
• But you are only interested in the lines that contain
Python.
• were Spark to load and save all the lines in the file as
soon as sc.textFile(…) is called, it would waste storage
space.
• Therefore, Spark chooses to see all transformations
first, and then compute the result to an action.
35. Persistence of RDDs
• RDDs are re-computed each time you run an action on
them.
• In order to re-use an RDD in multiple actions, you can
ask Spark to persist it using RDD.persist().
36. Resilience of RDDs
• Once computed, an RDD is materialized in memory.
• Persistence to disk is also possible.
• Persistence is optional, not the default behavior. The
reason is that if you are not going to re-use an RDD,
there is no point in wasting storage space by persisting
it.
• The ability to re-compute is what makes RDDs resilient
to node failures.
38. Working with Key/Value
Pairs
• Most often you ETL your data into a key/value format.
• Key/value RDDs let you
count up reviews for each product,
group together data with the same key,
group together two different RDDs.
39. Pair RDD
• RDDs containing key/value pairs are called pair RDDs.
• Pair RDDs are a useful building block in many programs
as they expose operations that allow you to act on each
key in parallel or regroup data across the network.
• For example, pair RDDs have a reduceByKey() method
that can aggregate data separately for each key.
• join() method merges two RDDs together by grouping
elements with the same key.
40. Creating Pair RDDs
• Use a map() function that returns key/value pairs.
•pairs = lines.map(lambda x: (x.split(" ")[0], x))
41. Transformations on Pair
RDDs
• Let the rdd be [(1,2),(3,4),(3,6)]
• reduceByKey(func) combines values with the same key.
•>>> rdd.reduceByKey(lambda x,y: x+y) —> [(1,2),(3,10)]
•groupByKey() group values with the same key.
•>>> rdd.groupByKey() —> [(1,[2]),(3,[4,6])]
42. Transformations on Pair
RDDs
• mapValues(func) applies a function to each value of a
pair RDD without changing the key.
•>>> rdd.mapValues(lambda x: x+1)
•keys() returns an rdd of just the keys.
•>>> rdd.keys()
•values() returns an rdd of just the values.
•>>> rdd.values()
43. Transformations on Pair
RDDs
• sortByKey() returns an rdd, which has the same contents
as the original rdd, but sorted by its keys.
•>>> rdd.sortByKey()
44. Transformations on Pair
RDDs
•join() performs an inner join between two RDDs.
•let rdd1 be [(1,2),(3,4),(3,6)] and rdd2 be [(3,9)].
•>>> rdd1.join(rdd2) —> [(3,(4,9)),(3,(6,9))]
45. Pair RDDs are still RDDs
You can also filter a pair RDD by key or by value. Try it:
46. Pair RDDs are still RDDs
• Given that pairs is an RDD with the key being an
integer:
•>>> filteredRDD = pairs.filter(lambda x: x[0]>5)
47. Let's do a word count
•>>> rdd = sc.textFile("README.md")
•>>> words = rdd.flatMap(lambda x: x.split(" "))
•>>> result =
words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
48. Let's identify the top words
•>>> sc.textFile("README.md")
.flatMap(lambda x: x.split(" "))
.map(lambda x: (x.lower(),1))
.reduceByKey(lambda x,y: x+y)
.map(lambda x: (x[1],x[0]))
.sortByKey(ascending=False)
.take(5)
51. Grouping data
• On an RDD consisting of keys of type K and values of
type V, we get back an RDD of type [K, Iterable[V]].
• >>> rdd.groupByKey()
• We can group data from multiple RDDs using cogroup().
• Given two RDDs sharing the same key type K, with
respective value types V and W, the resulting RDD is
of type [K, (Iterable[V], Iterable[W])].
• >>> rdd1.cogroup(rdd2)
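To make the result concrete, here is a pure-Python model of what cogroup() computes for the running example pairs (in PySpark the call is simply rdd1.cogroup(rdd2), and the grouped values come back as iterables rather than lists):

```python
# Pure-Python model of cogroup() semantics for the running example.
def cogroup(pairs1, pairs2):
    keys = {k for k, _ in pairs1} | {k for k, _ in pairs2}
    return sorted((k,
                   ([v for kk, v in pairs1 if kk == k],
                    [v for kk, v in pairs2 if kk == k]))
                  for k in keys)

rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9)]
print(cogroup(rdd1, rdd2))   # [(1, ([2], [])), (3, ([4, 6], [9]))]
```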
52. Joins
• There are two types of joins: inner joins and outer
joins.
• Inner joins require a key to be present in both RDDs.
There is a join() call.
• Outer joins do not require a key to be present in both
RDDs. There are leftOuterJoin() and rightOuterJoin() calls.
None is used as the value for the RDD that is missing the
key.
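A pure-Python model of leftOuterJoin() semantics for the running example, showing where None appears (the actual PySpark call is rdd1.leftOuterJoin(rdd2)):

```python
# Pure-Python model of leftOuterJoin(): every key of the left side survives;
# missing right-side values become None.
def left_outer_join(pairs1, pairs2):
    right = {}
    for k, v in pairs2:
        right.setdefault(k, []).append(v)
    out = []
    for k, v in pairs1:
        for w in right.get(k, [None]):
            out.append((k, (v, w)))
    return out

rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9)]
print(left_outer_join(rdd1, rdd2))
# [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
```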
54. Sorting data
• We can sort an RDD with Key/Value pairs provided that
there is an ordering defined on the key.
• Once we have sorted our data, subsequent calls (e.g., collect())
return ordered data.
•>>> rdd.sortByKey(ascending=True,
numPartitions=None, keyfunc=lambda x: str(x))
57. Accumulators
• Accumulators are shared variables.
• They are used to aggregate values from worker nodes
back to the driver program.
• One of the most common uses of accumulators is to
count events that occur during job execution for
debugging purposes.
58. Accumulators
•>>> inputfile = sc.textFile(inputFile)
• ## Let's create an Accumulator[Int] initialized to 0
•>>> blankLines = sc.accumulator(0)
59. Accumulators
•>>> def parseOutAndCount(line):
# Make the global variable accessible
global blankLines
if (line == ""): blankLines += 1
return line.split(" ")
•>>> rdd = inputfile.flatMap(parseOutAndCount)
• Do an action so that the workers do real work!
•>>> rdd.saveAsTextFile(outputDir + "/xyz")
•>>> blankLines.value
60. Accumulators &
Fault Tolerance
• Spark automatically deals with failed or slow machines
by re-executing failed or slow tasks.
• For example, if the node running a partition of a map()
operation crashes, Spark will rerun it on another node.
• If the node does not crash but is simply much slower
than other nodes, Spark can preemptively launch a
“speculative” copy of the task on another node, and
take its result instead if that finishes earlier.
61. Accumulators &
Fault Tolerance
• Even if no nodes fail, Spark may have to rerun a task to
rebuild a cached value that falls out of memory.
“The net result is therefore that the same function may
run multiple times on the same data depending on
what happens on the cluster.”
62. Accumulators &
Fault Tolerance
• For accumulators used in actions, Spark applies each
task’s update to each accumulator only once.
• For accumulators used in RDD transformations
instead of actions, this guarantee does not exist.
• Bottom line: use accumulators only in actions.
63. Broadcast Variables
• Spark’s second type of shared variable, broadcast
variables, allows the program to efficiently send a large,
read-only value to all the worker nodes for use in one
or more Spark operations.
• Use it if your application needs to send a large, read-
only lookup table or a large feature vector in a
machine learning algorithm to all the nodes.
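A sketch of the broadcast pattern with a made-up prefix table; the per-record work each task performs is just a dictionary lookup against the broadcast value:

```python
# On a real cluster, inside the PySpark shell, the pattern is:
#
#   >>> prefixes = sc.broadcast({"1": "US", "44": "UK", "90": "TR"})
#   >>> calls = sc.parallelize(["1-555-0100", "44-20-7946", "90-212-5550"])
#   >>> calls.map(lambda c: prefixes.value.get(c.split("-")[0], "??")).collect()
#
# The per-record work each task performs is just this lookup:
def lookup_country(call, prefix_table):
    return prefix_table.get(call.split("-")[0], "??")

table = {"1": "US", "44": "UK", "90": "TR"}   # hypothetical lookup table
print([lookup_country(c, table)
       for c in ["1-555-0100", "44-20-7946", "90-212-5550"]])
# ['US', 'UK', 'TR']
```

Broadcasting ships the table to each worker once, instead of re-serializing it into every task's closure.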
64. Yahoo SEM Click Data
• Dataset: Yahoo’s Search Marketing Advertiser Bid-
Impression-Click data, version 1.0
• 77,850,272 rows, 8.1GB in total.
• Data fields:
0 day
1 anonymized account_id
2 rank
3 anonymized keyphrase (list of anonymized keywords)
4 avg bid
5 impressions
6 clicks
66. Feeling clicky?
keyphrase                        impressions  clicks
iphone 6 plus for cheap          100          2
new samsung tablet               10           1
iphone 5 refurbished             2            0
learn how to program for iphone  200          0
67. Getting Clicks = Popularity
• Click Through Rate (CTR) = # of clicks / # of impressions
• If CTR > 0, it is a popular keyphrase.
• If CTR == 0, it is an unpopular keyphrase.
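Assuming the rows are tab-separated with the field order listed earlier (impressions at index 5, clicks at index 6), CTR can be computed per line; the separator and sample line are assumptions about the file format:

```python
# CTR from the dataset's field layout: impressions at index 5, clicks at
# index 6 (tab separator assumed).
def ctr(line):
    fields = line.split("\t")
    impressions, clicks = float(fields[5]), float(fields[6])
    return clicks / impressions if impressions > 0 else 0.0

sample = "1\tacct42\t2\tiphone 6 plus for cheap\t0.75\t100\t2"
print(ctr(sample))   # 0.02
# On the cluster: sc.textFile(path).filter(lambda line: ctr(line) > 0)
```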
68. Keyphrase = {terms}
• Given keyphrase “iphone 6 plus for cheap”, its terms are:
iphone
6
plus
for
cheap
70. Clickiness of a term
• For the term presence to click reception contingency
table shown previously, we can compute a given term t’s
clickiness value ct as follows:
• ct = log [ ((s+0.5) / (S-s+0.5)) / ((n-s+0.5) / (N-n-S+s+0.5)) ]
71. Clickiness of a keyphrase
• Given a keyphrase K that consists of terms t1 t2 … tn,
its clickiness can be computed by summing up the
clickiness of the terms present in it.
• That is, cK = ct1 + ct2 + … + ctn
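A minimal pure-Python sketch of this sum, with hypothetical per-term ct values (terms missing from the table contribute 0):

```python
# Sum the clickiness of the terms present in a keyphrase.
def keyphrase_clickiness(keyphrase, ct):
    return sum(ct.get(term, 0.0) for term in keyphrase.split(" "))

# Hypothetical c_t values for a few terms:
ct = {"iphone": 0.8, "cheap": 0.4, "for": -0.1}
print(keyphrase_clickiness("iphone 6 plus for cheap", ct))
# -> approximately 1.1  (0.8 - 0.1 + 0.4; "6" and "plus" contribute 0)
```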
72. Feeling clicky?
keyphrase                        impressions  clicks  clickiness
iphone 6 plus for cheap          100          2       1
new samsung tablet               10           1       1
iphone 5 refurbished             2            0       0
learn how to program for iphone  200          0       0
81. Reducing by Key and
Mapping Values
•>>> t_rdd0 = rdd0.reduceByKey(lambda x,y: x+y)
.mapValues(lambda x: (x+0.5)/(iR-x+0.5))
•>>> t_rdd1 = rdd1.reduceByKey(lambda x,y: x+y)
.mapValues(lambda x: (x+0.5)/(R-x+0.5))
82. Mapping Values
[Figure: t_rdd0 and t_rdd1 listed side by side; each holds
(term, float value) pairs such as (t1, …), (t12, …), (t101, …)]
83. Joining to compute ct
[Figure: the same (term, float value) listings of t_rdd0 and
t_rdd1, now joined on the term key to compute each ct]
88. Spark SQL
• Spark’s interface to work with structured and
semistructured data.
• Structured data is any data that has a schema, i.e., a
known set of fields for each record.
89. Spark SQL
• Spark SQL can load data from a variety of structured
sources (e.g., JSON, Hive and Parquet).
• Spark SQL lets you query the data using SQL both
inside a Spark program and from external tools that
connect to Spark SQL through standard database
connectors (JDBC/ODBC), such as business intelligence
tools like Tableau.
• You can join RDDs and SQL Tables using Spark SQL.
90. Spark SQL
• Spark SQL provides a special type of RDD called
SchemaRDD.
• A SchemaRDD is an RDD of Row objects, each
representing a record.
• A SchemaRDD knows the schema of its rows.
• You can run SQL queries on SchemaRDDs.
• You can create SchemaRDD from external data sources,
from the result of queries, or from regular RDDs.
92. Spark SQL
• Spark SQL can be used via SQLContext or HiveContext.
• SQLContext supports a subset of Spark SQL
functionality excluding Hive support.
• Using HiveContext is recommended.
• If you have an existing Hive installation, you need to
copy your hive-site.xml to Spark’s configuration
directory.
93. Spark SQL
• Spark will create its own Hive metastore (metadata DB)
called metastore_db in your program’s work directory.
• The tables you create will be placed underneath
/user/hive/warehouse on your default file system:
- local FS, or
- HDFS if you have hdfs-site.xml on your classpath.
94. Creating a HiveContext
• >>> ## Assuming that sc is our SparkContext
•>>> from pyspark.sql import HiveContext, Row
•>>> hiveCtx = HiveContext(sc)
95. Basic Query Example
• ## Assume that we have an input JSON file.
•>>> rdd = hiveCtx.jsonFile("reviews_Books.json")
•>>> rdd.registerTempTable("reviews")
•>>> topterms = hiveCtx.sql("SELECT * FROM reviews
LIMIT 10").collect()
96. SchemaRDD
• Both loading data and executing queries return a
SchemaRDD.
• A SchemaRDD is an RDD composed of Row objects
with additional schema information of the types in each
column.
• Row objects are wrappers around arrays of basic types
(e.g., integers and strings).
• In most recent Spark versions, SchemaRDD is renamed
to DataFrame.
97. SchemaRDD
• A SchemaRDD is also an RDD, and you can run regular
RDD transformations (e.g., map(), and filter()) on them
as well.
• You can register any SchemaRDD as a temporary table
to query it via hiveCtx.sql.
98. Working with Row objects
• In Python, you access the ith row element using row[i] or
using the column name as row.column_name.
• Since topterms was collected to the driver, it is a plain
Python list:
•>>> [row.Keyword for row in topterms]
99. Caching
• If you expect to run multiple tasks or queries against the
same data, you can cache it.
•>>> hiveCtx.cacheTable("mysearchterms")
• When caching a table, Spark SQL represents the data in
an in-memory columnar format.
• The cached table will be destroyed once the driver
exits.
101. Converting an RDD to a
SchemaRDD
• First create an RDD of Row objects and then call
inferSchema() on it.
•>>> rdd = sc.parallelize([Row(name="hero",
favouritecoffee="industrial blend")])
•>>> srdd = hiveCtx.inferSchema(rdd)
•>>> srdd.registerTempTable("myschemardd")
102. Working with nested data
•>>> import json
•>>> a = [{'name': 'mickey'}, {'name': 'pluto', 'knows':
{'friends': ['mickey', 'donald']}}]
•>>> rdd = sc.parallelize(a)
•>>> rdd.map(lambda x:
json.dumps(x)).saveAsTextFile("test")
•>>> srdd = sqlContext.jsonFile("test")
110. Text Classification
• Step 1. Start with an RDD of strings representing your
messages.
• Step 2. Run one of MLlib’s feature extraction algorithms
to convert text into numerical features (suitable for
learning algorithms). The result is an RDD of vectors.
• Step 3. Call a classification algorithm (e.g., logistic
regression) on the RDD of vectors. The result is a
model.
111. Text Classification
• Step 4. You can evaluate the model on a test set.
• Step 5. You can use the model for prediction: given
a new data sample, you can classify it using the model.
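Step 2's feature extraction can be made concrete with a pure-Python model of the hashing trick that MLlib's HashingTF implements; the bucket count and example text here are made up:

```python
# A minimal pure-Python model of the HashingTF idea from Step 2: each word
# is hashed into one of a fixed number of buckets, and the feature vector
# counts hits per bucket.
def hashing_tf(words, num_features=16):
    vec = [0.0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1.0
    return vec

features = hashing_tf("O M G GET cheap stuff".split(" "))
print(len(features))   # 16
# In MLlib itself this is pyspark.mllib.feature.HashingTF; Step 3 would then
# pass an RDD of LabeledPoint(label, features) to a trainer such as
# LogisticRegressionWithSGD.train(...).
```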
119. Spam Classification
• ## Let's test it!
•>>> posTest = tf.transform("O M G GET cheap
stuff".split(" "))
•>>> negTest = tf.transform("Enjoy Spark on Machine
Learning".split(" "))
•>>> print model.predict(posTest)
•>>> print model.predict(negTest)
120. Data Types
• MLlib contains a few specific data types located in
pyspark.mllib.
•Vector : a mathematical vector (sparse or dense).
•LabeledPoint : a pair of feature vector and its label.
•Rating : a rating of a product by a user.
• Various Model classes: the resulting model from
training. Each has a predict() function for ad hoc querying.