Vitalii Bondarenko
Data Platform Competency Manager at Eleks
Vitaliy.bondarenko@eleks.com
HDInsight: Spark
Advanced in-memory BigData Analytics with Microsoft Azure
Agenda
● Spark Platform
● Spark Core
● Spark Extensions
● Using HDInsight Spark
About me
Vitalii Bondarenko
Data Platform Competency Manager
Eleks
www.eleks.com
20 years in software development
9+ years of developing for MS SQL Server
3+ years of architecting Big Data Solutions
● DW/BI Architect and Technical Lead
● OLTP DB Performance Tuning
● Big Data Platform Architect
Spark Platform

Spark Stack
● Clustered computing platform
● Designed to be fast and general purpose
● Integrated with distributed systems
● API for Python, Scala, Java, clear and understandable code
● Integrated with Big Data and BI Tools
● Integrated with various databases, systems and libraries such as Cassandra, Kafka and H2O
● First Apache release in 2013; v2.0 was released this month
Map-reduce computations
In-memory map-reduce
Execution Model
Spark Execution
● Shells and standalone applications
● Local and Cluster (Standalone, Yarn, Mesos, Cloud)
Spark Cluster Architecture
● Master / Cluster manager
● Cluster manager allocates resources on nodes
● Master sends app code and tasks to nodes
● Executors run tasks and cache data
Connect to Cluster
● Local
● SparkContext and Master field
● spark://host:7077
● Spark-submit

DEMO: Execution Environments
● Local Spark installation
● Shells and Notebook
● Spark Examples
● HDInsight Spark Cluster
● SSH connection to Spark in Azure
● Jupyter Notebook connected to HDInsight Spark
Spark Core
RDD: resilient distributed dataset
● Fault-tolerant parallelized collections (Hadoop datasets)
● Transformations create new RDDs (filter, map, distinct, union, subtract, etc.)
● Actions run calculations and return results (count, collect, first)
● Transformations are lazy
● Actions trigger the computation of pending transformations
● Broadcast Variables send data to executors
● Accumulators collect data on driver
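# sc is an existing SparkContext (for example, the one provided by the pyspark shell)
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
# union() keeps duplicates, so a line containing both words is counted twice
badLinesRDD = errorsRDD.union(warningsRDD)
# count() is an action and returns an int, so convert it before concatenating
print("Input had " + str(badLinesRDD.count()) + " concerning lines")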
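The laziness of transformations can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: `LazyDataset` is a hypothetical class that only records `filter` and `map` steps and evaluates them when an action such as `count()` or `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source      # any re-iterable collection
        self._steps = steps        # recorded (kind, function) pairs

    def filter(self, predicate):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def _evaluate(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = filter(func, items) if kind == "filter" else map(func, items)
        return items

    def collect(self):
        # Action: triggers evaluation of the recorded steps.
        return list(self._evaluate())

    def count(self):
        # Action: triggers evaluation of the recorded steps.
        return sum(1 for _ in self._evaluate())


calls = []

def is_error(line):
    calls.append(line)             # track when the predicate really runs
    return "error" in line

log_lines = ["ok", "error: disk", "warning: cpu", "error: net"]
errors = LazyDataset(log_lines).filter(is_error)
calls_before_action = len(calls)   # nothing has been evaluated at this point
error_count = errors.count()       # the action runs the filter over all lines
```

Because each action re-runs the recorded steps from the source, calling a second action on `errors` would invoke `is_error` again for every line, which is the cost that persistence is meant to avoid.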
Spark program scenario
● Create RDD (loading external datasets, parallelizing a collection on driver)
● Transform
● Persist intermediate RDDs as results
● Launch actions

Persistence (Caching)
● Avoid recalculations
● 10x faster in-memory
● Fault-tolerant
● Persistence levels
● Persist before first action
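from pyspark import StorageLevel

# sc is an existing SparkContext
input = sc.parallelize(range(1000))
result = input.map(lambda x: x ** x)
# Mark the RDD for caching before the first action
result.persist(StorageLevel.MEMORY_ONLY)
result.count()    # first action computes and caches the RDD
result.collect()  # second action reuses the cached data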
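Why persisting before the first action avoids recalculation can be shown with a small plain-Python model. This is only a sketch and not how Spark implements caching: `ToyDataset` and its `evaluations` counter are hypothetical names introduced for illustration.

```python
class ToyDataset:
    """Toy stand-in for an RDD with an optional in-memory cache."""

    def __init__(self, source, func):
        self._source = source
        self._func = func
        self._persist = False
        self._cache = None
        self.evaluations = 0       # how many times func was applied

    def persist(self):
        # Only marks the dataset; the cache is filled by the first action.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        result = []
        for item in self._source:
            self.evaluations += 1
            result.append(self._func(item))
        if self._persist:
            self._cache = result
        return result

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


def square(x):
    return x * x

uncached = ToyDataset(range(1000), square)
uncached.count()
uncached.collect()                 # recomputes all 1000 elements

cached = ToyDataset(range(1000), square).persist()
cached.count()                     # computes and fills the cache
cached.collect()                   # served from the cache
```

After both runs, `uncached.evaluations` is twice `cached.evaluations`, since the unpersisted dataset is rebuilt for every action.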
Transformations (1)
Transformations (2)
Actions (1)

Actions (2)
Data Partitioning
● userData.join(events)
● userData.partitionBy(100).persist()
● 3-4 partitions per CPU core
● userData.join(events).mapValues(...).reduceByKey(...)
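The benefit of pre-partitioning before a join can be sketched in plain Python. If both datasets are hashed by key with the same partitioner, matching keys land in the same partition and every partition can be joined locally. The helper names `partition_by` and `local_join` and the sample data are assumptions made for this illustration, not Spark functions.

```python
from collections import defaultdict

def partition_by(pairs, num_partitions):
    """Distribute (key, value) pairs into buckets by hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def local_join(left_part, right_part):
    """Inner-join two partitions that were built with the same partitioner."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [(key, (lv, rv)) for key, rv in right_part for lv in index.get(key, [])]

# Hypothetical sample data: (user_id, name) and (user_id, event).
user_data = [(1, "alice"), (2, "bob"), (3, "carol")]
events = [(1, "login"), (3, "click"), (1, "logout"), (4, "click")]

NUM_PARTITIONS = 4
user_parts = partition_by(user_data, NUM_PARTITIONS)   # done once, then reused
event_parts = partition_by(events, NUM_PARTITIONS)

joined = []
for left, right in zip(user_parts, event_parts):
    joined.extend(local_join(left, right))             # no cross-partition lookups
```

In this model the large, rarely changing side (`user_parts`) is partitioned once and kept, so only the smaller side has to be redistributed for each new join.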
DEMO: Spark Core Operations
● Transformations
● Actions
Spark Extensions

Spark Streaming Architecture
● Micro-batch architecture
● StreamingContext
● Batch interval from 500 ms
● Transformations run on the Spark engine
● Output operations instead of actions
● Different sources and outputs
Spark Streaming Example
from pyspark.streaming import StreamingContext

# sc is an existing SparkContext; the batch interval is 1 second
ssc = StreamingContext(sc, 1)
input_stream = ssc.textFileStream("sampleTextDir")
word_pairs = input_stream.flatMap(
    lambda l: l.split(" ")).map(lambda w: (w, 1))
counts = word_pairs.reduceByKey(lambda x, y: x + y)
# PySpark DStreams print with pprint(); print() is the Scala/Java name
counts.pprint()
ssc.start()
ssc.awaitTermination()
● Process RDDs in batches
● Start after ssc.start()
● Output to console on Driver
● Awaiting termination
Streaming on a Cluster
● Receivers with replication
● SparkContext on Driver
● Output from executors in batches: saveAsHadoopFiles()
● spark-submit for creating and scheduling periodic streaming jobs
● Checkpointing for saving results and restoring from a checkpoint: ssc.checkpoint("hdfs://...")
Streaming Transformations
● DStreams
● Stateless transformations
● Stateful transformations
● Windowed transformations
● updateStateByKey
● reduceByWindow, reduceByKeyAndWindow
● Recommended batch size from 10 sec
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
  {(x, y) => x + y}, // Adding elements in the new batches entering the window
  {(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
  Seconds(30), // Window duration
  Seconds(10)) // Slide duration
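The add and inverse functions passed to reduceByKeyAndWindow let the window be updated incrementally instead of being recomputed. A plain-Python sketch of that idea follows. It is a model of the technique and not Spark code; `windowed_counts` and the sample batches are hypothetical, and a 10-second batch interval is assumed so that a 30-second window spans three batches.

```python
from collections import Counter, deque

def windowed_counts(batches, window_batches):
    """Yield per-key counts over a sliding window of the last `window_batches` batches.

    Each new batch is added to the running total and the batch that leaves
    the window is subtracted, so a window is never recomputed from scratch.
    """
    window = deque()
    totals = Counter()
    for batch in batches:
        batch_counts = Counter(batch)
        window.append(batch_counts)
        totals.update(batch_counts)            # "add" function: x + y
        if len(window) > window_batches:
            expired = window.popleft()
            totals.subtract(expired)           # "inverse" function: x - y
            totals = +totals                   # drop keys whose count fell to 0
        yield dict(totals)

# Hypothetical input: IP addresses seen in four consecutive 10-second batches.
ip_batches = [
    ["10.0.0.1", "10.0.0.2"],
    ["10.0.0.1"],
    ["10.0.0.3"],
    ["10.0.0.2", "10.0.0.2"],
]

# A 30-second window over 10-second batches spans 3 batches.
results = list(windowed_counts(ip_batches, window_batches=3))
```

The work per slide is proportional to the size of the entering and leaving batches, not to the size of the whole window, which is why the inverse form pays off for long windows.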

DEMO: Spark Streaming
● Simple streaming with PySpark
Spark SQL
● Spark SQL: an interface for working with structured data using SQL
● Works with Hive tables and HiveQL
● Works with files (JSON, Parquet, etc.) with a defined schema
● JDBC/ODBC connectors for BI tools
● Integrated with Hive and Hive types, uses Hive UDFs
● DataFrame abstraction
Spark DataFrames
● hiveCtx.cacheTable("tableName"): in-memory column store, kept while the driver is alive
● df.show()
● df.select("name", df("age") + 1)
● df.filter(df("age") > 19)
● df.groupBy(df("name")).min()
# Import Spark SQL
from pyspark import SparkContext
from pyspark.sql import HiveContext, Row
# Or if you can't include the hive requirements
from pyspark.sql import SQLContext, Row
sc = SparkContext(...)
hiveCtx = HiveContext(sc)
sqlContext = SQLContext(sc)
# jsonFile() is the Spark 1.x API
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
Catalyst: Query Optimizer
● Analysis: map tables, columns and functions; create a logical plan
● Logical Optimization: applies rules and optimizes the plan
● Physical Planning: physical operators for the logical plan execution
● Cost estimation
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
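What rule-based logical optimization does to a query like this can be sketched with a toy plan tree in plain Python. This is an illustration only and not Catalyst's actual implementation: the tuple-based plan format, the two rules and every function name below are assumptions made for the example.

```python
def scan(table):
    return ("scan", table)

def project(columns, child):
    return ("project", tuple(columns), child)

def filter_eq(column, value, child):
    return ("filter", column, value, child)

def push_filter_below_project(plan):
    # Filter(Project(child)) -> Project(Filter(child)) when the column survives.
    if plan[0] == "filter" and plan[3][0] == "project" and plan[1] in plan[3][1]:
        _, column, value, (_, columns, child) = plan
        return project(columns, filter_eq(column, value, child))
    return plan

def collapse_projects(plan):
    # Project(Project(child)) -> Project(child) when the outer columns are a subset.
    if plan[0] == "project" and plan[2][0] == "project" and set(plan[1]) <= set(plan[2][1]):
        return project(plan[1], plan[2][2])
    return plan

RULES = (push_filter_below_project, collapse_projects)

def optimize(plan):
    """Apply the rules bottom-up until the plan stops changing."""
    while True:
        if plan[0] != "scan":
            plan = plan[:-1] + (optimize(plan[-1]),)   # optimize the child first
        rewritten = plan
        for rule in RULES:
            rewritten = rule(rewritten)
        if rewritten == plan:
            return plan
        plan = rewritten

# Logical plan for: SELECT name FROM (SELECT id, name FROM People) p WHERE p.id = 1
query = project(["name"], filter_eq("id", 1, project(["id", "name"], scan("People"))))
optimized = optimize(query)
```

In the optimized toy plan the filter sits directly on top of the table scan and the two projections are merged into one, so rows are discarded before any column selection happens.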

DEMO: Using SparkSQL
● Simple SparkSQL querying
● Data Frames
● Data exploration with SparkSQL
● Connect from BI
Spark ML
Spark ML
● Classification
● Regression
● Clustering
● Recommendation
● Feature transformation, selection
● Statistics
● Linear algebra
● Data mining tools
Pipeline Components
● DataFrame
● Transformer
● Estimator
● Pipeline
● Parameter
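How these components fit together can be modeled in a few lines of plain Python. The sketch below is not the Spark ML API: rows are plain dictionaries instead of a DataFrame, and `Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline` and `PipelineModel` are simplified, hypothetical classes. It shows the contract only: a Transformer maps data to data, an Estimator is fitted to produce a Transformer, and a Pipeline chains both.

```python
class Tokenizer:
    """Transformer: adds a `words` column derived from `text`."""
    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]

class KeywordModel:
    """Transformer produced by fitting KeywordEstimator."""
    def __init__(self, keywords):
        self.keywords = keywords
    def transform(self, rows):
        return [dict(row, prediction=int(any(w in self.keywords for w in row["words"])))
                for row in rows]

class KeywordEstimator:
    """Estimator: learns which words appear only in positively labelled rows."""
    def fit(self, rows):
        positive = {w for r in rows if r["label"] == 1 for w in r["words"]}
        negative = {w for r in rows if r["label"] == 0 for w in r["words"]}
        return KeywordModel(positive - negative)

class Pipeline:
    """Chains stages; fit() turns every Estimator into a fitted Transformer."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)      # Estimator -> Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)

class PipelineModel:
    """Fitted pipeline: applies each Transformer in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# Hypothetical training data and new rows to score.
training = [
    {"text": "spark is fast", "label": 1},
    {"text": "hadoop is slow", "label": 0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
predictions = model.transform([{"text": "Spark on Azure"}, {"text": "slow disk"}])
```

Keeping fitting and transforming behind the same two methods is what lets a whole pipeline be trained once and then reused on new data as a single object.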
Logistic Regression
DEMO: Spark ML
● Training a model
● Data visualization

New in Spark 2.0
● Unifying DataFrames and Datasets in Scala/Java (compile time syntax and analysis errors). Same performance and convertible.
● SparkSession: a new entry point that supersedes SQLContext and HiveContext.
● Machine learning pipeline persistence
● Distributed algorithms in R
● Faster Optimizer
● Structured Streaming
New in Spark 2.0
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format('socket') \
    .option('host', 'localhost') \
    .option('port', 9999) \
    .load()
# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode('complete') \
    .format('console') \
    .start()
query.awaitTermination()
# Windowed variant: count words per 10-minute window sliding every 5 minutes
# (assumes `words` also carries a `timestamp` column)
windowedCounts = words.groupBy(
    window(words.timestamp, '10 minutes', '5 minutes'),
    words.word
).count()
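With a 10-minute window sliding every 5 minutes, each event falls into two overlapping windows. The plain-Python sketch below shows that assignment. It is an illustration and not Spark code: `windows_for` is a hypothetical helper, timestamps are plain seconds, and windows are assumed to start at multiples of the slide interval.

```python
from collections import Counter

def windows_for(timestamp, window_seconds, slide_seconds):
    """Return the [start, end) windows, in seconds, that contain `timestamp`."""
    last_start = timestamp - (timestamp % slide_seconds)   # latest window start <= timestamp
    windows = []
    start = last_start
    while start > timestamp - window_seconds:
        windows.append((start, start + window_seconds))
        start -= slide_seconds
    return sorted(windows)

# Hypothetical events: (event time in seconds, word).
events = [(420, "spark"), (480, "azure"), (660, "spark")]

windowed_counts = Counter()
for ts, word in events:
    for win in windows_for(ts, window_seconds=600, slide_seconds=300):
        windowed_counts[(win, word)] += 1                   # group by (window, word)
```

An event at second 420 belongs to the windows starting at 0 and at 300, so the word "spark" is counted in both, and the later event at second 660 raises the count only in the windows that cover it.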
HDInsight: Spark
Spark in Azure

HDInsight benefits
● Ease of creating clusters (Azure portal, PowerShell, .Net SDK)
● Ease of use (notebooks, Azure control panels)
● REST APIs (Livy: job server)
● Support for Azure Data Lake Store (adl://)
● Integration with Azure services (EventHub, Kafka)
● Support for R Server (HDInsight R over Spark)
● Integration with IntelliJ IDEA (Plugin, create and submit apps)
● Concurrent Queries (many users and connections)
● Caching on SSDs (SSD as persist method)
● Integration with BI Tools (connectors for PowerBI and Tableau)
● Pre-loaded Anaconda libraries (200 libraries for ML)
● Scalability (change number of nodes and start/stop cluster)
● 24/7 Support (99% up-time)
HDInsight Spark Scenarios
1. Streaming data, IoT and real-time analytics
2. Visual data exploration and interactive analysis (HDFS)
3. Spark with NoSQL (HBase and Azure DocumentDB)
4. Spark with Data Lake
5. Spark with SQL Data Warehouse
6. Machine Learning using R Server, MLlib
7. Putting it all together in a notebook experience
8. Using Excel with Spark
Q&A

More Related Content

What's hot

Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs?
ScyllaDB
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
DataStax Academy
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
Vasil Remeniuk
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
Legacy Typesafe (now Lightbend)
 
Big Data Tools in AWS
Big Data Tools in AWSBig Data Tools in AWS
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
一比一原版(msvu毕业证书)圣文森山大学毕业证如何办理
一比一原版(msvu毕业证书)圣文森山大学毕业证如何办理一比一原版(msvu毕业证书)圣文森山大学毕业证如何办理
一比一原版(msvu毕业证书)圣文森山大学毕业证如何办理
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
@Call @Girls Thiruvananthapuram 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cu...
@Call @Girls Thiruvananthapuram  🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cu...@Call @Girls Thiruvananthapuram  🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cu...
@Call @Girls Thiruvananthapuram 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cu...
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
 
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 

Vitalii Bondarenko. HDInsight: Spark. Advanced in-memory Big Data analytics with Microsoft Azure

  • 1. Vitalii Bondarenko Data Platform Competency Manager at Eleks Vitaliy.bondarenko@eleks.com HDInsight: Spark Advanced in-memory BigData Analytics with Microsoft Azure
  • 2. Agenda ● Spark Platform ● Spark Core ● Spark Extensions ● Using HDInsight Spark
  • 3. About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing for MS SQL Server 3+ years of architecting Big Data Solutions ● DW/BI Architect and Technical Lead ● OLTP DB Performance Tuning ● Big Data Data Platform Architect
  • 5. Spark Stack ● Clustered computing platform ● Designed to be fast and general purpose ● Integrated with distributed systems ● APIs for Python, Scala, and Java; clear and understandable code ● Integrated with Big Data and BI tools ● Integrated with various databases, systems, and libraries such as Cassandra, Kafka, and H2O ● First Apache release in 2013; v2.0 was released this month
  • 8. Execution Model Spark Execution ● Shells and standalone applications ● Local and cluster (Standalone, YARN, Mesos, cloud) Spark Cluster Architecture ● Master / cluster manager ● Cluster manager allocates resources on nodes ● Master sends app code and tasks to nodes ● Executors run tasks and cache data Connect to Cluster ● Local ● SparkContext and master field ● spark://host:7077 ● spark-submit
  • 9. DEMO: Execution Environments ● Local Spark installation ● Shells and Notebook ● Spark Examples ● HDInsight Spark Cluster ● SSH connection to Spark in Azure ● Jupyter Notebook connected to HDInsight Spark
  • 11. RDD: resilient distributed dataset ● Fault-tolerant parallelized collections (and Hadoop datasets) ● Transformations define new RDDs (filter, map, distinct, union, subtract, etc.) ● Actions run computations (count, collect, first) ● Transformations are lazy ● Actions trigger computation of the pending transformations ● Broadcast variables send data to executors ● Accumulators collect data on the driver
      # sc is the SparkContext provided by the shell or notebook
      inputRDD = sc.textFile("log.txt")
      errorsRDD = inputRDD.filter(lambda x: "error" in x)
      warningsRDD = inputRDD.filter(lambda x: "warning" in x)
      badLinesRDD = errorsRDD.union(warningsRDD)
      # count() returns an int, so convert it before concatenating
      print("Input had " + str(badLinesRDD.count()) + " concerning lines")
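The lazy behaviour described on this slide can be illustrated without a cluster. The following is only a plain-Python analogy, not the Spark API: generators stand in for transformations, consuming them stands in for an action, and the names `lazy_filter`, `calls`, and `log_lines` are hypothetical.

```python
# Plain-Python analogy (not the Spark API): a generator plays the role of a
# lazy transformation, and consuming it plays the role of an action.
calls = []  # records every line that a filter actually inspects


def lazy_filter(predicate, lines):
    """Transformation: returns a generator, so no work happens yet."""
    for line in lines:
        calls.append(line)
        if predicate(line):
            yield line


log_lines = ["ok: started", "error: disk full", "warning: slow", "ok: done"]

# Defining the transformations does not touch the data.
errors = lazy_filter(lambda x: "error" in x, log_lines)
warnings = lazy_filter(lambda x: "warning" in x, log_lines)
inspected_before_action = len(calls)

# Action: counting forces both filters to run over the input.
bad_line_count = sum(1 for _ in errors) + sum(1 for _ in warnings)
inspected_after_action = len(calls)

print("Input had " + str(bad_line_count) + " concerning lines")
```

Each filter scans the input separately here, which mirrors why persisting a shared parent RDD can pay off when several actions reuse it.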
  • 12. Spark program scenario ● Create RDD (loading external datasets, parallelizing a collection on driver) ● Transform ● Persist intermediate RDDs as results ● Launch actions
  • 13. Persistence (Caching) ● Avoids recalculations ● 10x faster in memory ● Fault-tolerant ● Persistence levels ● Persist before the first action
      from pyspark import StorageLevel
      # sc is the SparkContext provided by the shell or notebook
      input = sc.parallelize(range(1000))
      result = input.map(lambda x: x ** x)
      # Mark the RDD for caching before the first action runs
      result.persist(StorageLevel.MEMORY_ONLY)
      result.count()    # computes and caches the RDD
      result.collect()  # reuses the cached partitions
  • 18. Data Partitioning ● userData.join(events) ● userData.partitionBy(100).persist() ● 3-4 partitions per CPU core ● userData.join(events).mapValues(...).reduceByKey(...)
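Why pre-partitioning helps a join can be sketched in plain Python. This is an analogy rather than Spark's implementation: `partition_of`, `partition_by`, and the sample data are hypothetical, and 4 partitions are used instead of the 100 on the slide. Once both datasets are split by the same hash of the key, matching keys always share a partition index, so each partition can be joined locally.

```python
# Plain-Python sketch of hash partitioning (an analogy, not Spark's code).
import zlib

NUM_PARTITIONS = 4  # hypothetical; the slide uses partitionBy(100)


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic hash of the key, reduced to a partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions=NUM_PARTITIONS):
    """Group (key, value) pairs into per-partition lists."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


user_data = [("alice", "UA"), ("bob", "US"), ("carol", "DE")]
events = [("bob", "click"), ("alice", "view"), ("bob", "buy")]

user_parts = partition_by(user_data)
event_parts = partition_by(events)

# Join partition i of user_data only with partition i of events.
joined = []
for users, evs in zip(user_parts, event_parts):
    lookup = dict(users)
    joined.extend((k, (lookup[k], e)) for k, e in evs if k in lookup)

print(sorted(joined))
```

In Spark, persisting the partitioned `userData` means the large dataset is laid out once, and only the smaller `events` side needs to move on each join.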
  • 19. DEMO: Spark Core Operations ● Transformations ● Actions
  • 21. Spark Streaming Architecture ● Micro-batch architecture ● Spark StreamingContext ● Batch interval from 500 ms ● Transformations run on the Spark engine ● Output operations instead of actions ● Different sources and outputs
  • 22. Spark Streaming Example ● Process RDDs in batches ● Processing starts after ssc.start() ● Output to console on the driver ● Awaiting termination
      from pyspark.streaming import StreamingContext
      # sc is an existing SparkContext; the batch interval is 1 second
      ssc = StreamingContext(sc, 1)
      input_stream = ssc.textFileStream("sampleTextDir")
      word_pairs = input_stream.flatMap(
          lambda l: l.split(" ")).map(lambda w: (w, 1))
      counts = word_pairs.reduceByKey(lambda x, y: x + y)
      # PySpark DStreams print with pprint(), not print()
      counts.pprint()
      ssc.start()
      ssc.awaitTermination()
  • 23. Streaming on a Cluster ● Receivers with replication ● SparkContext on the driver ● Output from executors in batches: saveAsHadoopFiles() ● spark-submit for creating and scheduling periodic streaming jobs ● Checkpointing for saving results and restoring from that point: ssc.checkpoint("hdfs://...")
  • 24. Streaming Transformations ● DStreams ● Stateless transformations ● Stateful transformations ● Windowed transformations ● updateStateByKey ● reduceByWindow, reduceByKeyAndWindow ● Recommended batch size from 10 sec
      val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
      val ipCountDStream = ipDStream.reduceByKeyAndWindow(
        {(x, y) => x + y}, // Adding elements in the new batches entering the window
        {(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
        Seconds(30),       // Window duration
        Seconds(10))       // Slide duration
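The two functions passed to reduceByKeyAndWindow let the windowed total be updated incrementally instead of being recomputed from every batch in the window. A minimal plain-Python sketch of that idea follows; it is not Spark code, `windowed_sums` and the sample counts are hypothetical, and the window here slides by one batch at a time (three 10-second batches would correspond to the 30-second window on the slide).

```python
from collections import deque


def windowed_sums(batches, window_len, add, subtract):
    """Yield the windowed total after each batch, updating it incrementally."""
    window = deque()
    total = 0
    for batch_value in batches:
        window.append(batch_value)
        total = add(total, batch_value)  # batch entering the window
        if len(window) > window_len:
            total = subtract(total, window.popleft())  # batch leaving the window
        yield total


# Hypothetical per-batch counts for a single key (for example, one IP address).
counts_per_batch = [3, 1, 4, 1, 5]

result = list(
    windowed_sums(
        counts_per_batch,
        3,                   # window length, in batches
        lambda x, y: x + y,  # add function
        lambda x, y: x - y,  # inverse function
    )
)
print(result)
```

The inverse function only makes sense for reversible reductions such as sums; a maximum, for instance, cannot be "un-applied" this way.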
  • 25. DEMO: Spark Streaming ● Simple streaming with PySpark
  • 26. Spark SQL ● Spark SQL: an interface for working with structured data using SQL ● Works with Hive tables and HiveQL ● Works with files (JSON, Parquet, etc.) that have a defined schema ● JDBC/ODBC connectors for BI tools ● Integrated with Hive and Hive types, uses Hive UDFs ● DataFrame abstraction
  • 27. Spark DataFrames ● hiveCtx.cacheTable("tableName"): in-memory column store, kept while the driver is alive ● df.show() ● df.select("name", df("age") + 1) ● df.filter(df("age") > 19) ● df.groupBy(df("name")).min()
      # Import Spark SQL
      from pyspark import SparkContext
      from pyspark.sql import HiveContext, Row
      # Or, if you can't include the Hive requirements
      from pyspark.sql import SQLContext, Row
      sc = SparkContext(...)
      hiveCtx = HiveContext(sc)
      sqlContext = SQLContext(sc)
      input = hiveCtx.jsonFile(inputFile)
      # Register the input schema RDD
      input.registerTempTable("tweets")
      # Select tweets based on the retweet count
      topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM
          tweets ORDER BY retweetCount LIMIT 10""")
  • 28. Catalyst: Query Optimizer ● Analysis: resolves tables, columns, and functions, and creates a logical plan ● Logical optimization: applies rules to optimize the plan ● Physical planning: selects physical operators to execute the logical plan ● Cost estimation
      SELECT name
      FROM (SELECT id, name FROM People) p
      WHERE p.id = 1
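The sample query on this slide is a typical candidate for logical optimization: the filter on `id` can be applied before the inner projection, and the two projections can be merged. The plain-Python sketch below only illustrates that equivalence; it does not use or reproduce Catalyst, and `people`, `naive_plan`, and `optimized_plan` are hypothetical names.

```python
# Hypothetical in-memory stand-in for the People table.
people = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bo", "age": 25},
    {"id": 3, "name": "Cy", "age": 40},
]


def naive_plan(rows):
    """Evaluate the query as written: project, then filter, then project."""
    projected = [{"id": r["id"], "name": r["name"]} for r in rows]
    filtered = [r for r in projected if r["id"] == 1]
    return [r["name"] for r in filtered], len(projected)


def optimized_plan(rows):
    """Filter first, then apply a single merged projection."""
    filtered = [r for r in rows if r["id"] == 1]
    return [r["name"] for r in filtered], len(filtered)


naive_names, naive_rows_projected = naive_plan(people)
optimized_names, optimized_rows_projected = optimized_plan(people)

print(naive_names, naive_rows_projected)
print(optimized_names, optimized_rows_projected)
```

Both plans return the same names, but the rewritten one builds intermediate rows only for the records that survive the filter.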
  • 29. DEMO: Using SparkSQL ● Simple SparkSQL querying ● Data Frames ● Data exploration with SparkSQL ● Connect from BI
  • 30. Spark ML ● Classification ● Regression ● Clustering ● Recommendation ● Feature transformation, selection ● Statistics ● Linear algebra ● Data mining tools Pipeline Components ● DataFrame ● Transformer ● Estimator ● Pipeline ● Parameter
  • 32. DEMO: Spark ML ● Training a model ● Data visualization
  • 33. New in Spark 2.0 ● Unifying DataFrames and Datasets in Scala/Java (compile time syntax and analysis errors). Same performance and convertible. ● SparkSession: a new entry point that supersedes SQLContext and HiveContext. ● Machine learning pipeline persistence ● Distributed algorithms in R ● Faster Optimizer ● Structured Streaming
  • 34. New in Spark 2.0
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import explode, split, window
      # In PySpark, builder is an attribute, not a method
      spark = SparkSession \
          .builder \
          .appName("StructuredNetworkWordCount") \
          .getOrCreate()
      # Create DataFrame representing the stream of input lines from connection to localhost:9999
      lines = spark \
          .readStream \
          .format('socket') \
          .option('host', 'localhost') \
          .option('port', 9999) \
          .load()
      # Split the lines into words
      words = lines.select(
          explode(
              split(lines.value, ' ')
          ).alias('word')
      )
      # Generate running word count
      wordCounts = words.groupBy('word').count()
      # Start running the query that prints the running counts to the console
      query = wordCounts \
          .writeStream \
          .outputMode('complete') \
          .format('console') \
          .start()
      query.awaitTermination()
      # Separate snippet: windowed counts, assuming words also has a timestamp column
      windowedCounts = words.groupBy(
          window(words.timestamp, '10 minutes', '5 minutes'),
          words.word
      ).count()
  • 37. HDInsight benefits ● Ease of creating clusters (Azure portal, PowerShell, .NET SDK) ● Ease of use (notebooks, Azure control panels) ● REST APIs (Livy: job server) ● Support for Azure Data Lake Store (adl://) ● Integration with Azure services (EventHub, Kafka) ● Support for R Server (HDInsight R over Spark) ● Integration with IntelliJ IDEA (plugin; create and submit apps) ● Concurrent queries (many users and connections) ● Caching on SSDs (SSD as a persistence method) ● Integration with BI tools (connectors for Power BI and Tableau) ● Pre-loaded Anaconda libraries (200 libraries for ML) ● Scalability (change the number of nodes and start/stop the cluster) ● 24/7 support (99% uptime)
  • 38. HDInsight Spark Scenarios 1. Streaming data, IoT and real-time analytics 2. Visual data exploration and interactive analysis (HDFS) 3. Spark with NoSQL (HBase and Azure DocumentDB) 4. Spark with Data Lake 5. Spark with SQL Data Warehouse 6. Machine learning using R Server, MLlib 7. Putting it all together in a notebook experience 8. Using Excel with Spark
  • 39. Q&A