This document discusses using Azure HDInsight for big data applications. It provides an overview of HDInsight and describes how it can be used for various big data scenarios like modern data warehousing, advanced analytics, and IoT. It also discusses the architecture and components of HDInsight, how to create and manage HDInsight clusters, and how HDInsight integrates with other Azure services for big data and analytics workloads.
Designing Big Data Analytics Solutions on Azure (Mohamed Tawfik)
This document discusses designing big data analytics solutions on Azure. It provides an overview of Azure's data landscape and common architectural patterns and scenarios for building analytics solutions using various Azure data and analytics services. These include Azure SQL Data Warehouse, Azure Data Lake Store, Azure Data Factory, Azure Machine Learning, and Power BI for reporting and visualization. The document also discusses using these services to build solutions for scenarios like data warehousing, data lakes, ETL/ELT, machine learning, streaming analytics and more.
This document provides an overview of big data and how Azure HDInsight can be used to work with big data. It discusses the evolution of data from gigabytes to exabytes and the big data utility gap where most data is stored but not analyzed. It then discusses how to store everything, analyze anything, and build the right thing using big data. Examples are provided of companies generating large amounts of data. An overview of the Hadoop ecosystem is given along with examples of using Hive and Pig on HDInsight to query and analyze large datasets. A case study of Klout is also summarized.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data-warehouse-as-a-service: a Massively Parallel Processing (MPP) solution for "big data" with true enterprise-class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with features like disaggregated compute and storage that let customers size the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase, allowing for a true SQL experience across structured and unstructured data.
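The MPP idea above can be illustrated with a small sketch. The distribution count, key, and rows below are hypothetical, but the mechanism, hashing a distribution key so each compute node owns a slice of the table, is the one an MPP warehouse like SQL DW relies on (the real service uses 60 distributions):

```python
from zlib import crc32

# Hypothetical sketch: an MPP warehouse spreads rows across compute
# "distributions" by hashing a distribution key, so each node scans
# only its own slice of the data in parallel.
NUM_DISTRIBUTIONS = 4  # the real SQL DW service uses 60

def distribution_for(key: str) -> int:
    """Map a distribution-key value to a compute distribution."""
    return crc32(key.encode()) % NUM_DISTRIBUTIONS

rows = [("cust-1", 120.0), ("cust-2", 75.5), ("cust-3", 10.0), ("cust-1", 30.0)]
slices = {d: [] for d in range(NUM_DISTRIBUTIONS)}
for cust_id, amount in rows:
    slices[distribution_for(cust_id)].append((cust_id, amount))

# All rows for the same key land on the same distribution, so a
# GROUP BY on that key needs no data movement between nodes.
per_node_totals = {}
for d, part in slices.items():
    for cust_id, amount in part:
        per_node_totals[cust_id] = per_node_totals.get(cust_id, 0.0) + amount
```

Because rows sharing a key are co-located, aggregations on the distribution key run entirely node-local; that is the main lever when choosing a distribution column.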
Big Data Analytics in the Cloud with Microsoft Azure (Mark Kromer)
This session covered big data analytics in the cloud using Microsoft Azure services. Key points included:
1) Azure provides tools for collecting, processing, analyzing and visualizing big data including Azure Data Lake, HDInsight, Data Factory, Machine Learning, and Power BI. These services can be used to build solutions for common big data use cases and architectures.
2) U-SQL is a language for preparing, transforming and analyzing data that allows users to focus on the what rather than the how of problems. It uses SQL and C# and can operate on structured and unstructured data.
3) Visual Studio provides an integrated environment for authoring, debugging, and monitoring U-SQL scripts and jobs.
Building a Modern Data Platform with Microsoft Azure (Dmitry Anoshin)
This document provides an overview of building a modern cloud analytics solution using Microsoft Azure. It discusses the role of analytics, a history of cloud computing, and a data warehouse modernization project. Key challenges covered include lack of notifications, logging, self-service BI, and integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
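The external-table idea can be mimicked in miniature. The sketch below is not PolyBase; it uses Python's sqlite3 with a CSV string standing in for a blob/HDFS file, but it shows the same federated pattern: one SQL query spanning relational rows and file-based data:

```python
import csv
import io
import sqlite3

# Hedged analog of PolyBase's external tables: a CSV "file" (standing in
# for data sitting in Hadoop or Azure Blob Storage) is surfaced as a table
# and joined with native relational data in a single SQL statement.
external_csv = "cust_id,region\n1,US\n2,EU\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# The "external table": rows parsed straight from the file format.
conn.execute("CREATE TABLE ext_customers (cust_id INTEGER, region TEXT)")
reader = csv.DictReader(io.StringIO(external_csv))
conn.executemany("INSERT INTO ext_customers VALUES (?, ?)",
                 [(int(r["cust_id"]), r["region"]) for r in reader])

# One query spans both the relational and the "external" data.
totals = conn.execute(
    "SELECT e.region, SUM(s.amount) FROM sales s "
    "JOIN ext_customers e ON s.cust_id = e.cust_id "
    "GROUP BY e.region ORDER BY e.region"
).fetchall()
```

In real PolyBase the file never moves into the engine up front; the point of the sketch is only the query shape, a join that treats file-resident data as just another table.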
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
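The lake loading pattern behind this talk can be sketched without any cloud services. The directory names, event shape, and file names below are invented for illustration; the point is the year=/month=/day= partition layout that engines such as Databricks prune on:

```python
import json
import os
import tempfile
from datetime import date

# Hypothetical sketch: data lakes are commonly laid out as
# table/year=YYYY/month=MM/day=DD/ so a query engine can skip
# ("prune") every partition except the days it actually needs.
root = tempfile.mkdtemp()

def write_event(event: dict, event_date: date) -> str:
    """Append an event to its date partition, creating it on demand."""
    part = os.path.join(root, "events",
                        f"year={event_date.year}",
                        f"month={event_date.month:02d}",
                        f"day={event_date.day:02d}")
    os.makedirs(part, exist_ok=True)
    path = os.path.join(part, "part-0001.json")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return path

def read_day(event_date: date) -> list:
    """Partition pruning: open only the one day's directory."""
    part = os.path.join(root, "events",
                        f"year={event_date.year}",
                        f"month={event_date.month:02d}",
                        f"day={event_date.day:02d}",
                        "part-0001.json")
    with open(part) as f:
        return [json.loads(line) for line in f]

write_event({"user": "a", "action": "click"}, date(2020, 3, 1))
write_event({"user": "b", "action": "view"}, date(2020, 3, 1))
write_event({"user": "a", "action": "view"}, date(2020, 3, 2))
march_first = read_day(date(2020, 3, 1))
```

A real lake would write columnar files (Parquet) rather than JSON lines, but the partition-by-date layout and the prune-at-read pattern are the same.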
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, for developing, training, and deploying models on that data, and for managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Running cost effective big data workloads with Azure Synapse and Azure Data Lake Storage (Michael Rys)
The presentation discusses how to migrate expensive open source big data workloads to Azure and leverage the latest compute and storage innovations within Azure Synapse and Azure Data Lake Storage to develop powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDC Oslo 2020 (Lace Lofranco)
Talk Description:
The Modern Data Warehouse architecture is a response to the emergence of Big Data, Machine Learning and Advanced Analytics. DevOps is a key aspect of successfully operationalising a multi-source Modern Data Warehouse.
While there are many examples of how to build CI/CD pipelines for traditional applications, applying these concepts to Big Data Analytical Pipelines is a relatively new and emerging area. In this demo heavy session, we will see how to apply DevOps principles to an end-to-end Data Pipeline built on the Microsoft Azure Data Platform with technologies such as Data Factory, Databricks, Data Lake Gen2, Azure Synapse, and AzureDevOps.
Resources: https://aka.ms/mdw-dataops
The document discusses Ido Friedman and his background working with various data technologies. It then discusses the concept of a data lake and how it serves as a single store for raw and transformed data used for reporting, analytics, and machine learning. The rest of the document discusses how traditional tools like SQL have changed with the rise of Hadoop and cloud storage. It provides examples of performance and cost differences between running data workloads on Hadoop clusters versus cloud-based data processing services like BigQuery and Dataproc. The document concludes that a large data lake is now possible in the cloud and discusses various deployment options to consider.
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for data scientists and data engineers, in conjunction with Azure services.
Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.
Presented at MSDEVMTL in February 2015.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft's strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products, from collecting data, to transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle Big Data Discovery (Mark Rittman)
Mark Rittman from Rittman Mead presented on Oracle Big Data Discovery. He discussed how many organizations are running big data initiatives involving loading large amounts of raw data into data lakes for analysis. Oracle Big Data Discovery provides a visual interface for exploring, analyzing, and transforming this raw data. It allows users to understand relationships in the data, perform enrichments, and prepare the data for use in tools like Oracle Business Intelligence.
Securing your Big Data Environments in the Cloud (DataWorks Summit)
Big Data tools are becoming a critical part of enterprise architectures and as such securing the data, at rest, and in motion is a necessity. More so, when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. Also, there is a fine balance between implementing enterprise-grade security and negotiating utmost performance given the overheads of encryption and/or identity management.
This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways.
Cloud Security + OSS Security tools are a deadly combination, when it comes to securing your Data Lake.
Spark is fast becoming a critical part of Customer Solutions on Azure. Databricks on Microsoft Azure provides a first-class experience for building and running Spark applications. The Microsoft Azure CAT team engaged with many early adopter customers helping them build their solutions on Azure Databricks.
In this session, we begin by reviewing typical workload patterns, integration with other Azure services like Azure Storage, Azure Data Lake, IoT / Event Hubs, SQL DW, PowerBI etc. Most importantly, we will share real-world tips and learnings that you can take and apply in your Data Engineering / Data Science workloads
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I'll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems, and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Enabling Next Gen Analytics with Azure Data Lake and StreamSets (StreamSets Inc.)
This document discusses enabling next generation analytics with Azure Data Lake. It provides definitions of big data and discusses how big data is a cornerstone of Cortana Intelligence. It also discusses challenges with big data like obtaining skills and determining value. The document then discusses Azure HDInsight and how it provides a cloud Spark and Hadoop service. It also discusses StreamSets and how it can be used for data movement and deployment on Azure VM or local machine. Finally, it discusses a use case of StreamSets at a major bank to move data from on-premise to Azure Data Lake and consolidate migration tools.
This slide deck was presented at #DataOnCloud event New York. DataOnCloud is an invite-only event for CIOs and top IT innovators. DataOnCloud enables key decision makers to discuss about real life adoption scenarios, challenges and best practices for leveraging Big, Small and Line Of Business Data on Cloud.
Aditi Technologies, a 'cloud first' technology services company organized #DataOnCloud, an event series focused on orchestrating data on cloud and navigating the complexity around integration, security, platform selection and technology solutions.
Aditi Technologies partnered with Microsoft for this 2-hour, CXO roundtable event in global technology hubs - London, New York, Seattle and San Diego
Introduces Microsoft's data platform for on-premises and cloud, the challenges businesses face with data and data sources, and the evolution of database systems in the modern world: what businesses are doing with their data and what their new needs are as industry landscapes change.
Dives into the opportunities available for businesses and industry verticals: the ones that have already been identified and the ones not yet explored.
Explains Microsoft's cloud vision and what the Microsoft Azure platform offers, as Infrastructure as a Service or Platform as a Service, for building your own offerings.
Introduces and demos real-world scenarios and case studies where businesses have used the cloud/Azure to create new and innovative solutions that unlock this potential.
Build Big Data Enterprise solutions faster on Azure HDInsight (DataWorks Summit)
Hadoop and Spark are big data frameworks used to extract useful insights from data. Big data solutions span a variety of scenarios, from ingestion and data prep to data management, processing, analysis, and visualization, and each step requires specialized toolsets to be productive. In this talk I will share solution examples from the big data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku, running on Microsoft's Azure HDInsight, that simplify your big data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; combining it with these tools gives you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
This document discusses strategies for migrating existing enterprise IT solutions to the cloud. It begins by outlining the typical adoption stages companies go through with new technologies like virtualization and cloud computing. It then provides examples of how companies like Shell, GE, Dole Foods, and the New York Times have benefited from migrating applications and workloads to AWS. Finally, it discusses additional AWS services and solutions that can help companies at various stages of their cloud migration journey.
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
The document discusses challenges facing today's enterprises such as cutting costs, driving value with tight budgets, maintaining security while increasing access, and finding the right transformative capabilities. It then discusses challenges in building applications related to scaling, availability, and costs. The remainder summarizes Microsoft's Windows Azure cloud computing platform, how it addresses these challenges, example use cases, and pricing models.
SendGrid Improves Email Delivery with Hybrid Data Warehousing (Amazon Web Services)
When you received your Uber ‘Tuesday Evening Ride Receipt’ or Spotify’s ‘This Week’s New Music’ email, did you think about how they got there?
SendGrid's reliable email platform delivers over 20 billion transactional and marketing emails each month on behalf of many of your favorite brands, including Uber, Airbnb, Spotify, Foursquare, and NextDoor.
SendGrid was looking to evolve its data warehouse architecture in order to improve decision making and optimize customer experience. They needed a scalable and reliable architecture that would allow them to move nimbly and efficiently with a relatively small IT organization, while supporting the needs of both business and technical users at SendGrid.
SendGrid’s Director of Enterprise Data Operations will be joining architects from Amazon Web Services (AWS) and Informatica to discuss SendGrid’s journey to a hybrid cloud architecture and how a hybrid data warehousing solution is optimized to support SendGrid’s analytics initiative. Speakers will also review common technologies and use cases being deployed in hybrid cloud today, common data management challenges in hybrid cloud and best practices for addressing these challenges.
Join us to learn:
• How to evolve to a hybrid data warehouse with Amazon Redshift for scalability, agility and cost efficiency with minimal IT resources
• Hybrid cloud data management use cases
• Best practices for addressing hybrid cloud data management challenges
The document provides an overview of leading big data companies in 2021 and the Apache Hadoop stack, including related Apache software and the NIST big data reference architecture. It lists over 50 big data companies, including Accenture, Actian, Aerospike, Alluxio, Amazon Web Services, Cambridge Semantics, Cloudera, Cloudian, Cockroach Labs, Collibra, Couchbase, Databricks, DataKitchen, DataStax, Denodo, Dremio, Franz, Gigaspaces, Google Cloud, GridGain, HPE, HVR, IBM, Immuta, InfluxData, Informatica, IRI, MariaDB, Matillion, Melissa Data
The document discusses challenges facing today's enterprises including cutting costs, driving value with tight budgets, maintaining security while increasing access, and finding the right transformative capabilities. It then discusses challenges in building applications such as scaling, availability, and costs. The document introduces the Windows Azure platform as a solution, highlighting its fundamentals of scale, automation, high availability, and multi-tenancy. It provides considerations for using cloud computing on or off premises and discusses ownership models.
Caserta Concepts, Datameer, and Microsoft shared their combined knowledge and a use case on big data, the cloud, and deep analytics. Attendees learned how a global leader in the test, measurement, and control systems market reduced their big data implementations from 18 months to just a few.
Speakers shared how to provide a business user-friendly, self-service environment for data discovery and analytics, and focus on how to extend and optimize Hadoop based analytics, highlighting the advantages and practical applications of deploying on the cloud for enhanced performance, scalability and lower TCO.
Agenda included:
- Pizza and Networking
- Joe Caserta, President, Caserta Concepts - Why are we here?
- Nikhil Kumar, Sr. Solutions Engineer, Datameer - Solution use cases and technical demonstration
- Stefan Groschupf, CEO & Chairman, Datameer - The evolving Hadoop-based analytics trends and the role of cloud computing
- James Serra, Data Platform Solution Architect, Microsoft - Benefits of the Azure Cloud Service
- Q&A, Networking
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
The cloud is all the rage. Does it live up to its hype? What are the benefits of the cloud? Join me as I discuss the reasons so many companies are moving to the cloud and demo how to get up and running with a VM (IaaS) and a database (PaaS) in Azure. See why the ability to scale easily, the quickness with which you can create a VM, and the built-in redundancy are just some of the reasons that make moving to the cloud a "no-brainer". And if you have an on-prem datacenter, learn how to get out of the air-conditioning business!
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (Informatica)
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
Azure Data Explorer deep dive - review 04.2020 (Riccardo Zamana)
Modern Data Science Lifecycle with ADX & Azure
This document discusses using Azure Data Explorer (ADX) for data science workflows. ADX is a fully managed analytics service for real-time analysis of streaming data. It allows for ad-hoc querying of data using Kusto Query Language (KQL) and integrates with various Azure data ingestion sources. The document provides an overview of the ADX architecture and compares it to other time series databases. It also covers best practices for ingesting data, visualizing results, and automating workflows using tools like Azure Data Factory.
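ADX's summarize-by-time-bin style of query can be approximated in plain Python. This is a conceptual analog of KQL's `bin()` function, not ADX itself, and the timestamps below are made up:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hedged analog of KQL's `summarize count() by bin(timestamp, 5m)`:
# streaming records are grouped into fixed-width time bins, which is
# the backbone of real-time charts over event data.
def bin_of(ts: datetime, width: timedelta) -> datetime:
    """Round a timestamp down to the start of its bin."""
    seconds = int((ts - datetime.min) / timedelta(seconds=1))
    width_s = int(width / timedelta(seconds=1))
    return datetime.min + timedelta(seconds=seconds - seconds % width_s)

events = [datetime(2020, 4, 1, 10, 1), datetime(2020, 4, 1, 10, 4),
          datetime(2020, 4, 1, 10, 7)]
counts = defaultdict(int)
for ts in events:
    counts[bin_of(ts, timedelta(minutes=5))] += 1
```

Two events fall in the 10:00 bin and one in the 10:05 bin; a time-series engine does the same grouping, just incrementally and over millions of rows.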
Cloud computing adoption in SAP technologies (sveldanda)
Cloud computing is emerging as an exciting trend in ICT, and with this presentation we explore opportunities for adopting cloud computing in SAP technologies.
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud (Certus Solutions)
Snowflake is a cloud data warehouse that provides elasticity, scalability, and simplicity. It allows organizations to consolidate their diverse data sources in one place and instantly scale up or down their compute capacity as needed. Aptus Health, a digital marketing company, used Snowflake to break down data silos, integrate disparate data sources, enable broad data sharing, and provide a scalable and cost-effective solution to meet their analytics needs. Snowflake addressed both business needs for timely access to centralized data and IT needs for flexibility, extensibility, and reducing ETL work.
1) Cloud computing allows you to pay for infrastructure as needed rather than upfront, which can lower costs. AWS passes these savings to customers in the form of low prices.
2) AWS provides a variety of compute, storage, database, analytics and other services that can be used to build applications. Popular services include EC2, S3, DynamoDB, and EMR.
3) There are a number of strategies for using AWS, such as using it for development/testing, building new apps, augmenting existing apps, hybrid apps, and full migration. Existing tools can often be used to manage AWS resources.
ASP.NET Identity is a membership system that provides authentication, user management, and claims-based authorization to ASP.NET applications. It allows for non-SQL persistence, extensible authentication separate from membership, and full support for asynchronous programming. Key features include login, roles, profiles, claims, user and role management, external logins, and security features like two-factor authentication and account lockout. It has a modular architecture with classes like UserManager and UserStore that separate concerns.
This document discusses ASP.NET identity and security. It provides an overview of ASP.NET identity, which provides a unified authentication experience for ASP.NET applications both on-premises and in the cloud. It also discusses ASP.NET security using OWIN and integrating with Windows Azure Active Directory for authentication with organizational accounts. The presentation includes demos of ASP.NET identity features, social login, and using Azure AD for single sign-on and multi-tenant applications.
This document discusses ASP.NET SignalR, a framework for adding real-time web functionality to applications. SignalR can be used to enable continuously updated data and real-time features. It works across different browsers and devices by utilizing multiple transport mechanisms like websockets, server-sent events, and long polling. SignalR provides a simple programming model and supports high throughput, large-scale applications through its ability to scale out to multiple servers.
The document discusses a presentation given at the DevIntersection Conference about new features in ASP.NET Web Forms 4.5. The presentation covered modern standards, customizability, extensibility, patterns, cleaner code, mobile support, and a mystery feature. Additional resources were provided, including links to blogs and websites about ASP.NET Web Forms and a video about advanced ASP.NET.
This document discusses modern web forms, highlighting that they use modern standards, are customizable and extensible, follow best practice patterns, result in cleaner code, support mobile devices, and may include surprises. It also requests feedback by asking readers to fill out a session evaluation form at the conference registration desk.
This document summarizes new features and enhancements in ASP.NET Web Forms 4.5 and Visual Studio 11. It highlights improvements to cleaner markup, validation, routing, and model binding. It encourages developers to download betas of Visual Studio 11 and new templates to start using these features and get involved by sharing ideas, reporting bugs, or following blogs and forums.
This document discusses optimizing ASP.NET web applications for standards and performance. It covers developing and debugging efficiently using tools like Visual Studio and the page inspector. Optimization topics include minimizing file sizes using techniques like CSS/JS minification and image optimization. Real-time web capabilities like polling, long polling and web sockets are presented. ASP.NET 4.5 improvements for application start up time and memory usage are highlighted. Ensuring high performance web servers requires considering factors like IIS configuration and hosting environment.
How We Implemented "Exactly Once" Semantics in Our Database (javier ramirez)
Distributed systems are hard. High-performance distributed systems, even more so. Network latency, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers hoping the system on the other side tolerates duplicates.
QuestDB is an open source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, deduplicating and also allowing upserts on real-time data, while adding only 8% of processing time, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. And of course, all of this comes with demos, so you can see how it works in practice.
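The ingest-time deduplication being described can be reduced to a tiny sketch. This is a conceptual analog in Python, not QuestDB's implementation: rows are keyed by (timestamp, series key), and a duplicate arrival upserts rather than appends:

```python
# Hedged sketch of ingest-time deduplication: when a row arrives with
# the same (timestamp, key) as an existing row, its values overwrite
# the old ones, so an "at least once" producer that retries whole
# batches still yields "exactly once" results in the table.
def ingest(table: dict, rows: list) -> None:
    for ts, sensor, value in rows:
        table[(ts, sensor)] = value  # duplicate keys overwrite: an upsert

table = {}
batch = [(1, "s1", 20.0), (2, "s1", 21.0)]
ingest(table, batch)
ingest(table, batch)               # the producer times out and retries
ingest(table, [(2, "s1", 21.5)])   # a late correction upserts in place
```

The retried batch leaves the table unchanged, and the correction replaces one row; that is exactly the duplicate-tolerant behavior the talk describes sender-side retries relying on.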
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second, manage petabytes of data, and scale relational database workloads on Aurora beyond the limits of a single Aurora writer instance, without creating custom application logic or managing multiple databases.
Applications of Data Science in Various Industries (IABAC)
This presentation covers the wide-ranging applications of data science across industries. From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights. Learn how data science enhances decision-making, boosts productivity, and fosters new advancements in technology and business, with real-world examples of data science applications today.
[D2T2S04] Generative AI Foundation Model Training and Tuning using SageMaker (Donghwan Lee)
This session presents how to pre-train or fine-tune a foundation model using SageMaker Training Jobs / SageMaker JumpStart. Three topics are covered:
1. Training a foundation model from scratch
2. Pre-training a foundation model using open source models
3. Fine-tuning a model for a specific domain
Speakers:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
How We Added Replication to QuestDB - JonTheBeach (javier ramirez)
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
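The replication scheme described here can be sketched as a log shipped from primary to replica. The classes below are hypothetical Python, not QuestDB code; they show only the core invariant, that a replica applying WAL entries strictly in sequence converges to the primary's state:

```python
import threading

# Hypothetical sketch of WAL-based replication: the primary appends every
# write to a log; a replica replays entries strictly in sequence order,
# so it converges to the primary's state without coordinating on each write.
class Primary:
    def __init__(self):
        self.wal = []            # (sequence_number, key, value)
        self.state = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:         # one writer appends to the log at a time
            self.wal.append((len(self.wal), key, value))
            self.state[key] = value

class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0         # how far into the WAL we have replayed

    def catch_up(self, wal):
        for seq, key, value in wal[self.applied:]:
            assert seq == self.applied  # entries apply strictly in order
            self.state[key] = value
            self.applied += 1

primary, replica = Primary(), Replica()
primary.write("temp", 20)
primary.write("temp", 21)
replica.catch_up(primary.wal)
```

Because replay is append-only and ordered, replicas never slow down the primary's ingest path; they simply pull and apply the log at their own pace, which is what makes an unbounded number of read-replicas feasible.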
1. Building big data applications on Azure
Pranav Rastogi / Bharath Sreenivas, Microsoft
pranav.rastogi@microsoft.com
@rustd / @bharathbs
3. Reason over any data, anywhere, with security and privacy and flexibility of choice.
Data estates: data warehouses, data lakes, operational databases, hybrid.
Data sources: LOB, CRM, graph, social, image, IoT.
6. Solution scenarios
Three scenarios that take optimal advantage of big data:
- Modern DW: "We want to incorporate all of our data, including 'big data', with our data warehouse."
- Advanced Analytics: "We are trying to predict when our customers churn."
- Internet of Things (IoT): "We are trying to get insights from our devices in real time."
7. Governance and
Master Data Management
Azure SQL Data Warehouse
Data Quality and
Lineage
ERP, CRM,
and other
LOB Data
OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors, Social, Weather, and other unstructured data
ETL
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
Azure Analysis
Services
BI Models
Power BI
Reports and
Dashboards
Polybase
Analyst
Power User
Data Engineer
Data Scientist
Big Data Warehouse
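The Polybase arrow in this diagram is what lets an analyst query warehouse tables and lake-resident data as one. As a toy illustration only (plain Python dictionaries standing in for external tables; none of these names come from the deck), the idea reduces to a join across the two stores:

```python
# Structured rows already in the warehouse (stand-in for a SQL DW table)
warehouse_sales = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 90.0},
]
# Semi-structured rows sitting in the lake (stand-in for clickstream files)
lake_clicks = [
    {"customer_id": 1, "clicks": 40},
    {"customer_id": 2, "clicks": 7},
]

# One "query surface" over both stores: join on customer_id
clicks_by_customer = {r["customer_id"]: r["clicks"] for r in lake_clicks}
report = [
    {**row, "clicks": clicks_by_customer.get(row["customer_id"], 0)}
    for row in warehouse_sales
]
print(report)  # each warehouse row enriched with its click count
```

In the real pipeline this join is expressed in T-SQL over a Polybase external table, with SQL Data Warehouse scanning the files in Blob Storage or Data Lake Store directly.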
8. OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors, Social, Weather, and other unstructured data
REPL and
Machine
Learning Tools
Data
Wrangling
Tools
Data Engineer Data Scientist
Deep Learning
& Cognitive
Services
Azure
Cosmos DB
Apps
Automated
Systems
People
Web
Mobile
Bots
ML Models
and Scoring
APIs
Advanced Analytics and AI
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
9. Azure Stream Analytics / Spark Streaming
Clean,
Curate,
Aggregate
Combine
reference
data
Perform
Scoring from
ML models
IoT Sensors
and/or
User
activity
streams
Social,
Trends,
Weather
etc.
Clickstream,
Batch Files,
server logs,
Images,
videos, and
other
unstructured
data
Azure Event Hubs,
Apache Kafka
Event
Broker/Buffer
Queue
Event
Broker
Power BI
Realtime
Dashboards
Analyst
Data Engineer
Data Scientist
Azure ML / R
Trained Machine
Learning Models
Azure SQL DB /
Cosmos DB
Reference Data
Automated
Systems
Realtime Processing with Lambda Architecture
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
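The "Clean, Curate, Aggregate" step in the streaming path is typically a windowed computation. A minimal sketch, assuming a tumbling window and a made-up (timestamp, sensor, value) event shape, neither of which comes from the slide:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Group (epoch_seconds, sensor_id, value) events into fixed,
    non-overlapping windows and average each sensor per window."""
    windows = defaultdict(list)
    for ts, sensor, value in events:
        window_start = ts - (ts % window_seconds)
        windows[(window_start, sensor)].append(value)
    return {k: sum(v) / len(v) for k, v in windows.items()}

# Simulated IoT sensor stream (as it might arrive via Event Hubs/Kafka)
events = [
    (0, "temp-1", 20.0), (5, "temp-1", 22.0),    # window [0, 10)
    (12, "temp-1", 30.0), (17, "temp-1", 34.0),  # window [10, 20)
]
averages = tumbling_window_avg(events, window_seconds=10)
print(averages[(0, "temp-1")])   # 21.0
print(averages[(10, "temp-1")])  # 32.0
```

Stream Analytics and Spark Streaming express the same idea declaratively (e.g. a tumbling-window GROUP BY) instead of by hand.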
10. Advanced analytics and big data impact all verticals
Heartland Bank prevents fraud and boosts profits
The UK NHS transforms healthcare with faster access to information.
City of Barcelona boosts citizen engagement with an intelligent app
Jet.com transforms customer engagement with a truly personalized experience
Rolls Royce decreases costs with predictive maintenance
Rolls Royce decreases costs with
Predictive Maintenance
Manufacturing
Eliminate downtime and
increase efficiency by enabling
better predictive maintenance
for your capital assets.
Banking
Minimize losses with more
accurate fraud detection and
assess exposure to asset,
credit and market risk using a
holistic approach
Boost operational efficiency and improve the patient care experience with intelligent detection and in-time service.
Healthcare Government
Empower citizens and
improve their engagement
with relevant information and
personalized citizen services.
Retail
Turn individual customer
interactions into contextual
engagements and increase
customer satisfaction with highly
personalized offers and content
12. Managed Open Source Analytics for the
cloud with a 99.9% SLA.
100% Open Source
Clusters up and running in minutes
63% lower TCO than deploying your own Hadoop on-premises
Separation of compute and storage allows you to scale clusters independently and dramatically reduce costs
Open Source Analytics for the Enterprise
13. Big data is hard
Buy
Servers
Install
OSS
Secure
Configure
Optimize
Debug
Success
Scale up
14. HDInsight makes it easy
Provide
Cluster
details
HDInsight
Cluster
100% open source
Optimized
Highly available
Secure
Scalable
Dedicated
Managed
Certified ISVs
Customizable
Browse to
Azure Portal
15. Multi Region Availability
Available in >25 regions world-wide
Launched most recently in the US West 2 and UK regions
Available in China, Europe and US Government clouds
Deploy Globally Within Minutes
16. Perimeter Level Security
Virtual Networks
Network Security Groups (firewalls)
Authentication
Azure Active Directory
Kerberos authentication
Authorization
Apache Ranger
RBAC for Admin
POSIX ACLs for Data Plane Data Security
Server-Side encryption at rest
HTTPS/TLS In-transit
Security and Compliance to Enable OSS for Enterprises
17. Plugins for HDInsight are available for the most popular IDEs, enabling agile development and debugging
Rich support for the powerful notebooks used by data scientists
Develop in C# and deploy on Linux (Java) via the SCP.NET technology developed for HDInsight
Remote debugging for Spark jobs
Rich Developer Ecosystem
18. Recognized by
Top Analysts
Forrester Wave for Big Data
Hadoop Cloud
• Named industry leader by
Forrester with the most
comprehensive, scalable, and
integrated platforms*
• Recognized for its cloud-first
strategy that is paying off*
*The Forrester Wave™: Big Data Hadoop Cloud Solutions, Q2 2016.
19. Products and Services Organization Size Industry Country Business Need
Simplified pricing process
now takes minutes instead
of days
Competitive pricing, product demand, the costs of materials, gas and
labor, and the thousands of other market variables affect product cost
and customer demand for products or services around the world. It’s
why accurate and profitable pricing represents one of the most
difficult business challenges for many companies. Manufacturing,
distribution, services, and airline companies look to the science and
technology provided by PROS to keep their pricing accurate,
competitive, and profitable. The PROS Guidance product runs
enormously complex pricing calculations based on variables that
comprise multiple terabytes of data. To handle this calculation
complexity and data volume, and then deliver specific results to its
clients quickly, PROS built its services on top of Azure HDInsight.
Products and Services: Microsoft Azure, Azure HDInsight, Apache Spark for Azure HDInsight
Organization Size: 1,000
Industry: Other (unsegmented)
Country: United States
Business Need: Pricing Software-as-a-Service
20. HDInsight architecture
Hive meta store
Azure SQL database
Azure Storage or
Data Lake Store
Client
machines
HDInsight cluster
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
21. Scale compute & storage independently
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
Azure Blob Storage
or
Azure Data Lake
Store
22. Persist & reuse your data
Your data lives outside the HDInsight cluster, so it is persisted even if you drop and recreate the cluster.
You can create multiple clusters that point to the same storage.
Azure Blob Storage
or
Azure Data Lake
Store
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
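The persistence claim above can be pictured with two throwaway Python classes (hypothetical stand-ins, not an Azure SDK): the storage object outlives any cluster attached to it.

```python
class Storage:
    """Stand-in for Azure Blob Storage / Data Lake Store: it exists
    independently of any compute cluster."""
    def __init__(self):
        self.blobs = {}

class Cluster:
    """Stand-in for an HDInsight cluster: compute only, attached to
    external storage it does not own."""
    def __init__(self, storage):
        self.storage = storage
    def write(self, name, data):
        self.storage.blobs[name] = data
    def read(self, name):
        return self.storage.blobs[name]

store = Storage()
etl_cluster = Cluster(store)
etl_cluster.write("events/2017-01-01.csv", "id,value\n1,42")
del etl_cluster                      # drop the cluster...
analytics_cluster = Cluster(store)   # ...recreate, pointing at the same storage
print(analytics_cluster.read("events/2017-01-01.csv"))  # data persisted
```

The same separation is what lets several clusters (e.g. an ETL cluster and an interactive cluster) share one storage account concurrently.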
27. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
35. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Kafka
Event Hub
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
41. MapReduce: each step reads from and writes to HDFS
Step 1 ("mapper"): reads from HDFS, writes to HDFS
Step 2 ("reducer"): reads from HDFS, writes to HDFS
Latency to read 1 MB sequentially:
From disk: 20,000,000 ns
From SSD: 1,000,000 ns
From memory: 250,000 ns
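Those three numbers explain most of Spark's advantage over disk-based MapReduce; a quick check of the ratios, using the figures from the slide:

```python
# Latency (ns) to read 1 MB sequentially, as quoted on the slide
DISK_NS = 20_000_000
SSD_NS = 1_000_000
MEMORY_NS = 250_000

print(DISK_NS // MEMORY_NS)  # 80: memory is 80x faster than disk
print(SSD_NS // MEMORY_NS)   # 4: memory is 4x faster than SSD
print(DISK_NS // SSD_NS)     # 20: SSD is 20x faster than disk
```

So a job that shares intermediate data through memory rather than HDFS-on-disk avoids a roughly 80x penalty on every read of that data.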
44. val file = spark.textFile("wasb://...")
val errors = file.filter(line => line.contains("ERROR"))
// Cache errors
errors.cache()
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
55. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
57. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
Power BI
real-time
dashboard
Kafka
Event Hub
68. Phone Tracking Across Cell Sites
Connected Car - Remote
Management & Diagnostics
Asset Tracking
Fleet Management
Facilities Management
Personnel Tracking & Crowd
Control
Ride Sharing
Geofencing
Racecar Telemetry
Connected Manufacturing
and many more…
69. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Consumption
(Ingestion)
Data Processing
Presentation/Serving
Layer
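The Ingest → Prepare → Analyze → Publish flow above is, at its core, a composition of stages. A minimal sketch with hypothetical stage functions (the "sensor,value" record format is invented for illustration):

```python
def ingest(raw_lines):
    # Parse raw "sensor,value" records into fields
    return [line.split(",") for line in raw_lines]

def prepare(records):
    # Normalize/clean: drop malformed rows, cast values to float
    return [(s, float(v)) for s, v in records if v.strip()]

def analyze(records):
    # Simple stat analysis: max reading per sensor
    result = {}
    for sensor, value in records:
        result[sensor] = max(result.get(sensor, float("-inf")), value)
    return result

def publish(insights):
    # Shape results for BI/visualization consumption
    return [{"sensor": s, "max": m} for s, m in sorted(insights.items())]

raw = ["temp-1,20.5", "temp-2,18.0", "temp-1,25.0"]
print(publish(analyze(prepare(ingest(raw)))))
# [{'sensor': 'temp-1', 'max': 25.0}, {'sensor': 'temp-2', 'max': 18.0}]
```

In the Azure mapping that follows, each stage is a service boundary (Event Hubs → HDInsight/Data Lake Analytics → Cosmos DB/SQL DW → Power BI) rather than a function call.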
70. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Processing
REALTIME ANALYTICS
INTERACTIVE ANALYTICS
BATCH ANALYTICS
Machine Learning
(Spark + Azure ML)
(Failure and RCA
Predictions)
HDI + ISVs
OLAP for Data
Warehousing
HDI Custom ETL
Aggregate /Partition
PowerBI
dashboard
(Shared with field
Ops, customers,
MIS, and Engineers)
Realtime Machine Learning
(Anomaly Detection)
CosmosDB
Interactive HDInsight clusters
BIG DATA STORAGE ANALYTICS
Big Data Storage
Azure Data
Lake Store
CosmosDB Azure Blob
Storage
Data Scientists,
BI Analysts
Big Data Applications
80. Microsoft Databus (Siphon) Usage
8 million events per second peak ingress
800 TB (10 GB per second) ingress per day
1,800 production Kafka brokers; 450 topics
15 sec 99th percentile latency
KEY CUSTOMER SCENARIOS
Ads Monetization (Fast BI)
O365 Customer Fabric NRT – Tenant & User insights
BingNRT Operational Intelligence
Presto (Fast SML) interactive analysis
Delve Analytics
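The headline numbers are mutually consistent; a quick sanity check that 800 TB per day matches the quoted 10 GB per second (using decimal units):

```python
TB_PER_DAY = 800
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
GB_PER_DAY = TB_PER_DAY * 1000   # decimal TB -> GB

gb_per_second = GB_PER_DAY / SECONDS_PER_DAY
print(round(gb_per_second, 2))   # ~9.26 GB/s, i.e. roughly the quoted 10 GB/s
```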
[Chart: Siphon Data Volume (Ingress and Egress), throughput in GBps, Jan 2015 through Dec 2016; volume published (GBps) vs. volume subscribed (GBps)]
[Chart: Siphon Events per second (Ingress and Egress), throughput in millions of events per second, Jan 2015 through Dec 2016; EPS in vs. EPS out]
81. Asia DC
Zookeeper Canary
Kafka
Collector
Agent
Services Data Pull (Agent)
Services Data Push
Device Proxy Services
Consumer
API (Push/
Pull)
Europe DC
Zookeeper Canary
Kafka
US DC
Zookeeper Canary
Kafka
Streaming
Batch
Audit Trail
Open Source
Microsoft Internal
Siphon
86. Tool / Purpose
Ambari: dashboard for monitoring the health and status of the Hadoop cluster
Yarn UI: monitor Yarn applications and logs
Tez View: track and debug the execution of jobs
Grafana: workload-specific JMX metrics
Spark History Server: displays both completed and incomplete Spark jobs
HMaster UI: web-based user interface for monitoring your HBase cluster
Visual Studio / VS Code: monitor job status with Data Lake tools; remote debugging for Spark jobs
89. OMS Agent for
Linux
HDInsight nodes (Head, Worker, Zookeeper)
FluentD
HDInsight
plugin
1. Plugin for ‘in_tail’ for all Logs, allows
regexp to create JSON object
2. Filter for WARN and above for each
Log Type. `grep` filter plugin
3. Output to out_oms_api Type
4. Exec plugin for Metrics
omsconfig: per-workload config for HBase, Spark, Hive, Storm, and Kafka
Log Analytics(OMS) Service
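Steps 1 and 2 above (tail logs through a regexp into a JSON object, then keep WARN and above) can be sketched as follows; the log line format and severity ordering are assumptions for illustration, not the actual FluentD plugin code:

```python
import json
import re

# Assumed log4j-style line: "2017-06-01 12:00:00 WARN  Component: message"
LOG_PATTERN = re.compile(
    r"(?P<time>\S+ \S+)\s+(?P<level>\w+)\s+(?P<source>\w+): (?P<message>.*)")
SEVERITY = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

def parse_and_filter(lines, min_level="WARN"):
    """Mimic in_tail + grep filter: regexp -> JSON object, keep WARN and up."""
    out = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and SEVERITY[m.group("level")] >= SEVERITY[min_level]:
            out.append(json.dumps(m.groupdict()))
    return out

logs = [
    "2017-06-01 12:00:00 INFO  NodeManager: heartbeat ok",
    "2017-06-01 12:00:05 WARN  DataNode: slow block transfer",
    "2017-06-01 12:00:09 ERROR NameNode: replication below target",
]
for record in parse_and_filter(logs):
    print(record)  # two JSON records: the WARN and the ERROR lines
```

The real pipeline then ships these JSON records to the Log Analytics (OMS) ingestion API (step 3).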
112. Transparent Server Side Encryption
Azure Data Lake Storage
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer
managed keys
Encryption @ Rest and Encryption in Transit
Microsoft Azure Storage Blob
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer managed keys
Encryption @ Rest and Encryption in Transit
All kinds of data are being generated
Stored on-premises and in the cloud, but the vast majority in hybrid environments
Customers want to reason over all this data without having to move it
They want a choice of platform and languages, and privacy and security
<Transition> Microsoft's offering
Objective: This slide describes the architecture of how Apache Spark is different, allowing it to offer better performance for data sharing.
Table Source: https://gist.github.com/jboner/2841832
Talking points:
Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly, much more quickly than disk-based systems.
Spark integrates into the Scala programming language to let you manipulate distributed data sets like local collections. No need to structure everything as map and reduce operations.
Data sharing between operations is faster, since data is in-memory.
Hadoop shares data through HDFS, an expensive option. It also maintains three replicas.
Spark stores data in-memory without any replication.
Objective: This slide explains the two types of operations that RDDs support: transformation and actions.
Talking points:
Transformations create a new data set from an existing data set.
Transformations do not compute their results right away. They are only computed when an action requires a result to be returned to the driver program. Does not apply to persistent RDDs.
Examples include: map, filter, sample, union, and more.
Actions return a value to the driver program after running a computation on the data set.
Examples include: reduce, collect, count, first, foreach, and more.
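The lazy-transformation / eager-action split can be mimicked with Python generators (this is an analogy, not the Spark API): nothing executes until a terminal operation plays the role of an action.

```python
evaluated = []

def trace(line):
    # Record when a line is actually processed
    evaluated.append(line)
    return line

lines = ["INFO ok", "ERROR disk full", "ERROR timeout"]

# "Transformation": build the pipeline lazily; no work happens yet
errors = (trace(l) for l in lines if "ERROR" in l)
print(evaluated)               # [] -- nothing evaluated so far

# "Action": consuming the generator forces evaluation
print(sum(1 for _ in errors))  # 2
print(evaluated)               # ['ERROR disk full', 'ERROR timeout']
```

Spark's `count()` and `collect()` play the role of the consuming `sum(...)` here, pulling data through the chain of deferred `filter`/`map` transformations.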
Objective: This slide shows an example of how transformations and actions are enabled to search through error messages.
Talking points:
Cache errors – Implementing this action will collect all the errors present
Count all errors – Implementing this action counts all the errors in the data
Count errors mentioning MySQL – When implementing this code, MySQL errors are counted
Fetch the MySQL errors as an array of strings – When implementing this code, MySQL errors are extracted as an array of strings
Event Detection in Realtime
FINANCIAL ENGINES
CONNECTED CAR – SENSORS FIRE
Data Landing for Learning
Use cases: connected car; insurance companies for connected driving
What are the three big components that you need to stand up when you build one?
ASK:
Who knows what Lambda architecture is?
Who has helped implement one?
Walk through:
VERTICALS: Ingest, Prep + Analyze, Serve, Consume
HORIZONTALS: driven by speed, realtime vs. batch
Let’s Walk through an example of this
We will demo this soon
TODO – add logos for Bing Ads, Office365, Delve Analytics
How do we monitor all of our resources across subscriptions with a single pane of glass?
How do we analyze Hadoop logs and metrics easily?
How do we set up alerting?