Part 1 of a conference workshop. This forms the morning session, which looks at moving from Business Intelligence to Analytics.
Topics Covered: Azure Data Explorer, Azure Data Factory, Azure Synapse Analytics, Event Hubs, HDInsight, Big Data
Part 3 - Modern Data Warehouse with Azure Synapse - Nilesh Gule
Slide deck from the third part of building a Modern Data Warehouse using Azure. This session covered Azure Synapse, formerly SQL Data Warehouse. We look at the Azure Synapse architecture, external files, and integration with Azure Data Factory.
The recording of the session is available on YouTube
https://www.youtube.com/watch?v=LZlu6_rFzm8&WT.mc_id=DP-MVP-5003170
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ignite 2020) - Michael Rys
Presentation by James Baker and myself on Running cost effective big data workloads with Azure Synapse and Azure Datalake Storage (ADLS) at Microsoft Ignite 2020. Covers Modern Data warehouse architecture supported by Azure Synapse, integration benefits with ADLS and some features that reduce cost such as Query Acceleration, integration of Spark and SQL processing with integrated meta data and .NET For Apache Spark support.
This document provides an overview of Azure Synapse Analytics, a limitless analytics service that brings together data warehousing and big data analytics capabilities. It discusses how businesses currently have to maintain separate systems for operational data/relational data and big data/semi-structured data, which Azure Synapse addresses by providing a single service for end-to-end analytics using technologies like SQL and Spark across data warehouses and data lakes at cloud scale with unmatched speed.
Azure Databricks - Apache Spark as a Service with Sascha Dittmann - Databricks
The driving force behind Apache Spark (Databricks Inc.) and Microsoft have designed a joint service to quickly and easily create Big Data and Advanced Analytics solutions. The combination of the comprehensive Databricks Unified Analytics Platform and the powerful capabilities of Microsoft Azure makes it easy to analyse data streams or large amounts of data, as well as to train AI models. Sascha Dittmann shows in this session how the new Azure service can be set up and used in various real-world scenarios. He also shows how to connect the various Azure services to the Azure Databricks service.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Build Real-Time Applications with Databricks Streaming - Databricks
This document discusses using Databricks, Spark, and Power BI for real-time data streaming. It describes a use case of a fire department needing real-time reporting of equipment locations, personnel statuses, and active incidents. The solution involves ingesting event data using Azure Event Hubs, processing the stream using Databricks and Spark Structured Streaming, storing the results in Delta Lake, and visualizing the data in Power BI dashboards. It then demonstrates the architecture by walking through creating Delta tables, streaming from Event Hubs to Delta Lake, and running a sample event simulator.
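The pipeline described above (Event Hubs into Spark Structured Streaming, then into Delta Lake) can be sketched in PySpark. This is a minimal, illustrative sketch, not the session's actual code: the payload fields (`unitId`, `location`), the connection string handling, and the Delta path are assumptions, and it presumes the azure-eventhubs-spark connector and Delta Lake are installed on the cluster.

```python
import json

def parse_incident(body: bytes) -> dict:
    """Flatten one Event Hubs message body (JSON) into a row for the Delta table.
    Field names here are invented for illustration."""
    event = json.loads(body)
    return {
        "unit_id": event["unitId"],
        "status": event.get("status", "unknown"),
        "lat": float(event["location"]["lat"]),
        "lon": float(event["location"]["lon"]),
    }

def start_stream(spark, eh_connection_string: str, delta_path: str):
    """Wire Event Hubs -> Structured Streaming -> Delta Lake (sketch only)."""
    # Spark imports and JVM access live here so parse_incident above stays
    # usable and testable without a cluster.
    conf = {
        # The azure-eventhubs-spark connector expects the connection string
        # to be encrypted with its helper.
        "eventhubs.connectionString": (
            spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils
                .encrypt(eh_connection_string)
        ),
    }
    raw = spark.readStream.format("eventhubs").options(**conf).load()
    # Event Hubs delivers each payload in a binary 'body' column.
    events = raw.selectExpr("CAST(body AS STRING) AS body")
    return (events.writeStream
                  .format("delta")
                  .option("checkpointLocation", delta_path + "/_checkpoint")
                  .start(delta_path))
```

A Power BI dashboard would then read the Delta table, as in the session's demo.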
Analytics in a Day Ft. Synapse Virtual Workshop - CCG
Say goodbye to data silos! Analytics in a Day will simplify and accelerate your journey towards the modern data warehouse. Join CCG and Microsoft for a half-day virtual workshop, hosted by James McAuliffe.
The document discusses how companies can use big data analytics and Azure Databricks to improve their customer experiences and grow their business. It provides an overview of how Wide World Importers seeks to expand its customers through an omni-channel strategy using analytics from data across its retail stores, website, and mobile apps. The document also outlines logical architectures for ingesting, storing, preparing, training models on, and serving data using Azure Databricks and other Azure services.
Modern DW Architecture
- The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
AI & Data Analytics 2018 - Azure Databricks for data scientists - Alberto Diaz Martin
This document summarizes a presentation given by Alberto Diaz Martin on Azure Databricks for data scientists. The presentation covered how Databricks can be used for infrastructure management, data exploration and visualization at scale, reducing time to value through model iterations and integrating various ML tools. It also discussed challenges for data scientists and how Databricks addresses them through features like notebooks, frameworks, and optimized infrastructure for deep learning. Demo sections showed EDA, ML pipelines, model export, and deep learning modeling capabilities in Databricks.
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Data Warehouse - Microsoft Tech Community
In this session you will learn how to develop data pipelines in Azure Data Factory and build a Cloud-based analytical solution adopting modern data warehouse approaches with Azure SQL Data Warehouse and implementing incremental ETL orchestration at scale. With the multiple sources and types of data available in an enterprise today Azure Data factory enables full integration of data and enables direct storage in Azure SQL Data Warehouse for powerful and high-performance query workloads which drive a majority of enterprise applications and business intelligence applications.
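The incremental-ETL orchestration mentioned above usually follows a high-watermark pattern: each run loads only rows modified since the last recorded watermark, then advances the watermark. Here is a plain-Python sketch of that idea under stated assumptions (the `modified_at` column and the in-memory source are invented; in Azure Data Factory the watermark would live in a control table and the filter would be pushed into the source query).

```python
from datetime import datetime

def incremental_load(source_rows, watermark: datetime):
    """One pipeline run: pick up rows modified after the watermark and
    return them together with the advanced watermark for the next run."""
    new_rows = [r for r in source_rows if r["modified_at"] > watermark]
    next_watermark = max((r["modified_at"] for r in new_rows), default=watermark)
    return new_rows, next_watermark

# Illustrative source table with a 'modified_at' change-tracking column.
source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 3)},
]
# Only id 2 is newer than the stored watermark, so only it is loaded.
batch, watermark = incremental_load(source, datetime(2024, 1, 2))
```

Running the same function again with the advanced watermark loads nothing, which is what makes repeated scheduled runs idempotent.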
Apache Spark is a fast and general engine for large-scale data processing. It was created by UC Berkeley and is now the dominant framework in big data. Spark can run programs over 100x faster than Hadoop in memory, or more than 10x faster on disk. It supports Scala, Java, Python, and R. Databricks provides a Spark platform on Azure that is optimized for performance and integrates tightly with other Azure services. Key benefits of Databricks on Azure include security, ease of use, data access, high performance, and the ability to solve complex analytics problems.
This document discusses using Azure Data Factory (ADF) for data lake ETL processes in the cloud. It describes how ADF can ingest data from on-premises, cloud, and SaaS sources into a data lake for preparation, transformation, enrichment, and serving to downstream analytics or machine learning processes. The document also provides several links to YouTube videos and articles about using ADF for these tasks.
This document discusses various integration patterns and architectures that involve Microsoft Azure and BizTalk Server. It presents questions that customers may ask about integration solutions. It also provides examples of hybrid integration architectures that leverage Azure services like Service Bus along with on-premises BizTalk Server. The document aims to help customers analyze requirements and evaluate different architectural options for their integration needs.
Data Saturday Malta - ADX Azure Data Explorer overview - Riccardo Zamana
This is a step-by-step walk through the entire ecosystem of features driven by Azure Data eXplorer. You can find many examples using the Kusto dialect for acquiring data, processing it, and building up complete web interfaces using only one service: ADX.
Slidedeck related to the talk presented at the Manila Data Day event March 2020. The demo covers Azure services like Data Lake Storage (Gen 2), Azure Data Factory, Azure Databricks, Azure Synapse, Key Vault and Active directory to build a modern data warehouse.
Using Redash for SQL Analytics on Databricks - Databricks
This talk gives a brief overview with a demo performing SQL analytics with Redash and Databricks. We will introduce some of the new features coming as part of our integration with Databricks following the acquisition earlier this year, along with a demo of the other Redash features that enable a productive SQL experience on top of Delta Lake.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs of every enterprise, and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys an end-to-end, highly integrated single offering that is easy to understand, onboard, create and operate.
This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.
Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources
So you've got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it, to visualizing it, I will show you Microsoft’s solutions for every step of the way.
Big Data Expo 2015 - Microsoft: Transform your data into intelligent action - BigDataExpo
There are many promises around Big Data. Everyone talks about it, but how do you get started without immediately having to draw up a big business case? The Cortana Analytics Suite is an approachable, easily accessible Advanced Analytics platform for testing the feasibility of your ideas and then growing into (large) production implementations. In this session you will get an overview of the scenarios that Cortana Analytics offers: think IoT and Machine Learning, but also Churn Analysis, Forecasting and Predictive Maintenance.
How the Azure ecosystem plays a crucial role in your IoT solution (Glenn C...) - Codit
The document discusses how the Azure ecosystem plays a crucial role in IoT solutions. It outlines key Azure services for connecting devices, processing streaming data, implementing business logic, enabling connectivity, and providing insights. These services include IoT Hub for device connectivity, Stream Analytics for real-time analytics, Service Fabric for business logic, Logic Apps for connectivity, and Time Series Insights for streaming insights. The document also presents the Azure IoT reference architecture and recommends starting with preconfigured solutions like IoT Central to get up and running quickly.
IoT - Lessons learned from customer projects in the IoT domain. Michael Epprecht, Technical Specialist in the Global Black Belt IoT Team at Microsoft. Talk given at the Swiss Data Forum, 24 November 2015, in Lausanne.
This document discusses the future of data and the Azure data ecosystem. It highlights that by 2025 there will be 175 zettabytes of data in the world and the average person will have over 5,000 digital interactions per day. It promotes Azure services like Power BI, Azure Synapse Analytics, Azure Data Factory and Azure Machine Learning for extracting value from data through analytics, visualization and machine learning. The document provides overviews of key Azure data and analytics services and how they fit together in an end-to-end data platform for business intelligence, artificial intelligence and continuous intelligence applications.
This document provides an overview of Azure Synapse Analytics and its key capabilities. Azure Synapse Analytics is a limitless analytics service that brings together enterprise data warehousing and big data analytics. It allows querying data on-demand or at scale using serverless or provisioned resources. The document outlines Synapse's integrated data platform capabilities for business intelligence, artificial intelligence and continuous intelligence. It also describes the different types of analytics workloads that Synapse supports and key architectural components like the dedicated SQL pool and massively parallel processing concepts.
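The massively parallel processing idea mentioned above can be made concrete with a toy model: a hash-distributed table in a dedicated SQL pool assigns each row to one of 60 fixed distributions by hashing the value in the chosen distribution column. The sketch below is purely conceptual (Python's `hash()` stands in for Synapse's internal hash function, and the table and column names are invented).

```python
NUM_DISTRIBUTIONS = 60  # fixed distribution count in a dedicated SQL pool

def distribution_for(distribution_key) -> int:
    """Map a distribution-column value to a distribution (illustrative only;
    not Synapse's actual hash function)."""
    return hash(distribution_key) % NUM_DISTRIBUTIONS

# Rows that share a key always land in the same distribution, so joins and
# aggregations on that key avoid data movement between distributions.
orders = [("cust_1", 9.99), ("cust_2", 4.50), ("cust_1", 12.00)]
placement = {}
for customer, amount in orders:
    placement.setdefault(distribution_for(customer), []).append((customer, amount))
```

This is why the choice of distribution column matters: a skewed column concentrates rows (and work) on a few distributions, while a high-cardinality join key spreads them evenly.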
Instructor: Ivan Cheng, Solution Architect, AWS
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
The document discusses building an end-to-end analytic solution in the cloud using Microsoft Azure tools, including ingesting data from various sources into Azure Data Factory, storing it in Azure Data Lake, transforming the data using U-SQL scripts in Azure Data Lake Analytics, developing predictive models with Azure Machine Learning Studio, and visualizing insights with Power BI. It provides examples of how each tool in the analytic lifecycle can be leveraged as part of an overall cloud-based analytics solution handling large volumes of data.
In this session we will delve into the world of Azure Databricks and analyse why it is becoming a fundamental tool for data scientists and/or data engineers in conjunction with Azure services.
Azure Machine Learning Services provides an end-to-end, scalable platform for operationalizing machine learning models. It allows users to deploy models everywhere, from containers and Kubernetes to SQL Data Warehouse and Cosmos DB. It also offers tools to boost data science productivity, increase experimentation, and automate model retraining. The platform seamlessly integrates with Azure services and is built to deploy models globally at scale with high availability and low latency.
MongoDB IoT City Tour STUTTGART: The Microsoft Azure Platform for IoT - MongoDB
Presented by, Dr Christian Geuer-Pollmann, Senior Technology Evangelist at Microsoft.
The presentation gives a solid overview of the Microsoft Azure platform, with a special emphasis on scenarios for IoT workloads. First, Christian provides an introduction to Microsoft Azure’s IaaS compute and networking infrastructure (i.e. virtual machines, virtual networks, load balancers and HA concepts). The second part of the presentation focuses on higher-order services in Azure, such as relational databases, machine learning, search, and NoSQL offerings. Last, Christian explains how the Azure Service Bus and the Intelligent Systems Services fit into the overall IoT landscape.
Azure provides cloud computing services including computing, analytics, networking, storage, and more. It offers virtual machines, databases, websites, and other services that can be accessed from anywhere and scaled up as needed. Azure aims to provide enterprise-grade services that are economical, scalable, and hybrid-ready to work with existing on-premises systems. It has data centers across the world and over 600,000 servers to provide its services globally at scale.
Comparing Microsoft Big Data Platform Technologies - Jen Stirrup
In this segment, we look at technologies such as HDInsight, Azure Databricks, Azure Data Lake Analytics and Apache Spark. We compare the technologies to help you to decide the best technology for your situation.
Estimating the Total Costs of Your Cloud Analytics Platform - DATAVERSITY
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
Azure Data Explorer deep dive - review 04.2020 - Riccardo Zamana
Modern Data Science Lifecycle with ADX & Azure
This document discusses using Azure Data Explorer (ADX) for data science workflows. ADX is a fully managed analytics service for real-time analysis of streaming data. It allows for ad-hoc querying of data using Kusto Query Language (KQL) and integrates with various Azure data ingestion sources. The document provides an overview of the ADX architecture and compares it to other time series databases. It also covers best practices for ingesting data, visualizing results, and automating workflows using tools like Azure Data Factory.
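To give a flavour of the ad-hoc KQL querying described above, here is a small example query plus the usual Python wiring via the `azure-kusto-data` SDK. The cluster URL, database, and `Telemetry` table are invented for illustration, and the SDK import is kept inside the function so the query text stands on its own.

```python
# Hypothetical KQL: count events per hour over the last day from a
# 'Telemetry' table (table and column names are invented).
KQL_EVENTS_PER_HOUR = """
Telemetry
| where Timestamp > ago(1d)
| summarize Events = count() by bin(Timestamp, 1h)
| order by Timestamp asc
"""

def run_query(cluster_url: str, database: str, query: str):
    """Execute a KQL query against an ADX cluster.
    Requires `pip install azure-kusto-data` and an authenticated Azure CLI."""
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster_url)
    client = KustoClient(kcsb)
    response = client.execute(database, query)
    # The first primary result set holds the query's rows.
    return response.primary_results[0]
```

The same query text could equally be run from the ADX web UI or scheduled from Azure Data Factory, per the automation section of the deck.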
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides the freedom to query data at scale using either serverless or dedicated options. Azure HDInsight allows the use of open source frameworks like Hadoop, Spark, Hive, and Kafka for processing large volumes of data. Azure Databricks offers environments for SQL, data science/engineering, and machine learning. The Azure IoT Hub enables scalable IoT solutions by allowing bidirectional communication between IoT applications and connected devices.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... - Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
AI Applications in Healthcare and Medicine - Jen Stirrup
This session was delivered for the Global Business Roundtable. The topic: AI applications in Healthcare and Medicine. In this session, Jennifer Stirrup takes people through a general process of adopting AI in their organisations.
BUILDING A STRONG FOUNDATION FOR SUCCESS WITH BI AND DIGITAL TRANSFORMATION - Jen Stirrup
The objective of Digital Transformation is to improve the quality and resilience of digital services to serve customers better, and data is a crucial part of fulfilling that ambition. As the organisation moves forward in pursuit of its strategic ambitions, it will need to remain focused on the stabilisation and improvement of existing technology and data foundations. To succeed, organisations need to continuously strive to improve data, systems and processes for people using digital solutions; it is not simply digitising paper processes. The challenge of digital transformation is to work with people, but how can you build systems that serve them well so they achieve and deliver more in a customer-focused way? Innovators will relish the opportunity to adopt new technology, but laggards are often waiting for proof that it will help them deliver better services or products. The challenge is that the adoption of digital solutions varies significantly from one person to the next, one team to the next and one organisation to the next. In this keynote, there will be a discussion of the industry landscape followed by takeaways that will help digital transformation in your organization.
1. Do more than get the basics right
2. Build confidence in changes through better use of data
3. How to oversee delivery while considering strategy
CuRious about R in Power BI? End to end R in Power BI for beginners - Jen Stirrup
R is a widely used open-source statistical software environment used by over 2 million data scientists and analysts. It is based on the S programming language and is developed by the R Foundation. R provides a flexible and powerful environment for statistical analysis, modeling, and data visualization. Some key advantages include being free, having an extensive community for support, and allowing for automated replication through scripting. However, it also has some drawbacks like having a steep learning curve and scripts sometimes being difficult to understand.
Artificial Intelligence Ethics keynote: With Great Power, comes Great Respons... - Jen Stirrup
Artificial Intelligence has been receiving some bad press recently, with respect to its ethical consequences in terms of changes to working conditions, deepfake technology and even job losses. Organizations are concerned about bias in their data, perpetuating stereotypes and neglecting responsibility. How can AI systems treat all people fairly? What about concerns of safety and reliability?
In this keynote, we will explore the toolkits available in Azure to help businesses to navigate the complex ethics environment. Join this session to understand what Microsoft can offer in terms of supporting organisations to consider ethics as an integral part of their AI solutions.
Introduction to Analytics with Azure Notebooks and Python - Jen Stirrup
Introduction to Analytics with Azure Notebooks and Python for Data Science and Business Intelligence. This is one part of a full day workshop on moving from BI to Analytics
When looking at Sales Analytics, where should you start? What should you measure? This session provides ideas on sales metrics, implemented in Power BI
This document provides guidance on creating an effective digital marketing analytics dashboard using Power BI. It recommends connecting to Google Analytics as a primary data source and including visualizations of key performance indicators (KPIs) like impressions, clicks, and spending over time. The dashboard should allow users to interact with the data by selecting specific time periods to analyze and compare metrics. Color coding and tooltips can also help users understand relationships in the data and drill down into further details.
Diversity and inclusion for the newbies and doers - Jen Stirrup
This presentation is aimed at people who want to *do* something positive for diversity and inclusion in their workplaces and communities, but don't know where to start to have a quick impact. I've made up a checklist of 7 'E's to help people along. We cover crucial topics such as: • What can we do to tackle unconscious bias in our systems, solutions and interactions with others? • How can we be more inclusive towards others? • How can we encourage and mentor younger generations to get involved in STEM topics and technical roles both as leaders and in the communities of people who surround us? I hope you enjoy this interactive and thought-provoking discussion of diversity and inclusion, aimed at people who want to get started and do something positive and impactful to help others.
Artificial Intelligence from the Business perspective - Jen Stirrup
What is AI from the Business perspective? In this presentation, Jen Stirrup discusses the 8 'C's of Artificial Intelligence from the business leadership perspective.
How to be successful with Artificial Intelligence - from small to success - Jen Stirrup
Keynote from AI World Congress in October 2019. Artificial Intelligence isn't just for the techies; it is crucial that business-oriented individuals adopt this technology, which can be conceived as the fourth industrial age. Artificial intelligence is becoming closer to being a part of our daily lives through the use of technologies like virtual assistants such as Alexa, smart homes, and automated customer service. Now, we are running the race not just to win, but to survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and futurist ideas are developing into reality at accelerated rates.
How can you help your company to evolve, adapt and succeed using Artificial Intelligence to stay at the forefront of the competition, and win the race for AI adoption in your organization? What are the potential issues, complications and benefits that artificial intelligence could bring to us and our organisations? In this session, Jen Stirrup will explain the quick wins to win the Red Queen's Race in Artificial Intelligence.
Artificial Intelligence: Winning the Red Queen’s Race Keynote at ESPC with Jen Stirrup
Artificial Intelligence is popularised in fiction films such as “The Terminator” and “AI: Artificial Intelligence”. Now, artificial intelligence is becoming closer to being a part of our daily lives through the use of technologies like virtual assistants such as Cortana, smart homes, and automated customer service.
Now, we are running the Red Queen’s race not just to win, but to survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and futurist ideas are developing into reality at accelerated rates.
How can you help your company to evolve, adapt and succeed using Artificial Intelligence to stay at the forefront of the competition, and win the Red Queen’s Race? What are the potential issues, complications and benefits that artificial intelligence could bring to us and our organisations?
In this keynote, Jen Stirrup explains the quick wins to win the Red Queen’s Race, using demos from Microsoft technologies such as AutoML to help you and your organisation win the Red Queen’s race.
Data Visualization - a dataviz superpower! Guidelines on using best-practice data visualization principles for Power BI, Excel, SSRS, Tableau and other great tools!
R - what do the numbers mean? #RStats. This is the presentation for my demo at Orlando Live360 AILive. We go through statistics interpretation with examples.
Artificial Intelligence and Deep Learning in Azure, CNTK and Tensorflow - Jen Stirrup
Artificial Intelligence and Deep Learning in Azure, using Open Source technologies CNTK and Tensorflow. The tutorial can be found on GitHub here: https://github.com/Microsoft/CNTK/tree/master/Tutorials
and the CNTK video can be found here: https://youtu.be/qgwaP43ZIwA
Blockchain Demystified for Business Intelligence Professionals - Jen Stirrup
Blockchain is a transformational technology with the potential to extend digital transformation beyond an organization and into the processes it shares with suppliers, customers, and partners.
What is blockchain? What can it do for my organization? How can your organisation manage a blockchain implementation? How does it work in Azure?
Join this session to learn about blockchain and see it in action. We will also discuss the use cases for blockchain, and whether it is here to stay.
Examples of the worst data visualization ever - Jen Stirrup
This document summarizes an event called SQL Saturday Cork where Jen Stirrup gave a presentation on data visualizations. The document includes objectives for the presentation such as discussing inaccurate data sources and the use of dark colors to represent higher values. It also includes examples of Zimbabwean inflation rates from 1980 to 2008 shown in a table and chart to illustrate how data can be visualized.
Digital Transformation for the Human Resources Leader - Jen Stirrup
The document discusses digital transformation for HR leaders. It provides advice on how HR can effectively manage digital transformation through principles like:
1) Communicating change and ensuring employee buy-in for new technologies.
2) Carefully planning the transformation journey and getting feedback to adjust.
3) Engaging and retaining employees through the changes using data and visualization.
Selecting the right digital HR platforms requires research, putting business needs first, and testing projects before full implementation to define success. The biggest challenge is supporting employees through technological changes while improving their working lives.
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation - Jen Stirrup
Contact details:
Jen.Stirrup@datarelish.com
In a world where the HiPPO (Highest Paid Person’s Opinion) is final, how can we use technology to drive the organisation towards data-driven decision making as part of its organizational DNA? R provides a range of functionality in machine learning, but we need to expose its richness in a world where it is made accessible to decision makers. Using Data Storytelling with R, we can imprint data in the culture of the organization by making it easily accessible to everyone, including decision makers. Together, the insights and process of machine learning are combined with data visualisation to help organisations derive value and insights from big and little data.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, and percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor. We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsMydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Video traffic on the Internet is constantly growing; networked multimedia applications consume a predominant share of the available Internet bandwidth. A major technical breakthrough and enabler in multimedia systems research and of industrial networked multimedia services certainly was the HTTP Adaptive Streaming (HAS) technique. This resulted in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH) which, together with HTTP Live Streaming (HLS), is widely used for multimedia delivery in today’s networks. Existing challenges in multimedia systems research deal with the trade-off between (i) the ever-increasing content complexity, (ii) various requirements with respect to time (most importantly, latency), and (iii) quality of experience (QoE). Optimizing towards one aspect usually negatively impacts at least one of the other two aspects if not both. This situation sets the stage for our research work in the ATHENA Christian Doppler (CD) Laboratory (Adaptive Streaming over HTTP and Emerging Networked Multimedia Services; https://athena.itec.aau.at/), jointly funded by public sources and industry. In this talk, we will present selected novel approaches and research results of the first year of the ATHENA CD Lab’s operation. We will highlight HAS-related research on (i) multimedia content provisioning (machine learning for video encoding); (ii) multimedia content delivery (support of edge processing and virtualized network functions for video networking); (iii) multimedia content consumption and end-to-end aspects (player-triggered segment retransmissions to improve video playout quality); and (iv) novel QoE investigations (adaptive point cloud streaming). We will also put the work into the context of international multimedia systems research.
The Rise of Supernetwork Data Intensive ComputingLarry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Details of description part II: Describing images in practice - Tech Forum 2024BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
AI_dev Europe 2024 - From OpenAI to Opensource AIRaphaël Semeteys
Navigating Between Commercial Ownership and Collaborative Openness
This presentation explores the evolution of generative AI, highlighting the trajectories of various models such as GPT-4, and examining the dynamics between commercial interests and the ethics of open collaboration. We offer an in-depth analysis of the levels of openness of different language models, assessing various components and aspects, and exploring how the (de)centralization of computing power and technology could shape the future of AI research and development. Additionally, we explore concrete examples like LLaMA and its descendants, as well as other open and collaborative projects, which illustrate the diversity and creativity in the field, while navigating the complex waters of intellectual property and licensing.
AC Atlassian Coimbatore Session Slides( 22/06/2024)apoorva2579
This is the combined Sessions of ACE Atlassian Coimbatore event happened on 22nd June 2024
The session order is as follows:
1.AI and future of help desk by Rajesh Shanmugam
2. Harnessing the power of GenAI for your business by Siddharth
3. Fallacies of GenAI by Raju Kandaswamy
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Performance Budgets for the Real World by Tammy EvertsScyllaDB
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
How to Avoid Learning the Linux-Kernel Memory ModelScyllaDB
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with a simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
In this follow-up session on knowledge and prompt engineering, we will explore structured prompting, chain of thought prompting, iterative prompting, prompt optimization, emotional language prompts, and the inclusion of user signals and industry-specific data to enhance LLM performance.
Join EIS Founder & CEO Seth Earley and special guest Nick Usborne, Copywriter, Trainer, and Speaker, as they delve into these methodologies to improve AI-driven knowledge processes for employees and customers alike.
Blockchain and Cyber Defense Strategies in new genre timesanupriti
Explore robust defense strategies at the intersection of blockchain technology and cybersecurity. This presentation delves into proactive measures and innovative approaches to safeguarding blockchain networks against evolving cyber threats. Discover how secure blockchain implementations can enhance resilience, protect data integrity, and ensure trust in digital transactions. Gain insights into cutting-edge security protocols and best practices essential for mitigating risks in the blockchain ecosystem.
Interaction Latency: Square's User-Centric Mobile Performance MetricScyllaDB
Mobile performance metrics often take inspiration from the backend world and measure resource usage (CPU usage, memory usage, etc) and workload durations (how long a piece of code takes to run).
However, mobile apps are used by humans and the app performance directly impacts their experience, so we should primarily track user-centric mobile performance metrics. Following the lead of tech giants, the mobile industry at large is now adopting the tracking of app launch time and smoothness (jank during motion).
At Square, our customers spend most of their time in the app long after it's launched, and they don't scroll much, so app launch time and smoothness aren't critical metrics. What should we track instead?
This talk will introduce you to Interaction Latency, a user-centric mobile performance metric inspired from the Web Vital metric Interaction to Next Paint"" (web.dev/inp). We'll go over why apps need to track this, how to properly implement its tracking (it's tricky!), how to aggregate this metric and what thresholds you should target.
6. What to use and when?
● Azure Synapse Analytics: a fully managed, elastic data warehouse with security at every level of scale at no extra cost.
● Azure Databricks: a fast, easy and collaborative Apache Spark-based analytics platform.
● HDInsight: a fully managed cloud Hadoop and Spark service backed by a 99.9% SLA for your enterprise.
● Machine Learning: a fully managed cloud service that enables you to easily build, deploy and share predictive analytics solutions.
● Stream Analytics: an on-demand, real-time stream processing service with enterprise-grade security, auditing and support.
7. What to use and when?
● Data Lake Store: a no-limits data lake built to support massively parallel analytics.
● Data Lake Analytics: a fully managed, on-demand, pay-per-job analytics service with enterprise-grade security, auditing and support.
● Azure Data Catalog: an enterprise-wide metadata catalogue that makes data asset discovery simple.
● Data Factory: a data integration service to orchestrate and automate data movement and transformation.
10. Azure Data Explorer
Jupyter Notebook allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
KQL magic commands extend the functionality of the Python kernel in Jupyter Notebook. KQL magic allows you to write KQL queries natively and query data from Microsoft Azure Data Explorer. You can easily interchange between Python and KQL, and visualize data using the rich Plotly library integrated with KQL render commands. KQL magic supports Azure Data Explorer, Application Insights, and Log Analytics as data sources to run queries against.
12. Azure Data Explorer
Fast and highly scalable data exploration service. Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of data streaming from applications, websites, IoT devices and more.
13. Azure Data Explorer
● Low-latency ingestion
● Fast read-only queries with high concurrency
● Query large amounts of structured, semi-structured (JSON-like nested types) and unstructured (free-text) data
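The kind of aggregation a KQL query runs over semi-structured records (for example, `summarize count() by` a nested field) can be sketched in plain Python. This is a toy stand-in for intuition only, with invented event data, not the ADX engine or the Kqlmagic API:

```python
from collections import Counter

# Toy stand-in for a KQL query such as:
#   Events | summarize count() by Device.Type
# Each record is a JSON-like nested structure, as ADX ingests it.
events = [
    {"device": {"type": "thermostat"}, "reading": 21.5},
    {"device": {"type": "thermostat"}, "reading": 22.0},
    {"device": {"type": "camera"},     "reading": 0.0},
]

def summarize_count_by(records, *path):
    """Count records grouped by a nested field, KQL-summarize style."""
    counts = Counter()
    for record in records:
        value = record
        for key in path:          # walk the nested JSON path
            value = value[key]
        counts[value] += 1
    return dict(counts)

print(summarize_count_by(events, "device", "type"))
# {'thermostat': 2, 'camera': 1}
```

In ADX the same grouping runs as a distributed read-only query over ingested data rather than an in-memory loop.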
18. Data Factory
● No code or maintenance required to build hybrid ETL and ELT pipelines within the Data Factory visual environment.
● Cost-efficient and fully managed serverless cloud data integration tool that scales on demand.
19. Data Factory
● Azure security measures to connect to on-premises, cloud-based and software-as-a-service apps with peace of mind.
● SSIS integration runtime to easily move SSIS ETL workloads into the cloud with minimal effort.
20. Data Factory
● Ingest, move, prepare, transform and process your data in a few clicks, and complete your data modelling within the accessible visual environment.
21. Why Data Factory?
● Orchestrate, monitor and schedule data pipelines
● Automatic cloud resource management
● A single pane of glass
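What "orchestrate" means here can be illustrated with a toy, dependency-ordered pipeline runner (a hypothetical miniature for intuition only, not the Data Factory API; activity names are invented):

```python
# Toy orchestrator: run activities in dependency order, recording status,
# roughly what a pipeline service does when it schedules activities.
def run_pipeline(activities, dependencies):
    """activities: name -> callable; dependencies: name -> upstream names."""
    done, log = set(), []

    def run(name):
        if name in done:
            return
        for upstream in dependencies.get(name, []):
            run(upstream)              # upstream activities must finish first
        activities[name]()
        done.add(name)
        log.append((name, "Succeeded"))

    for name in activities:
        run(name)
    return log

staged = []
log = run_pipeline(
    {"copy_raw":  lambda: staged.append("raw"),
     "transform": lambda: staged.append("curated")},
    {"transform": ["copy_raw"]},
)
print(log)  # [('copy_raw', 'Succeeded'), ('transform', 'Succeeded')]
```

Data Factory adds the pieces this sketch leaves out: triggers for scheduling, retries, monitoring, and managed compute behind each activity.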
25. Stream Analytics
● Build streaming pipelines in minutes - Run complex analytics with no need to learn new processing frameworks or provision virtual machines (VMs) or clusters. Use the familiar SQL language, extensible with JavaScript and C# custom code for more advanced use cases. Easily enable scenarios such as low-latency dashboarding, streaming ETL and real-time alerting with one-click integration across sources and sinks.
● Run mission-critical workloads with subsecond latencies - Get guaranteed, "exactly once" event processing with 99.9% availability and built-in recovery capabilities. Easily set up a continuous integration and continuous delivery (CI/CD) pipeline and achieve subsecond latencies on your most demanding workloads.
● Deploy in the cloud and on the edge - Bring real-time insights and analytics capabilities closer to where your data originates. Enable new scenarios with true hybrid architectures for stream processing and run the same query in the cloud or on the edge.
● Power real-time analytics with artificial intelligence - Take advantage of built-in machine learning (ML) models to shorten time to insights. Use ML-based capabilities to perform anomaly detection directly in your streaming jobs with Azure Stream Analytics.
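A core building block of those streaming queries is the tumbling window (GROUP BY TumblingWindow(second, 10) in the Stream Analytics query language). Its semantics can be sketched in plain Python with invented timestamped events:

```python
from collections import defaultdict

# Toy tumbling-window count: every event lands in exactly one fixed,
# non-overlapping window, keyed by the window's start time.
def tumbling_count(events, window_seconds):
    windows = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(windows)

events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]
print(tumbling_count(events, 10))
# {0: 2, 10: 2, 20: 1}
```

Stream Analytics evaluates the same grouping continuously over an unbounded stream and emits one result per window as it closes, rather than over a finished list.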
28. Event Hubs
● A hyper-scale telemetry ingestion service that collects, transforms and stores millions of events.
● Event Hubs is a fully managed, real-time data ingestion service that's simple, trusted and scalable.
29. Event Hubs
● Integrate seamlessly with other Azure services to unlock valuable insights.
● Experience real-time data ingestion and microbatching on the same stream.
30. Event Hubs
● Focus on drawing insights from your data instead of managing infrastructure. Build real-time big data pipelines and respond to business challenges right away.
● Build real-time data pipelines with just a couple of clicks. Seamlessly integrate with Azure data services to uncover insights faster.
31. Event Hubs
● Ingest millions of events per second - Continuously ingress data from hundreds of thousands of sources with low latency and configurable time retention.
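The client-side batching that makes that throughput practical (the "microbatching" above) can be sketched as a toy batcher. This is a stand-in for the pattern, not the Event Hubs SDK; the class and sizes are invented:

```python
# Toy micro-batcher: accumulate events and flush in fixed-size batches,
# amortizing the per-send overhead across many events.
class Batcher:
    def __init__(self, batch_size, send):
        self.batch_size, self.send, self.buffer = batch_size, send, []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))   # one "network call" per batch
            self.buffer.clear()

sent = []
b = Batcher(batch_size=3, send=sent.append)
for i in range(7):
    b.add(i)
b.flush()                                  # drain the partial final batch
print(sent)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The real service adds partitioning, retention and checkpointing on top of this basic batching idea.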
33. Analysis Services
Focus on solving business problems, not learning new skills, when you use the familiar, integrated development environment of Visual Studio. Easily deploy your existing SQL Server 2016 tabular models to the cloud.
35. Data Lake Analytics
● Easily develop and run massively parallel data transformation and processing programs in U-SQL, R, Python and .NET over petabytes of data. With no infrastructure to manage, you can process data on demand, scale instantly and only pay per job.
36. Data Lake Analytics
● Process big data jobs in seconds with Azure Data Lake Analytics. There is no infrastructure to worry about because there are no servers, virtual machines or clusters to wait for, manage or tune.
37. Data Lake Analytics
● Instantly scale the processing power, measured in Azure Data Lake Analytics Units (AU), from one to thousands for each job. You only pay for the processing that you use per job.
38. Data Lake Analytics
● U-SQL is a simple, expressive and extensible language that allows you to write code once and have it automatically parallelised for the scale you need.
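The shape of a typical U-SQL job (EXTRACT from files, a SQL-like transform, OUTPUT of results) can be sketched in Python. This is a toy in-memory version with invented sales rows, not U-SQL itself:

```python
from collections import defaultdict

# Toy version of the classic U-SQL job shape:
#   @rows   = EXTRACT ... FROM "/data/sales.csv" USING Extractors.Csv();
#   @totals = SELECT Region, SUM(Amount) FROM @rows GROUP BY Region;
#   OUTPUT @totals TO ... USING Outputters.Csv();
def extract(csv_lines):
    for line in csv_lines:
        region, amount = line.split(",")
        yield region, float(amount)

def total_by_region(rows):
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)

sales = ["EMEA,100.0", "APAC,40.0", "EMEA,60.0"]
print(total_by_region(extract(sales)))
# {'EMEA': 160.0, 'APAC': 40.0}
```

In Data Lake Analytics the extract and aggregation stages are what the engine parallelises across the AUs you assign to the job.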
39. Data Lake Analytics
● Process petabytes of data for diverse workload categories such as querying, ETL, analytics, machine learning, machine translation, image processing and sentiment analysis by leveraging existing libraries written in .NET languages, R or Python.
45. What is big data?
• "When you have to innovate to collect, store, organize, analyse and share it."
- Werner Vogels, Amazon CTO
46. What is Big Data?
• Traditionally…..
– Physics Experiments
– Sensor data
– Satellite data
47. Now?
• Now: zettabytes in the cloud expected by the end of next year
• https://datarelish.net/2019/11/07/whats-the-future-for-cloud-data-storage-clouds-of-glass/
48. Azure Synapse
● Azure Synapse delivers insights from all your data, across data warehouses and big data analytics systems, with blazing speed.
● Data professionals can query both relational and non-relational data at petabyte scale using the familiar SQL language.
● Credit to the Microsoft team for help with these decks.
49. Azure Synapse
● Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics.
● Query data on your terms, using either serverless on-demand or provisioned resources, at scale.
51. Azure Synapse
• Support for SSDT with Visual Studio 2019
• Native platform integration with Azure DevOps
• Built-in continuous integration and deployment (CI/CD) capabilities for enterprise-level deployments
53. Azure Synapse
● Best-in-class price per performance: up to 94% less expensive than competitors.
● Developer productivity: use preferred tooling for SQL data warehouse development.
● Intelligent workload management: prioritize resources for the most valuable workloads.
● Data flexibility: ingest a variety of data sources to derive the maximum benefit.
● Industry-leading security: defense-in-depth security and a 99.9% financially backed availability SLA.
54. Power BI: Import, DirectQuery, Composite Models & Aggregation Tables
● Import: great for small data sources and personal data discovery; fine for CSV files, spreadsheet data and summarized OLTP data.
● DirectQuery: the enterprise solution; avoid data movement and delegate query work to the back-end source, taking advantage of Azure SQL Data Warehouse's advanced features.
● Composite Models & Aggregation Tables: why choose? Import and DirectQuery in a single model; keep summarized data local and get detail data from the source.
55. Azure SQL Data Warehouse
● Best-in-class price per performance: up to 94% less expensive than competitors.
● Developer productivity: use preferred tooling for SQL data warehouse development.
● Intelligent workload management: prioritize resources for the most valuable workloads.
● Data flexibility: ingest a variety of data sources to derive the maximum benefit.
● Industry-leading security: defense-in-depth security and a 99.9% financially backed availability SLA.
56. Complete Data Security
● Data Protection: data in transit; data encryption at rest; data discovery and classification
● Access Control: object-level security (tables/views); row-level security; column-level security; dynamic data masking
● Authentication: SQL login; Azure Active Directory; multi-factor authentication
● Network Security: virtual networks; firewall; Azure ExpressRoute
● Threat Protection: threat detection; auditing; vulnerability assessment
57. Workload Management
● Intra-cluster workload isolation (scale in)
● Predictable cost
● Online elasticity
● Efficient for unpredictable workloads
● No cache eviction for scaling

CREATE WORKLOAD GROUP Sales
WITH
(
    MIN_PERCENTAGE_RESOURCE = 60,
    CAP_PERCENTAGE_RESOURCE = 100,
    MAX_CONCURRENCY = 6
);
67. Making business data accessible
PolyBase provides a scalable, T-SQL-compatible query processing framework for combining data from both universes: relational stores and big data.
68. PolyBase Purpose
                           Consumer       Analyst         Scientist
Data Volume                Medium to Low  Reasonable      High to Huge
Degree of Structure        Very High      Some            Low to None
Number of Users            Very High      Medium          Low
Transformation Complexity  Low            Medium to High  High
Analytics Complexity       Low            Medium          Very High
71. Agenda
• Why machine learning in SQL Server?
• How to leverage:
– SQL Compute context
– sp_execute_external_script features
– PREDICT T-SQL Function
• Call to action
• Questions
72. Why machine learning with SQL Server?
● Reduce or eliminate data movement with in-database analytics
● Operationalize machine learning models
● Get enterprise scale, performance, and security
73. Machine Learning Services
• R/Python integration design
  – Invokes the runtime outside of the SQL Server process
  – Batch-oriented operations
• SQL compute context
75. Typical machine learning workflow against a database
From any R/Python IDE on the data scientist's workstation:
1. Pull data from SQL Server:
   train <- sqlQuery(connection, "select * from nyctaxi_sample")
2. Execute locally:
   model <- glm(formula, train)
3. Model output
76. Machine learning workflow using SQL compute context
From any R/Python IDE on the data scientist's workstation:
1. Script sets the SQL Server compute context:
   cc <- RxInSqlServer(connectionString, computeContext)
   rxLogit(formula, cc)
2. Execution inside SQL Server 2017 Machine Learning Services (R/Python runtime)
3. rx* output
4. Model or predictions returned
79. Push data from SQL Server to the external runtime
sp_execute_external_script
    @input_data_1 = N'SELECT * FROM TrainingData'
The input dataset arrives as an R data.frame or a pandas DataFrame.
80. Read files with R Server
• R can read almost all flat text files, such as CSV and TXT, as well as formats like SPSS.
• Provide the file path, and R reads the file from that location.
81. RStudio & XDF files
• RStudio can easily convert our text files, whether CSV or other text formats, into the XDF format.
• XDF files can only be read by R, and they are very small compared to other file formats.
83. Convert a file to XDF
• TXT or CSV files can be converted to the XDF format.
• XDF files can only be read by R, and they are very small compared to other file formats.
84. rxCrossTabs
rxCrossTabs() is used to create contingency tables from cross-classifying factors using a formula interface.
rxCrossTabs() is also used to compute sums according to combinations of different variables.
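The idea behind a contingency table (counting how often each combination of two cross-classifying factors occurs) can be sketched in plain Python with invented rows; the real rxCrossTabs() does this over XDF files at scale via a formula interface:

```python
from collections import Counter

# Toy contingency table: count occurrences of each combination
# of two factors, the core of what rxCrossTabs computes.
def cross_tabs(rows, factor_a, factor_b):
    return Counter((row[factor_a], row[factor_b]) for row in rows)

rows = [
    {"gender": "F", "product": "bike"},
    {"gender": "F", "product": "bike"},
    {"gender": "M", "product": "car"},
]
print(cross_tabs(rows, "gender", "product"))
# Counter({('F', 'bike'): 2, ('M', 'car'): 1})
```

rxCrossTabs() additionally lays the counts (or sums) out as a cross-tabulated table rather than a flat counter.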
85. rxCube
• rxCube() performs a very similar function to rxCrossTabs().
• It computes tabulated sums or means.
• rxCube() produces the sums or means in long format rather than a table.
• This can be useful when we want to aggregate data for further analysis within R.
86. The dplyrXdf package
The dplyr package is a popular toolkit for data transformation and manipulation.
dplyr supports data frames and data tables (from the data.table package) as backends.
The dplyrXdf package implements such a backend for the XDF file format, a technology supplied as part of Revolution R Enterprise.
87. ggplot2
ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner.
It can be used to create the most common graph types, and a very wide range of useful plots beyond them.
89. Custom visualizations with rxSummary & rxCube
• rxCube() is similar to rxSummary(), but it returns fewer statistical summaries and therefore runs faster.
• With y ~ u : v as the formula, rxCube() returns counts and averages for column y.
91. rxHistogram
• rxHistogram() is used to create a histogram, for example for the Close variable.
• Syntax: rxHistogram(formula, data, …)
• formula = a formula that contains the variable you want to visualize.
92. rxLinePlot
• A line or scatter plot uses data from an .xdf file or data frame.
• Syntax: rxLinePlot(formula, data, …)
• formula = for this function, the formula should have one variable on the left side of the ~ that reflects the Y-axis, and one on the right side for the X-axis.
93. rxDataStep
• The rxDataStep function can be used to process data in chunks.
• rxDataStep can be used to create and transform subsets of data.
95. Subset rows of data using the transform argument
• A common use of rxDataStep is to create a new data set with a subset of rows and variables.
• For this purpose, we use the data frame of our data as the input data set.
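The chunked subset-and-transform pattern behind rxDataStep can be sketched in Python. The in-memory chunks and column names here are invented stand-ins for an XDF file:

```python
# Toy rxDataStep: stream over the data one chunk at a time, keep a
# subset of rows and variables, and apply an on-the-fly transform,
# without loading the whole data set at once.
def data_step(chunks, keep_vars, row_filter, transform):
    out = []
    for chunk in chunks:                       # one chunk at a time
        for row in chunk:
            if row_filter(row):
                new_row = {v: row[v] for v in keep_vars}
                out.append(transform(new_row))
    return out

chunks = [
    [{"close": 10.0, "volume": 100}, {"close": 12.0, "volume": 5}],
    [{"close": 9.0, "volume": 50}],
]
result = data_step(
    chunks,
    keep_vars=["close"],
    row_filter=lambda r: r["volume"] >= 50,                 # subset of rows
    transform=lambda r: {**r, "close_x2": r["close"] * 2},  # new variable
)
print(result)
# [{'close': 10.0, 'close_x2': 20.0}, {'close': 9.0, 'close_x2': 18.0}]
```

Processing chunk by chunk is what lets rxDataStep handle data sets larger than memory.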
96. On-the-fly transformation
• Analytical functions within the RevoScaleR package use a formal transformation-function framework for generating on-the-fly variables.
• The RevoScaleR approach is to use the transforms argument.
97. In-data transformation
• There are two main approaches to in-data transformation:
• Define an external R function and reference it.
• Define an embedded transformation as an input to the transforms argument of another function.
98. Generate a data frame
• A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
99. Generate a data frame
• Code:
  SalesData <- file.path("D:/773Demo", "CustomerSalesInfo.xdf")
  SalesDataFrame <- rxImport(inData = SalesData)
100. POSIXct & POSIXlt
• R provides several options for dealing with date and date/time data. The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
101. Transform functions
▪ transform() is a generic function which does useful things with data frames.
▪ Embedded transformations provide instructions within a formula, through arguments on a function.
▪ Using just arguments, you can manipulate data using transformations.
102. Summary
Improve the performance of your ML scripts by using:
– SQL compute context from the client (rx* functions)
– Streaming to reduce memory usage
– Trivial parallelism for scoring (predict or rxPredict)
– Parallel training and scoring using rx* functions
– The native PREDICT function for low-latency scoring
103. Call to action
• Resources
– SQL Server samples on GitHub: R Services & ML Services
– Getting started tutorials: AKA.MS/MLSQLDEV
– Configure instance: SSMS Reports for ML Services
– ML cheat sheet
– Microsoft documentation: SQL Server Machine