Sergii Baidachnyi ITEM 2018

Azure Machine Learning for
Data Scientists
Sergii Baidachnyi
Principal Software Engineer
Microsoft
sbaydach@microsoft.com
@sbaidachni

Azure Machine Learning Studio
Offering
Platform for emerging data scientists to graphically build and deploy experiments
Key Value Props
• Rapid experiment composition
• More than 100 easily configured modules for data prep, training, evaluation
• Extensibility through R & Python
• Serverless training and deployment
Numbers
• Hundreds of thousands of deployed models serving billions of requests

Infrastructure Can Get in Your Way
Clusters
• Provision GPUs
• Install drivers and software
• Interactive use
Scheduling
• Queue work
• Prioritize jobs
• Start MPI
• Monitor
• Handle failures
Data
• Scale access to training data
• Output logs & models
• Secure & compliant
Cost
• Scale up and down
• Share reserved instances
• Low priority
Workflow
• Choose efficient hardware
• Tooling integration
• Laptop to cloud

Azure Batch AI Service
• Managed service
• Supports role-based access control
• Run any toolkit (CNTK, TensorFlow, Caffe/Caffe2, Chainer, Keras, …)
• Run experiments in parallel
• Run in containers or directly on VMs
• Supports various shared file systems
• Load-based automatic scaling
• Pay only for storage and compute; the service itself is free
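
The load-based automatic scaling bullet above can be illustrated with a small calculation. This is a minimal sketch of the general idea only, not the Batch AI API: the function name `target_node_count` and its parameters (`jobs_per_node`, `min_nodes`, `max_nodes`) are hypothetical names chosen for this example.

```python
import math


def target_node_count(pending_jobs, running_jobs, jobs_per_node=1,
                      min_nodes=0, max_nodes=10):
    """Return how many nodes a cluster should have for the current load.

    Hypothetical helper for illustration; the sizing rule is an assumption.
    """
    if jobs_per_node < 1:
        raise ValueError("jobs_per_node must be at least 1")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    # Total work the cluster has to hold right now.
    demand = pending_jobs + running_jobs
    # Nodes needed to run all of that work at once.
    needed = math.ceil(demand / jobs_per_node)
    # Never go below the floor or above the ceiling.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # 7 queued + 2 running jobs, 2 jobs per node
    print(target_node_count(pending_jobs=7, running_jobs=2, jobs_per_node=2))
```

With `min_nodes=0` the cluster can scale down to nothing when the queue is empty, which is what makes the "pay only for storage and compute" model attractive.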

Azure Databricks
Databricks Spark as a managed service on Azure

IaaS and PaaS Big Data Analytics
Spectrum: CONTROL to EASE OF USE (Reduced Administration)
BIG DATA STORAGE
• Azure Data Lake Store
• Azure Storage
BIG DATA ANALYTICS
• IaaS Clusters: Azure Marketplace (HDP | CDH | MapR). Any Hadoop technology, any distribution
• Managed Clusters: Azure HDInsight. Workload optimized, managed clusters
• Big Data as-a-service: Azure Data Lake Analytics. Data Engineering in a Job-as-a-service model
• Azure Databricks: Frictionless & Optimized Spark clusters

Azure Databricks
Microsoft Azure
Data sources
• Cloud storage
• Data warehouses
• Hadoop storage
• IoT / streaming data
Optimized Databricks Runtime Engine
• Databricks I/O
• Serverless
• Apache Spark
Collaborative Workspace
• Data engineer
• Data scientist
• Business analyst
Deploy Production Jobs & Workflows
• Multi-stage pipelines
• Job scheduler
• Notification & logs
Outputs
• REST APIs
• Machine learning models
• BI tools
• Data exports
• Data warehouses
Enhance Productivity | Build on secure & trusted cloud | Scale without limits

Azure Databricks Cluster Architecture
Accounts: Databricks’ Azure Account, User’s Azure Account
Components shown in the diagram:
• Webapp
• Cluster Manager (Azure Compute)
• Jobs
• FileSystem Service
• Spark History Server
• Azure DB for PostgreSQL
• Spark Driver (Azure Compute)
• Spark Worker (Azure Compute), two instances
• Log Daemon, two instances

Azure Databricks Core Artifacts

Azure Machine Learning Experimentation and Management

Azure Machine Learning
The AI Development lifecycle
Sources: Social, LOB, Graph, IoT, Image, CRM
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
• Data orchestration and monitoring
• Data lake and storage
• Hadoop/Spark/SQL and ML
Targets: Apps + insights, IoT

AZURE ML EXPERIMENTATION
Experiment Anywhere
• Local machine
• Scale up to DSVM
• Scale out with Spark on HDInsight
• Azure Batch AI (Coming Soon)
• ML Server (Coming Soon)
• Command line tools
• IDEs
• Notebooks in Workbench
• VS Code Tools for AI

AZURE ML MODEL MANAGEMENT
Deploy Everywhere
DOCKER
• Single node deployment (cloud/on-prem)
• Azure Container Service
• Azure IoT Edge
• Microsoft ML Server
• Spark clusters
• SQL Server (Coming Soon)
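
Container-based deployment as listed on this slide usually means wrapping a trained model in a small scoring script that loads the model once and then answers requests. The following is a minimal sketch of that pattern under stated assumptions: the `init()` / `run()` split, the JSON request shape, and the stand-in linear model are all hypothetical choices for illustration, not the Azure ML Model Management API.

```python
import json

_model = None  # populated by init()


def init():
    """Load the model once at container start (stand-in: fixed weights)."""
    global _model
    # Assumption: a real script would deserialize a trained model here.
    _model = {"weights": [0.5, -0.25], "bias": 1.0}


def run(raw_json):
    """Score one JSON request and return a JSON response string."""
    if _model is None:
        raise RuntimeError("init() must be called before run()")
    features = json.loads(raw_json)["features"]
    expected = len(_model["weights"])
    if len(features) != expected:
        # Report bad input to the caller instead of crashing the service.
        return json.dumps({"error": "expected %d features" % expected})
    score = _model["bias"] + sum(
        w * x for w, x in zip(_model["weights"], features))
    return json.dumps({"score": score})


if __name__ == "__main__":
    init()
    print(run('{"features": [2.0, 4.0]}'))
```

Keeping model loading in `init()` and per-request work in `run()` means the same script can sit behind a web service on a single node or on a cluster without changes to the scoring logic.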

R Server Overview
• Enhances open source R to scale to big data
• Combines open source and commercial innovations
• Gives customers the support they trust
• Microsoft innovations:
• RevoScaleR: parallelized, distributed algorithms
• MicrosoftML: fast and deep learning, pretrained models
• Custom parallel frameworks

ML Services Version 9.2 at a glance
Platforms & Data
• Data Sources: .csv, Microsoft .XDF, ODBC/JDBC
Tools
• Rattle
• Visualization
• Tool Integration
Languages
Algorithms
• Distributed Parallelized Algorithms: RevoScaleR and RevoScalePy libraries, MicrosoftML library, custom parallelization frameworks
• Open source R algorithms & visualizations: CRAN, Bioconductor
• Plus: Deep Learning, Pretrained Models, Prebuilt Featurizers
Operationalization
• mrsdeploy
• RESTful API deployment
• Real-Time Scoring
• In-database deployment

Data Science lifecycle
•Primary stages:
[Figure: data science lifecycle diagram]

TDSP objective
Integrate DevOps with data science workflows to improve collaboration,
quality, robustness and efficiency in data science projects
o Infrastructure as Code (IaC)
o Building
o Testing
o CI / CD
o …
o App performance monitoring
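
The "Testing" and "CI / CD" items above could, for example, mean unit tests for data-preparation code that a build runs on every commit. This is a minimal sketch under that assumption: `fill_missing_with_mean` is a hypothetical data-prep helper, and the `test_` functions follow the plain-assert style that a runner such as pytest collects.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def test_replaces_none_with_mean():
    assert fill_missing_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]


def test_leaves_complete_data_unchanged():
    assert fill_missing_with_mean([4.0, 6.0]) == [4.0, 6.0]


def test_rejects_all_missing():
    # An all-missing column should fail loudly, not produce NaN silently.
    try:
        fill_missing_with_mean([None, None])
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Tests like these are cheap to run in a CI pipeline and catch data-handling regressions before a model is retrained or redeployed.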

TDSP documentation: https://aka.ms/tdsp

Using TDSP within Azure Machine Learning

Questions?
sbaydach@microsoft.com
@sbaidachni
