Journey through programming languages such as R and Python that can be used for Machine Learning. Next, explore Azure Machine Learning Studio to see how the pieces interconnect.
For more information about Microsoft Azure, call (813) 265-3239 or visit www.ccganalytics.com/solutions
2. Our goal today is to provide an overview of Machine Learning and some of
the most common tools that can be used to apply it.
Machine Learning Introduction
Python: A General Purpose Language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
3. Machine Learning has tons of useful applications you already encounter or
hear about every day:
Analyzing Images
Understanding Language
Forming & Executing Strategy
Personalized Recommendations
Autonomous Decisions
Predicting Asset Values
4. Machine Learning falls within the realm of predictive analytics.
Descriptive Analytics
Provides insights into existing data using:
• Raw data points
• Summaries of data
• Calculations across existing data fields
• KPIs
The data reported are historical or current facts.
Predictive Analytics
Generates new data, including:
• Predicted future values
• Best guesses of missing values
• Suggested next steps
• Categorizations
The data generated are not facts and contain some uncertainty, but can
provide valuable direction for moving the business forward.
5. Predictive modeling is historically very manual, but Machine Learning is
changing the way these models are built.
Technical advances in the past several years have enabled machines to conduct rapid trial and error with massive amounts of
data, resulting in better models that are achieved more quickly.
Machine Learning
• Machine learning also assigns weights to input data to
answer a problem
• Algorithms are defined to find solutions via rapid
iterations, where slight improvements are made with
each new piece of information
• Machines can consider thousands of variables to
design the optimal model
• Models can be tested in seconds or minutes
• Models can be designed to self-adapt in real time to
changes in the environment
• Advanced Machine Learning techniques can solve
complex problems like interpreting visual or audio
input using unstructured data
Traditional Modeling
• Traditional modeling involves selecting “important” input data and
assigning weights to build models
• Simplistic models are often solved mathematically, but
improvements require guesswork
• Human experts must select a reasonable amount of input data for
hypothesis testing when designing their models
• Humans may only be able to evaluate a handful of models over the
course of weeks or months
• Models are reviewed periodically (often annually or even less
frequently) for adjustments
• Traditional modeling techniques require structured data and are
too simplistic to emulate human intuition
6. Machine Learning is mostly thought of in two flavors.
Supervised Learning
We know the “right answers” for some of the scenarios.
– We may have history we can look back on
– We may be hoping to replicate human decision making
Unsupervised Learning
There aren’t necessarily “right answers”; we just want to get a better
understanding of our data.
7. Supervised or Unsupervised?
Predict our profits next quarter. Supervised
Identify the number written on a check. Supervised
Group our customers into segments. Unsupervised
Predict a user’s rating for a given product. Supervised
Find the most important variables in a dataset. Unsupervised
Identify credit card transactions that are out of the ordinary. Unsupervised
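The split above can be seen directly in scikit-learn's API (an assumption here; the four-point toy dataset is made up): a supervised model is fit on features *and* labels, while an unsupervised model is fit on features alone and discovers groupings itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: two numeric features per customer.
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.5], [9.0, 7.5]])
y = np.array([0, 0, 1, 1])  # known "right answers" -> supervised

clf = LogisticRegression().fit(X, y)                      # supervised: learns from labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: no labels given

print(clf.predict([[1.2, 1.9]]))  # predicted label for a new point
print(km.labels_)                 # groupings the algorithm discovered on its own
```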
8. In Machine Learning, we expose large quantities of data to a computer so it
can find patterns iteratively:
Create a hypothesis – Collect the data and randomly create initial decision rules.
Evaluate the hypothesis – Design a method for evaluating how good your hypothesis is, and test whether it applies generally.
Adjust the hypothesis – Update your hypothesis in a way that marginally improves the performance of your decision rules.
Repeat until convergence – Continue this process until either you are satisfied with the results or the hypothesis can’t improve any more with the data available.
To illustrate the process, we’ll discuss a general case called Gradient Descent.
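The create/evaluate/adjust loop can be sketched in a few lines of Python. This is a minimal illustration with made-up data, fitting a single weight w so that y is roughly w times x:

```python
# Minimal gradient descent sketch. The loop mirrors the slide:
# create a hypothesis, evaluate it, adjust it, repeat until convergence.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, with a little noise

w = 0.0                      # initial hypothesis (an arbitrary starting guess)
learning_rate = 0.01

for step in range(1000):
    # Evaluate: mean squared error of the current hypothesis
    error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Adjust: move w slightly against the gradient of the error
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient
    if abs(gradient) < 1e-6:  # converged: no meaningful improvement left
        break

print(round(w, 2))  # close to 2.0
```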
9. Another possible issue is “overfitting”, which is when the model fits its
training data but doesn’t generalize well to new data.
How can we know if a model is overfitted?
Hold out some of the data from training. Once the model has trained, evaluate the model
using the held out “test” data and see if it performs as well as it does for the “training” data.
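A hold-out check like the one described above might look as follows (a sketch assuming scikit-learn; the synthetic data are made up). A test score far below the training score is the signature of overfitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic noisy linear data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# Hold out 30% of the data as a "test" set the model never trains on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compare performance on training vs. held-out data
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
```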
How can we prevent overfitting?
10. There are two primary ways to prevent overfitting.
Collect more data
Collecting more data makes it harder for the model to perfectly fit the
training data, so it has to do a better job of generalizing while it is training.
Restrict the model
Penalizing the use of each additional feature, or manually removing features
from the dataset, limits the model’s ability to fit the data exactly.
Alternatively, some algorithms are naturally less prone to overfitting, so it
may be best to try a different algorithm.
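The “penalize each additional feature” idea is what regularization does. A minimal sketch using scikit-learn's Ridge regression (an assumption for illustration; the data are synthetic), which shrinks coefficients rather than letting the model fit the training data exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 15))          # few rows, many features: easy to overfit
y = X[:, 0] + rng.normal(0, 0.1, 20)   # only the first feature actually matters

plain = LinearRegression().fit(X, y)   # unrestricted model
ridge = Ridge(alpha=10.0).fit(X, y)    # alpha controls the penalty strength

# The penalized model keeps its coefficients smaller overall,
# damping the noise fitted into the irrelevant features.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(plain.coef_))  # True
```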
11. Now that we have a better understanding of how machine learning
operates, we’ll dive into some of the most popular tools in the space.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
12. What is Python?
General Purpose Interpreted Programming Language
– Currently the 4th most used programming language
Used in a huge range of applications
– Powering Instagram & Reddit
– Testing Intel microchips
– Video game development
– Data Science
Open source with thousands of libraries for a plethora of
subject areas
Multiple versions (2.7 / 3.6) that are not fully backwards
compatible
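The version split is mostly small syntax differences. A quick sketch of two of the best-known breaks, run under Python 3:

```python
# print is a function in Python 3 (Python 2 allowed the statement form:  print "hello")
print("hello")

# / is true division in Python 3; Python 2 floored integer division
print(7 / 2)    # 3.5 in Python 3; Python 2 gave 3
print(7 // 2)   # 3 -- explicit floor division works the same in both versions
```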
13. Pros & Cons
Pros
Easy to read / learn
Dynamic typing
Everything is an object
Small footprint / good memory management
Open source
Parallelism / concurrency
Existing Python backends
Cons
Speed can be an issue
– Some implementations have been made to run faster (PyPy)
Restrictions around white-space
Production errors from dynamic typing
Underdeveloped database access layers
Simplicity causes difficulty when transitioning to other languages
Challenger in the data science arena
14. When to Use Python
Reading data
Data cleansing / preparation / munging
Creating algorithms
Production solutions
Internet of things applications
Big data solutions
16. Package Managers: conda vs. pip
conda
Installed with the Anaconda distribution of Python
Uses packages included in the Anaconda distribution
Packages geared more specifically toward data science
Can also manage non-Python dependencies
Can also manage virtual environments
pip
Installed with Python 3
Uses packages in PyPI (the Python Package Index)
More diverse packages
Can only manage Python dependencies
Needs a separate tool to use virtual environments
19. Next we’ll review R, which is a more specialized tool that is gaining
popularity.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
20. What is R?
R is a language and environment used mostly for
statistical computation and graphics.
Includes a large array of data manipulation and
data analysis functions.
R is open source, meaning that any user can
write their own packages.
R (language)
RStudio (IDE)
22. Pros & Cons
Pros
Free
Open source
Highly customizable models and graphics
Usable within Azure ML
Online community
Popularity in education (adoption)
Cons
Heavily code-oriented (command-line interface)
Steep learning curve
Poor memory management
Single-threaded
One-based indexing
23. When to Use R
Statistical computation
Data Profiling
Data wrangling/manipulation
Descriptive analytics
Predictive models
24. Microsoft leverages the Azure stack to make Machine Learning more
accessible.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning Studio: A graphical approach
Wrap Up
25. What is Azure Machine Learning Studio?
Streamlines the process to reduce the time between
design and deployment
Takes advantage of the accessibility and computing
power of the cloud
Empowers people to create and deploy predictive
analytics solutions
29. Things to Keep in Mind w/ SQLite
• Dynamic typing of variables means implicit type conversions
• LEFT OUTER JOIN works, but RIGHT OUTER JOIN and FULL OUTER JOIN are not supported
• Partial ALTER TABLE support
• Read-only views
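These quirks can be reproduced locally with Python's built-in sqlite3 module (a sketch; the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dynamic typing: '+' coerces strings to numbers instead of concatenating,
# silently yielding 0 rather than raising an error. Use '||' to concatenate.
plus = cur.execute("SELECT 'abc' + 'def'").fetchone()[0]
concat = cur.execute("SELECT 'abc' || 'def'").fetchone()[0]
print(plus)    # 0
print(concat)  # abcdef

# Partial ALTER TABLE: ADD COLUMN works; DROP COLUMN historically did not.
cur.execute("CREATE TABLE orders (name TEXT, qty TEXT)")
cur.execute("ALTER TABLE orders ADD COLUMN order_date TEXT")
con.close()
```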
30. When to Use R/Python Scripts in Azure ML
Developing a new Model
Visualizations
Reading Data From Other Sources
Custom Math Operations
Splitting Columns
Advanced Time Series Algorithms
31. Inputs & Outputs for R/Python Scripts
Inputs:
Input cannot be direct from CSV, must be a dataset
Can have up to 2 input datasets + 1 zip file
Outputs:
Output must be a data.frame
Can have only 1 output dataset
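As a sketch of what such a script looks like: the Execute Python Script module expects a function named azureml_main that receives up to two pandas DataFrames and returns a tuple containing one DataFrame. The column names and transformation below are made up for illustration:

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Example transformation: add a derived column to the first input dataset.
    dataframe1 = dataframe1.copy()
    dataframe1["total"] = dataframe1["qty"] * dataframe1["price"]
    return dataframe1,   # output must be a single dataset (DataFrame)

# Local smoke test with a stand-in dataset:
df = pd.DataFrame({"qty": [2, 3], "price": [1.5, 4.0]})
out = azureml_main(df)[0]
print(out["total"].tolist())  # [3.0, 12.0]
```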
32. R/Python Packages in Azure ML
Distributions in Azure ML:
Python: Anaconda 4.0
R: Microsoft R Open
Default Python imports:
math, numbers, datetime, re
pandas as pd, numpy as np, scipy as sp
Package requirements:
Cannot have a Java dependency
Cannot require internet/networking
33. Data Types
Python      | R              | Azure ML
Integer     | Integer        | Integer
Float       | Double         | Double
Complex     | Complex        | Complex*
Boolean     | Logical        | Boolean
String      | Character      | String
TimeDelta   | DiffTime       | TimeSpan
-           | Factor         | Categorical
Data.Frame  | Data.Frame     | Dataset
DateTime    | POSIXct Vector | DateTime
List        | List           | -
*Azure ML recognizes Complex numbers; however, certain modules do not.
34. Connectivity
Import:
Web URL
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
SQL Server on Azure VM
On-Premises SQL Server Database
– Data Management Gateway
Data Feed Provider (OData)
Azure Cosmos DB
Export:
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
35. AML Studio Tiers and Pricing
Free
Azure subscription not required
100-module max per experiment
1-hour runtime max per experiment
Cannot access on-premises SQL Server
10 GB storage space
Single-node performance
Does not include production web API
Does not include SLA
Standard
$9.99 per seat per month
$1 per studio experimentation hour
Azure subscription required
No cap on module numbers
7 days per experiment w/ 24 hours per module
Can access on-premises SQL (preview)
Storage space is unlimited (BYO)
Multiple nodes
Includes production web API
Includes SLA
36. Consuming Via RESTful API
Request/Response or Batch Request
Connects to any programming language
that supports HTTP request response
Excel Online/Excel add-in
Web App Templates
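A sketch of a Request/Response call from Python using only the standard library. The URL, API key, and column names below are placeholders (a deployed service's API help page provides the real ones); the "Inputs"/"GlobalParameters" envelope follows the request/response pattern that page documents:

```python
import json
import urllib.request

url = "https://<region>.services.azureml.net/workspaces/<id>/services/<id>/execute"  # placeholder
api_key = "YOUR_API_KEY"  # placeholder

# JSON envelope: named input datasets plus optional global parameters
body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["qty", "price"],
            "Values": [["2", "1.5"], ["3", "4.0"]],
        }
    },
    "GlobalParameters": {},
}

req = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    },
)
# response = urllib.request.urlopen(req)  # would return scored results as JSON
```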
37. Web Tiers and Pricing
DEV
$0/month
1,000 transactions/month
2 compute hours/month
2 web services
No overage available
Standard S1
$100.13/month
100,000 transactions/month
25 compute hours/month
10 web services
Overage: $0.50 per 1,000 transactions; $2 per API compute hour
Standard S2
$1,000.06/month
2,000,000 transactions/month
500 compute hours/month
100 web services
Overage: $0.25 per 1,000 transactions; $1.50 per API compute hour
Standard S3
$9,999.98/month
50,000,000 transactions/month
12,500 compute hours/month
500 web services
Overage: $0.10 per 1,000 transactions; $1 per API compute hour
38. Pros & Cons
Pros
Free base level with many paid tiers
“Drag and drop” graphical user interface
Runs in the cloud
Supports custom R and Python code
Easy to create a web service
Easy to retrain
Cons
Closed source
Can have performance/speed issues
Fewer built-in algorithms
Limited customization (OOTB/without coding)
Poor model visibility
39. To review, we’ll compare the tools we’ve discussed and when each is
appropriate for a machine learning problem.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
40. Pros/Cons (● = strong, ◐ = partial, ○ = weak)
Python R Azure ML
Free ● ● ◐
Open Source ● ● ○
Ability to customize ● ◐ ○
Ease of use ◐ ◐ ●
Online community (documentation) ● ● ◐
Popularity ● ● ◐
Graphical user interface ○ ◐ ●
Ability to create graphics ◐ ● ○
Ease to deploy model ◐ ○ ●
Ability to connect to database ◐ ◐ ●
Memory Management ● ◐ ●
41. When to Use
Python R Azure ML
Consuming Big Data
Basic Descriptive Analysis
Data Profiling
Data Visualization
Data Wrangling
Production
Easy Web Service
Model Development
Python was developed by Dutch programmer Guido van Rossum and released in 1991.
He was in a big Monty Python phase at the time, so he named it Python.
General Purpose: can be used for almost anything
Interpreted: is not compiled beforehand, but interpreted as it goes
Open source: source code is available to the public for free, new packages everyday from around the world
Multiple Versions: not fully backwards compatible, some syntax differences. Not a drastically new language, but Python 2 code may not run on Python 3 environment
Easy to read/learn: focused on simplicity, syntax tries to mimic natural language, python motto
Dynamic typing: don’t need to specify variable types, defined based on the data assigned to it
Everything is an object: great for object oriented programming
Small/good memory management: doesn’t take up much space or use much memory
Open source: tons of custom packages and tons of resources online
Parallelism/concurrency: able to perform more than one task at a time
Existing Python backends: if being implemented into an existing system, many environments already have Python backends
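The dynamic typing mentioned in the pros above can be seen in a couple of lines: the same name can hold different types, inferred from whatever is assigned.

```python
x = 42
print(type(x).__name__)   # int
x = "forty-two"
print(type(x).__name__)   # str
x = [4, 2]
print(type(x).__name__)   # list
```

This flexibility is also the source of the con noted below: a variable can silently assume the wrong type and cause errors in production.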
Speed: it’s an interpreted language, so it can be a tad slow, faster implementations (open source) such as PyPy
White-space: the fact that python doesn’t use parenthesis as much as indentation can limit customization
Dynamic typing errors: variables can assume the wrong type and cause errors when pushed to production
Underdeveloped DB access: there are packages for it, but still a lot of room for improvement
Switching between languages: Python’s simplicity can make it more difficult to understand other languages
Challenger in data science: still catching up with popularity and data science functionalities
- Reading data: for when CSVs are too large to be handled effectively in Excel, or in other forms such as audio or pictures
- Data cleansing/preparation/munging: handling null values, transforming and assigning logic around features, to make data more usable
- Creating algorithms: has many algorithms available through packages, but makes it very easy to create your own
- Production solutions: great for models being introduced into existing systems, especially on-prem
- Internet of things applications: tons of new “smart” devices these days, and python has become a popular tool for handling their operation
- Big data solutions: some may have heard of Apache Spark, a big data cluster platform for processing huge volumes and streams of data; it has the built-in ability to run PySpark, its implementation of Python
Module: a collection of functions and definitions stored in a python script file
Package: a group of modules and/or other packages
Packages are then imported into Python scripts, allowing them to use functions defined in the package
Package manager: used for easy installation as well as managing dependencies
Pip is the default package manager that is installed with Python 3
Conda is another package manager that is installed with the Anaconda distribution of python
Pip installs packages from the Python Package Index, the official and largest repository
Conda installs packages included with the Anaconda distribution, with packages focusing on data science
Pip can only manage python dependencies
Conda can also handle dependencies from other languages such as Java or R
Virtual environment: sometimes required to separate certain packages based on conflicting dependencies or to minimize memory usage
With pip, you need an additional tool to use virtual environments
Conda has this capability built in
NumPy: defines the numpy array object, develops arithmetic around these arrays, as well as other numerical analysis methods
Pandas: most popular package for data science, built on top of numpy, defines the pandas data frame object and operations for manipulating the data within it
Matplotlib: the most popular package for visualizations, huge variety of capabilities while being relatively easy to use
SciPy: many functions around linear algebra and optimization, which are the cornerstone of almost all machine learning, would be good to use if creating custom algorithms
Scikit-Learn: almost any machine learning algorithm you could need can be found in this package, as well as many functions for customizing and evaluating your model or splitting your data in train and test data
TensorFlow: a package that is getting very popular these days; developed by Google, it implements deep learning algorithms, a specific and complex type of machine learning, popular for image and video analytics as well as many other things
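A quick taste of how the stack described above fits together (the column names and numbers here are made up for illustration): pandas holds the data, and scikit-learn fits a model in a few lines.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: house size vs. price (in thousands)
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000],
    "price": [160, 240, 300, 400],
})

model = LinearRegression()
model.fit(df[["sqft"]], df["price"])   # learn price as a function of sqft

# Predict the price of a 1000 sqft house (data are exactly linear: price = 0.2 * sqft)
prediction = float(model.predict(pd.DataFrame({"sqft": [1000]}))[0])
print(round(prediction))  # 200
```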
Rather than focus heavily on coding, AML lets you build models by dragging and dropping modules; we'll walk through how to build a simple regression model in our workshop.
Since I introduced you to the Select Columns module, I wanted to show you what else is available. This is not a comprehensive list. AML has modules for importing/exporting, modules for different models, and transforms. One of the best ways to learn AML is to read through the modules and see what's available in the toolkit. Here are a few standout modules I want to bring to your attention.
As you start working with AML, some things may come across as a bit quirky. A huge one is that you have to work one step at a time. If you have a table of orders with a name, date, and qty, and all the fields are strings, you will need one module to turn date into a datetime and another to turn qty into an integer; you can't do both in one step like you could with, say, a SQL SELECT statement.
While AML is a Microsoft tool, its SQL engine is not T-SQL; it runs SQLite. Certain functions will not work and may not throw back any errors. An example is string concatenation with "+", which will throw back a 0 instead of failing.
Uses dynamic typing of variables. Implicit type conversion is allowed
Left outer join is valid. Right outer and full outer joins are not.
You can rename/add column with alter table but not drop column/alter column/add constraint
Views are read only
Other big modules are the execute-code scripts.
When working with different scripts, it's important to keep in mind the data types used, especially when you are trying to bring in data types from other data sources.
Now moving into connectivity. There is great connectivity with other services in the Azure stack: you can import and export to/from Blob storage, Table storage, and Azure SQL databases.
Hadoop works great as well. On-premises SQL Server is currently a preview feature, meaning it's not 100% fully supported (i.e., prerelease/beta).
One cool thing is the web URL option: you can pull data from HTTP websites and run machine learning on that information.
Pros
Free to try, with multiple levels designed to prevent you from overpaying; you pay for what you need
Easy to use and get started with
Leverages the cloud, so you can work from anywhere
Web services are easy to deploy and manage
Easy to retrain
Cons
Generally a little slower than traditional model training
The next two go hand in hand:
Microsoft develops the premade modules, so there's a limited selection unless you can code
Closed source: you know generally what modules do, but they are a black box
Poor model visibility: you're unable to see training progress for models (7 days max per experiment, with no progress bar)