Journey through programming languages such as R and Python that can be used for Machine Learning. Next, explore Azure Machine Learning Studio to see how the pieces interconnect.
For more information about Microsoft Azure, call (813) 265-3239 or visit www.ccganalytics.com/solutions
2. Our goal today is to provide an overview of Machine Learning and some of
the most common tools that can be used to apply it.
Machine Learning Introduction
Python: A General Purpose Language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
3. Machine Learning has tons of useful applications you already encounter or
hear about every day:
Analyzing Images
Understanding Language
Forming & Executing Strategy
Personalized Recommendations
Autonomous Decisions
Predicting Asset Values
4. Machine Learning falls within the realm of predictive analytics.
Descriptive Analytics
Provides insights into existing data using:
• Raw data points
• Summaries of data
• Calculations across existing data fields
• KPIs
The data reported are historical or current facts.
Predictive Analytics
Generates new data, including:
• Predicted future values
• Best guesses of missing values
• Suggested next steps
• Categorizations
The data generated are not facts and contain some uncertainty, but can
provide valuable direction for moving the business forward.
5. Predictive modeling is historically very manual, but Machine Learning is
changing the way these models are built.
Technical advances in the past several years have enabled machines to conduct rapid trial and error with massive amounts of
data, resulting in better models that are achieved more quickly.
Machine Learning
• Machine learning also assigns weights to input data to
answer a problem
• Algorithms are defined to find solutions via rapid
iterations, where slight improvements are made with
each new piece of information
• Machines can consider thousands of variables to
design the optimal model
• Models can be tested in seconds or minutes
• Models can be designed to self-adapt in real time to
changes in the environment
• Advanced Machine Learning techniques can solve
complex problems like interpreting visual or audio
input using unstructured data
Traditional Modeling
• Traditional modeling involves selecting “important” input data and
assigning weights to build models
• Simplistic models are often solved mathematically, but
improvements require guesswork
• Human experts must select a reasonable amount of input data for
hypothesis testing when designing their models
• Humans may only be able to evaluate a handful of models over the
course of weeks or months
• Models are reviewed periodically (often annually or even less
frequently) for adjustments
• Traditional modeling techniques require structured data and are
too simplistic to emulate human intuition
6. Machine Learning is mostly thought of in two flavors.
Supervised Learning
We know the “right answers” for some of the scenarios.
– We may have history we can look back on
– We may be hoping to replicate human decision making
Unsupervised Learning
There aren’t necessarily “right answers”; we just want to get a better
understanding of our data.
7. Supervised or Unsupervised?
Predict our profits next quarter. Supervised
Identify the number written on a check. Supervised
Group our customers into segments. Unsupervised
Predict a user’s rating for a given product. Supervised
Find the most important variables in a dataset. Unsupervised
Identify credit card transactions that are out of the ordinary. Unsupervised
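The split above can be seen directly in scikit-learn's API (an assumption here; the four-point toy dataset is made up): a supervised model is fit on features *and* labels, while an unsupervised model is fit on features alone and discovers groupings itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: two numeric features per customer.
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.5], [9.0, 7.5]])
y = np.array([0, 0, 1, 1])  # known "right answers" -> supervised

clf = LogisticRegression().fit(X, y)                      # supervised: learns from labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: no labels given

print(clf.predict([[1.2, 1.9]]))  # predicted label for a new point
print(km.labels_)                 # groupings the algorithm discovered on its own
```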
8. In Machine Learning, we expose large quantities of data to a computer so it
can find patterns iteratively:
Create a hypothesis – Collect the data and randomly create initial decision rules.
Evaluate the hypothesis – Design a method for evaluating how good your hypothesis is, and test whether it applies generally.
Adjust the hypothesis – Update your hypothesis in a way that marginally improves the performance of your decision rules.
Repeat until convergence – Continue this process until either you are satisfied with the results or the hypothesis can’t improve any more with the data available.
To illustrate the process, we’ll discuss a general case called Gradient Descent.
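The create/evaluate/adjust loop can be sketched in a few lines of Python. This is a minimal illustration with made-up data, fitting a single weight w so that y is roughly w times x:

```python
# Minimal gradient descent sketch. The loop mirrors the slide:
# create a hypothesis, evaluate it, adjust it, repeat until convergence.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, with a little noise

w = 0.0                      # initial hypothesis (an arbitrary starting guess)
learning_rate = 0.01

for step in range(1000):
    # Evaluate: mean squared error of the current hypothesis
    error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Adjust: move w slightly against the gradient of the error
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient
    if abs(gradient) < 1e-6:  # converged: no meaningful improvement left
        break

print(round(w, 2))  # close to 2.0
```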
9. Another possible issue is “overfitting”, which is when the model fits its
training data but doesn’t generalize well to new data.
How can we know if a model is overfitted?
Hold out some of the data from training. Once the model has trained, evaluate the model
using the held out “test” data and see if it performs as well as it does for the “training” data.
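A hold-out check like the one described above might look as follows (a sketch assuming scikit-learn; the synthetic data are made up). A test score far below the training score is the signature of overfitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic noisy linear data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

# Hold out 30% of the data as a "test" set the model never trains on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compare performance on training vs. held-out data
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
```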
How can we prevent overfitting?
10. There are two primary ways to prevent overfitting.
Collect more data
Collecting more data makes it harder for the model to perfectly fit the
training data, so it has to do a better job of generalizing while it is training.
Restrict the model
Penalizing the use of each additional feature, or manually removing features
from the dataset, limits the model’s ability to fit the data exactly.
Alternatively, some algorithms are naturally less prone to overfitting, so it
may be best to try a different algorithm.
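The “penalize each additional feature” idea is what regularization does. A minimal sketch using scikit-learn's Ridge regression (an assumption for illustration; the data are synthetic), which shrinks coefficients rather than letting the model fit the training data exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 15))          # few rows, many features: easy to overfit
y = X[:, 0] + rng.normal(0, 0.1, 20)   # only the first feature actually matters

plain = LinearRegression().fit(X, y)   # unrestricted model
ridge = Ridge(alpha=10.0).fit(X, y)    # alpha controls the penalty strength

# The penalized model keeps its coefficients smaller overall,
# damping the noise fitted into the irrelevant features.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(plain.coef_))  # True
```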
11. Now that we have a better understanding of how machine learning
operates, we’ll dive into some of the most popular tools in the space.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
12. What is Python?
General Purpose Interpreted Programming Language
– Currently the 4th most used programming language
Used in a huge range of applications
– Powering Instagram & Reddit
– Testing Intel microchips
– Video game development
– Data Science
Open source with thousands of libraries for a plethora of
subject areas
Multiple versions (2.7 / 3.6) that are not fully backwards
compatible
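The version split is mostly small syntax differences. A quick sketch of two of the best-known breaks, run under Python 3:

```python
# print is a function in Python 3 (Python 2 allowed the statement form:  print "hello")
print("hello")

# / is true division in Python 3; Python 2 floored integer division
print(7 / 2)    # 3.5 in Python 3; Python 2 gave 3
print(7 // 2)   # 3 -- explicit floor division works the same in both versions
```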
13. Pros & Cons
Pros
Easy to read / learn
Dynamic typing
Everything is an object
Small footprint / good memory management
Open source
Parallelism / concurrency
Existing Python backends
Cons
Speed can be an issue
– Some implementations have been made to run faster (PyPy)
Restrictions around white-space
Production errors from dynamic typing
Underdeveloped database access layers
Simplicity causes difficulty when transitioning to other languages
Challenger in the data science arena
14. When to Use Python
Reading data
Data cleansing / preparation / munging
Creating algorithms
Production solutions
Internet of things applications
Big data solutions
16. Package Managers: conda vs. pip
conda
Installed with the Anaconda distribution of Python
Uses packages included in the Anaconda distribution
Packages geared more specifically toward data science
Can also manage non-Python dependencies
Can also manage virtual environments
pip
Installed with Python 3
Uses packages in PyPI (the Python Package Index)
More diverse packages
Can only manage Python dependencies
Needs a separate tool to use virtual environments
19. Next we’ll review R, which is a more specialized tool that is gaining
popularity.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
20. What is R?
R is a language and environment used mostly for
statistical computation and graphics.
Includes a large array of data manipulation and
data analysis functions.
R is open source, meaning that any user can
write their own packages.
R (language)
RStudio (IDE)
22. Pros & Cons
Pros
Free
Open source
Highly customizable models and graphics
Usable within Azure ML
Online community
Popularity in education (adoption)
Cons
Heavily code-oriented (command-line interface)
Steep learning curve
Poor memory management
Single-threaded
One-based indexing
23. When to Use R
Statistical computation
Data Profiling
Data wrangling/manipulation
Descriptive analytics
Predictive models
24. Microsoft leverages the Azure stack to make Machine Learning more
accessible.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning Studio: A graphical approach
Wrap Up
25. What is Azure Machine Learning Studio?
Streamlines the process to reduce the time between
design and deployment
Takes advantage of the accessibility and computing
power of the cloud
Empowers people to create and deploy predictive
analytics solutions
29. Things to Keep in Mind w/ SQLite
• Dynamic typing of variables means implicit type conversions
• LEFT OUTER JOIN works, but RIGHT OUTER JOIN and FULL OUTER JOIN are not supported
• Partial ALTER TABLE support
• Read-only views
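These quirks can be reproduced locally with Python's built-in sqlite3 module (a sketch; the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dynamic typing: '+' coerces strings to numbers instead of concatenating,
# silently yielding 0 rather than raising an error. Use '||' to concatenate.
plus = cur.execute("SELECT 'abc' + 'def'").fetchone()[0]
concat = cur.execute("SELECT 'abc' || 'def'").fetchone()[0]
print(plus)    # 0
print(concat)  # abcdef

# Partial ALTER TABLE: ADD COLUMN works; DROP COLUMN historically did not.
cur.execute("CREATE TABLE orders (name TEXT, qty TEXT)")
cur.execute("ALTER TABLE orders ADD COLUMN order_date TEXT")
con.close()
```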
30. When to Use R/Python Scripts in Azure ML
Developing a new Model
Visualizations
Reading Data From Other Sources
Custom Math Operations
Splitting Columns
Advanced Time Series Algorithms
31. Inputs & Outputs for R/Python Scripts
Inputs:
Input cannot be direct from CSV, must be a dataset
Can have up to 2 input datasets + 1 zip file
Outputs:
Output must be a data.frame
Can have only 1 output dataset
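As a sketch of what such a script looks like: the Execute Python Script module expects a function named azureml_main that receives up to two pandas DataFrames and returns a tuple containing one DataFrame. The column names and transformation below are made up for illustration:

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Example transformation: add a derived column to the first input dataset.
    dataframe1 = dataframe1.copy()
    dataframe1["total"] = dataframe1["qty"] * dataframe1["price"]
    return dataframe1,   # output must be a single dataset (DataFrame)

# Local smoke test with a stand-in dataset:
df = pd.DataFrame({"qty": [2, 3], "price": [1.5, 4.0]})
out = azureml_main(df)[0]
print(out["total"].tolist())  # [3.0, 12.0]
```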
32. R/Python Packages in Azure ML
Distributions in Azure ML:
Python: Anaconda 4.0
R: Microsoft R Open
Default Python imports:
math, numbers, datetime, re
pandas as pd, numpy as np, scipy as sp
Package requirements:
Cannot have a Java dependency
Cannot require internet/networking
33. Data Types
Python      | R              | Azure ML
Integer     | Integer        | Integer
Float       | Double         | Double
Complex     | Complex        | Complex*
Boolean     | Logical        | Boolean
String      | Character      | String
TimeDelta   | DiffTime       | TimeSpan
-           | Factor         | Categorical
Data.Frame  | Data.Frame     | Dataset
DateTime    | POSIXct Vector | DateTime
List        | List           | -
*Azure ML recognizes Complex numbers; however, certain modules do not.
34. Connectivity
Import:
Web URL
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
SQL Server on Azure VM
On-Premises SQL Server Database
– Data Management Gateway
Data Feed Provider (OData)
Azure Cosmos DB
Export:
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
35. AML Studio Tiers and Pricing
Free
Azure subscription not required
100-module max per experiment
1-hour runtime max per experiment
Cannot access on-premises SQL Server
10 GB storage space
Single-node performance
Does not include production web API
Does not include SLA
Standard
$9.99 per seat per month
$1 per studio experimentation hour
Azure subscription required
No cap on module numbers
7 days per experiment w/ 24 hours per module
Can access on-premises SQL (preview)
Storage space is unlimited (BYO)
Multiple nodes
Includes production web API
Includes SLA
36. Consuming Via RESTful API
Request/Response or Batch Request
Connects to any programming language
that supports HTTP request response
Excel Online/Excel add-in
Web App Templates
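A sketch of a Request/Response call from Python using only the standard library. The URL, API key, and column names below are placeholders (a deployed service's API help page provides the real ones); the "Inputs"/"GlobalParameters" envelope follows the request/response pattern that page documents:

```python
import json
import urllib.request

url = "https://<region>.services.azureml.net/workspaces/<id>/services/<id>/execute"  # placeholder
api_key = "YOUR_API_KEY"  # placeholder

# JSON envelope: named input datasets plus optional global parameters
body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["qty", "price"],
            "Values": [["2", "1.5"], ["3", "4.0"]],
        }
    },
    "GlobalParameters": {},
}

req = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    },
)
# response = urllib.request.urlopen(req)  # would return scored results as JSON
```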
37. Web Tiers and Pricing
DEV
$0/month
1,000 transactions/month
2 compute hours/month
2 web services
No overage available
Standard S1
$100.13/month
100,000 transactions/month
25 compute hours/month
10 web services
Overage: $0.50 per 1,000 transactions; $2 per API compute hour
Standard S2
$1,000.06/month
2,000,000 transactions/month
500 compute hours/month
100 web services
Overage: $0.25 per 1,000 transactions; $1.50 per API compute hour
Standard S3
$9,999.98/month
50,000,000 transactions/month
12,500 compute hours/month
500 web services
Overage: $0.10 per 1,000 transactions; $1 per API compute hour
38. Pros & Cons
Pros
Free base level with many paid tiers
“Drag and drop” graphical user interface
Runs in the cloud
Supports custom R and Python code
Easy to create a web service
Easy to retrain
Cons
Closed source
Can have performance/speed issues
Fewer built-in algorithms
Limited customization (OOTB/without coding)
Poor model visibility
39. To review, we’ll compare the tools we’ve discussed and when each is
appropriate for a machine learning problem.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
40. Pros/Cons (● = strong, ◐ = partial, ○ = weak)
Python R Azure ML
Free ● ● ◐
Open Source ● ● ○
Ability to customize ● ◐ ○
Ease of use ◐ ◐ ●
Online community (documentation) ● ● ◐
Popularity ● ● ◐
Graphical user interface ○ ◐ ●
Ability to create graphics ◐ ● ○
Ease to deploy model ◐ ○ ●
Ability to connect to database ◐ ◐ ●
Memory Management ● ◐ ●
41. When to Use
Python R Azure ML
Consuming Big Data
Basic Descriptive Analysis
Data Profiling
Data Visualization
Data Wrangling
Production
Easy Web Service
Model Development
Python was developed by Dutch programmer Guido van Rossum and released in 1991.
He was in a big Monty Python phase at the time, so he named it Python.
General Purpose: can be used for almost anything
Interpreted: is not compiled beforehand, but interpreted as it goes
Open source: source code is available to the public for free, new packages everyday from around the world
Multiple Versions: not fully backwards compatible, some syntax differences. Not a drastically new language, but Python 2 code may not run on Python 3 environment
Easy to read/learn: focused on simplicity, syntax tries to mimic natural language, python motto
Dynamic typing: don’t need to specify variable types, defined based on the data assigned to it
Everything is an object: great for object oriented programming
Small/good memory management: doesn’t take up much space or use much memory
Open source: tons of custom packages and tons of resources online
Parallelism/concurrency: able to perform more than one task at a time
Existing Python backends: if being implemented into an existing system, many environments already have Python backends
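The dynamic typing mentioned in the pros above can be seen in a couple of lines: the same name can hold different types, inferred from whatever is assigned.

```python
x = 42
print(type(x).__name__)   # int
x = "forty-two"
print(type(x).__name__)   # str
x = [4, 2]
print(type(x).__name__)   # list
```

This flexibility is also the source of the con noted below: a variable can silently assume the wrong type and cause errors in production.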
Speed: it’s an interpreted language, so it can be a tad slow, faster implementations (open source) such as PyPy
White-space: the fact that python doesn’t use parenthesis as much as indentation can limit customization
Dynamic typing errors: variables can assume the wrong type and cause errors when pushed to production
Underdeveloped DB access: there are packages for it, but still a lot of room for improvement
Switching between languages: Python’s simplicity can make it more difficult to understand other languages
Challenger in data science: still catching up with popularity and data science functionalities
- Reading data: for when CSVs are too large to be handled effectively in Excel, or in other forms such as audio or pictures
- Data cleansing/preparation/munging: handling null values, transforming and assigning logic around features, to make data more usable
- Creating algorithms: has many algorithms available through packages, but makes it very easy to create your own
- Production solutions: great for models being introduced into existing systems, especially on-prem
- Internet of things applications: tons of new “smart” devices these days, and python has become a popular tool for handling their operation
- Big data solutions: some may have heard of Apache Spark, a big data cluster platform for processing huge volumes and streams of data; it has the built-in ability to run PySpark, its implementation of Python
Module: a collection of functions and definitions stored in a python script file
Package: a group of modules and/or other packages
Packages are then imported into Python scripts, allowing them to use functions defined in the package
Package manager: used for easy installation as well as managing dependencies
Pip is the default package manager that is installed with Python 3
Conda is another package manager that is installed with the Anaconda distribution of python
Pip installs packages from the Python Package Index, the official and largest repository
Conda installs packages included with the Anaconda distribution, with packages focusing on data science
Pip can only manage python dependencies
Conda can also handle dependencies from other languages such as Java or R
Virtual environment: sometimes required to separate certain packages based on conflicting dependencies or to minimize memory usage
With pip, you need an additional tool to use virtual environments
Conda has this capability built in
NumPy: defines the numpy array object, develops arithmetic around these arrays, as well as other numerical analysis methods
Pandas: most popular package for data science, built on top of numpy, defines the pandas data frame object and operations for manipulating the data within it
Matplotlib: the most popular package for visualizations, huge variety of capabilities while being relatively easy to use
SciPy: many functions around linear algebra and optimization, which are the cornerstone of almost all machine learning, would be good to use if creating custom algorithms
Scikit-Learn: almost any machine learning algorithm you could need can be found in this package, as well as many functions for customizing and evaluating your model or splitting your data in train and test data
TensorFlow: a package that is getting very popular these days; developed by Google, it implements deep learning algorithms, a specific and complex type of machine learning, popular for image and video analytics as well as many other things
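A quick taste of how the stack described above fits together (the column names and numbers here are made up for illustration): pandas holds the data, and scikit-learn fits a model in a few lines.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: house size vs. price (in thousands)
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000],
    "price": [160, 240, 300, 400],
})

model = LinearRegression()
model.fit(df[["sqft"]], df["price"])   # learn price as a function of sqft

# Predict the price of a 1000 sqft house (data are exactly linear: price = 0.2 * sqft)
prediction = float(model.predict(pd.DataFrame({"sqft": [1000]}))[0])
print(round(prediction))  # 200
```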
Rather than focus heavily on coding, AML lets you build models by dragging and dropping modules; we'll walk through how to build a simple regression model in our workshop.
Since I introduced you to the Select Columns module, I wanted to show you what else is available. This is not a comprehensive list. AML has modules for importing/exporting, modules for different models, and transforms. One of the best ways to learn AML is to read through the modules and see what's available in the toolkit. Here are a few standout modules I want to bring to your attention.
As you start working with AML, some things may come across as a bit quirky. A huge one is that you have to work one step at a time. If you have a table of orders with a name, date, and qty, and all the fields are strings, you will need one module to turn date into a datetime and another to turn qty into an integer; you can't do both in one step like you could with, say, a SQL SELECT statement.
While AML is a Microsoft tool, its SQL engine is not T-SQL; it runs SQLite. Certain functions will not work and may not throw back any errors. An example is string concatenation with "+", which will throw back a 0 instead of failing.
Uses dynamic typing of variables. Implicit type conversion is allowed
Left outer join is valid. Right outer and full outer joins are not.
You can rename/add column with alter table but not drop column/alter column/add constraint
Views are read only
Other big modules are the execute-code scripts.
When working with different scripts, it's important to keep in mind the data types used, especially when you are trying to bring in data types from other data sources.
Now moving into connectivity. There is great connectivity with other services in the Azure stack: you can import and export to/from Blob storage, Table storage, and Azure SQL databases.
Hadoop works great as well. On-premises SQL Server is currently a preview feature, meaning it's not 100% fully supported (i.e., prerelease/beta).
One cool thing is the web URL option: you can pull data from HTTP websites and run machine learning on that information.
Pros
Free to try, with multiple levels designed to prevent you from overpaying; you pay for what you need
Easy to use and get started with
Leverages the cloud, so you can work from anywhere
Web services are easy to deploy and manage
Easy to retrain
Cons
Generally a little slower than traditional model training
The next two go hand in hand:
Microsoft develops the premade modules, so there's a limited selection unless you can code
Closed source: you know generally what modules do, but they are a black box
Poor model visibility: you're unable to see training progress for models (7 days max per experiment, with no progress bar)