Afternoons with Azure
Azure Machine Learning and Programming Languages
Our goal today is to provide an overview of Machine Learning and some of
the most common tools that can be used to apply it.
Machine Learning Introduction
Python: A General Purpose Language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
Machine Learning has tons of useful applications you already encounter or
hear about every day.
Analyzing
Images
Understanding
Language
Forming &
Executing Strategy
Personalized
Recommendations
Autonomous
Decisions
Predicting
Asset Values
Machine Learning falls within the realm of predictive analytics.
Provide insights into existing data using:
• Raw data points
• Summaries of data
• Calculations across existing data fields
• KPIs
The data reported are historical or current facts.
Generate new data, including:
• Predicted future values
• Best guesses of missing values
• Suggested next steps
• Categorizations
The data generated are not facts and contain
some uncertainty, but can provide valuable
direction for moving the business forward.
Descriptive Analytics Predictive Analytics
Predictive modeling is historically very manual, but Machine Learning is
changing the way these models are built.
Technical advances in the past several years have enabled machines to conduct rapid trial and error with massive amounts of
data, producing better models more quickly.
Machine Learning
• Machine learning also assigns weights to input data to
answer a problem
• Algorithms are defined to find solutions via rapid
iterations, where slight improvements are made with
each new piece of information
• Machines can consider thousands of variables to
design the optimal model
• Models can be tested in seconds or minutes
• Models can be designed to self-adapt in real time to
changes in the environment
• Advanced Machine Learning techniques can solve
complex problems like interpreting visual or audio
input using unstructured data
Traditional Modeling
• Traditional modeling involves selecting “important” input data and
assigning weights to build models
• Simplistic models are often solved mathematically, but
improvements require guesswork
• Human experts must select a reasonable amount of input data for
hypothesis testing when designing their models
• Humans may only be able to evaluate a handful of models over the
course of weeks or months
• Models are reviewed periodically (often annually or even less
frequently) for adjustments
• Traditional modeling techniques require structured data and are
too simplistic to emulate human intuition
Machine Learning is mostly thought of in two flavors.
Supervised Learning
We know the “right answers” for some of the scenarios.
– We may have history we can look back on
– We may be hoping to replicate human decision making
Unsupervised Learning
There aren’t necessarily “right answers,” we just want to
get a better understanding of our data.
Supervised or Unsupervised?
Predict our profits next quarter. Supervised
Identify the number written on a check. Supervised
Group our customers into segments. Unsupervised
Predict a user’s rating for a given product. Supervised
Find the most important variables in a dataset. Unsupervised
Identify credit card transactions that are out of the ordinary. Unsupervised
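The distinction shows up directly in code. A minimal sketch, assuming scikit-learn is available; the customer numbers and labels below are invented for illustration.

```python
# Same customer data used two ways: supervised (we have labels) and
# unsupervised (we only want groupings). Feature values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised
from sklearn.cluster import KMeans                   # unsupervised

# Hypothetical customers: [annual spend, visits per month]
X = np.array([[100, 1], [120, 2], [900, 8], [950, 9], [500, 4], [480, 5]])

# Supervised: we know the "right answers" (did the customer churn?)
y = np.array([1, 1, 0, 0, 0, 0])
clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict([[110, 1]])   # label for a new customer
print(pred)

# Unsupervised: no answers, just group customers into segments
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)                  # cluster assignment per customer
```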
Collect the data and randomly create initial decision rules.
Design a method for evaluating how good your hypothesis is, and test whether it applies generally.
Update your hypothesis in a way that marginally improves the performance of your decision rules.
Continue this process until either you are satisfied with the results, or the hypothesis
can’t improve anymore with the data available.
In Machine Learning, we expose large quantities of data to a computer so it
can find patterns iteratively.
Create a
hypothesis
Evaluate the
hypothesis
Adjust the
hypothesis
Repeat until
convergence
To illustrate the process, we’ll discuss a general case called Gradient Descent.
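The create/evaluate/adjust/repeat loop above can be sketched with a toy gradient descent, here fitting a line y = w*x + b to invented points by minimizing mean squared error:

```python
# Toy gradient descent: repeatedly evaluate a hypothesis (mean squared
# error of y = w*x + b) and nudge the parameters slightly downhill.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1

w, b = 0.0, 0.0                    # initial hypothesis
lr = 0.05                          # learning rate (step size)

for _ in range(2000):              # repeat until (approximate) convergence
    # Evaluate: gradient of mean squared error w.r.t. w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Adjust: move slightly against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))    # converges close to 2.0 and 1.0
```

Each pass makes only a marginal improvement, but thousands of cheap iterations recover the underlying relationship.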
Another possible issue is “overfitting,” where the model fits its training data but doesn’t
generalize well to other data.
How can we know if a model is overfitted?
Hold out some of the data from training. Once the model has trained, evaluate the model
using the held out “test” data and see if it performs as well as it does for the “training” data.
How can we prevent overfitting?
There are two primary ways to prevent overfitting.
Collect more data Restrict the model
Collecting more data makes it
harder for the model to
perfectly fit the training data, so
it has to do a better job of
generalizing while it is training.
Penalizing the use of each
additional feature or manually
removing features from the
dataset limits the model’s
ability to fit the data exactly.
Alternatively, some algorithms
are naturally less prone to
overfitting, so it may be best to
try a different algorithm.
Now that we have a better understanding of how machine learning
operates, we’ll dive into some of the most popular tools in the space.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
What is Python?
General Purpose Interpreted Programming Language
– Currently the 4th most used programming language
Used in a huge range of applications
– Powering Instagram & Reddit
– Testing Intel microchips
– Video game development
– Data Science
Open source with thousands of libraries for a plethora of
subject areas
Multiple versions (2.7 / 3.6) that are not fully backwards
compatible
Pros & Cons
Speed can be an issue
– Some implementations have been made to
run faster (PyPy)
Restrictions with white-space
Production errors from dynamic typing
Underdeveloped database access layers
Simplicity causes difficulty when transitioning
to other languages
Challenger in the data science arena
Easy to read / learn
Dynamic typing
Everything is an object
Small / good memory management
Open source
Parallelism / Concurrency
Existing Python backends
When to Use Python
Reading data
Data cleansing / preparation / munging
Creating algorithms
Production solutions
Internet of things applications
Big data solutions
Packages within Packages
(Diagram: Package 1 and Package 2 containing Modules 1 through 5, illustrating how packages nest within packages)
Package Managers
conda
Installed with Anaconda
distribution of Python
Uses packages included in
Anaconda distribution
Packages more specifically
for data science
Can also manage non-Python dependencies
Can also manage virtual
environments
VS.
pip
Installed with Python 3
Uses packages in PyPI
(Python Package Index)
More diverse packages
Can only manage python
dependencies
Needs separate tool to use
virtual environments
Useful
Packages/Modules
NumPy
– Numerical analysis
– Operations with
arrays
Pandas
– Built on top of
NumPy
– Works with data
frames
Matplotlib
– Visualizations
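A minimal taste of the packages above, with invented numbers: a NumPy array wrapped in a pandas data frame, plus a derived column. (Matplotlib would visualize the result, e.g. `df.plot()`, but plotting is omitted here.)

```python
# NumPy for array math, pandas for data-frame operations on top of it.
import numpy as np
import pandas as pd

prices = np.array([10.0, 12.5, 11.0, 14.0])          # NumPy array
df = pd.DataFrame({"day": range(1, 5), "price": prices})
df["change"] = df["price"].diff()                    # pandas column op

print(df)
print(df["price"].mean())                            # 11.875
```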
Useful
Packages/Modules
SciPy
– Linear algebra
– Optimization
Scikit-Learn
– Machine learning
TensorFlow
– Deep Learning
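Scikit-Learn in particular covers the train/test workflow from the earlier slides. A small sketch on toy, perfectly separable data (the dataset is invented):

```python
# Typical scikit-learn pattern: split data, train a model, evaluate it
# on the held-out portion.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = np.array([[i] for i in range(20)])
y = np.array([0] * 10 + [1] * 10)        # label is 1 when feature >= 10

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_tr, y_tr)
print(model.score(X_te, y_te))           # accuracy on held-out data
```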
Next we’ll review R, which is a more specialized tool that is gaining
popularity.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
What is R?
R is a language and environment used mostly for
statistical computation and graphics.
Includes a large array of data manipulation and
data analysis functions.
R is open source, meaning that any user can
write their own packages.
R (language)
RStudio (IDE)
Useful Packages
LOAD DATA
RMySQL
RODBC
xlsx
MANIPULATE DATA
dplyr
tidyr
stringr
lubridate
data.table
VISUALIZE DATA
ggplot2
rgl
MODEL DATA
stats
forecast
caret
randomForest
REPORT RESULTS
shiny
rmarkdown
Heavily code-oriented (command-line interface)
Steep learning curve
Poor memory management
Single threaded
One based indexing
Pros & Cons
Free
Open source
Highly customizable models and graphics
Use within AzureML
Online community
Popularity in education (Adoption)
When to Use R
Statistical computation
Data Profiling
Data wrangling/manipulation
Descriptive analytics
Predictive models
Microsoft leverages the Azure stack to make Machine Learning more
accessible.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning Studio: A graphical approach
Wrap Up
What is Azure Machine Learning Studio?
Streamlines the process to reduce the time between
design and deployment
Takes advantage of the accessibility and computing
power of the cloud
Empowers people to create and deploy predictive
analytics solutions
Visually Build Your Model
(Screenshot: a drag-and-drop experiment canvas vs. hand-written code)
Work With Modules Not Code
(Screenshots of module categories: Input / Output, Transformations, Statistics, Machine Learning)
Compatible with SQLite
Things to Keep in
Mind w/ SQLite
• Dynamic typing of variables means implicit
type conversions
• LEFT OUTER JOIN works, but RIGHT OUTER
JOIN and FULL OUTER JOIN are not
supported
• Partial ALTER TABLE
support
• Read-only Views
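These quirks can be reproduced with Python's built-in `sqlite3` module, which uses the same engine:

```python
# SQLite's dynamic typing and operator behavior, demonstrated with the
# standard-library sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (qty TEXT)")
con.execute("INSERT INTO t VALUES ('5')")

# Implicit type conversion: a TEXT value happily does arithmetic
print(con.execute("SELECT qty + 1 FROM t").fetchone())         # (6,)

# String concatenation uses ||; '+' coerces strings to numbers (0)
print(con.execute("SELECT 'a' + 'b', 'a' || 'b'").fetchone())  # (0, 'ab')

# RIGHT/FULL OUTER JOIN support depends on the SQLite version
# (added in SQLite 3.39; older engines raise an error)
try:
    con.execute("SELECT * FROM t RIGHT OUTER JOIN t AS u ON 1=1")
    print("RIGHT OUTER JOIN supported by this SQLite build")
except sqlite3.OperationalError as e:
    print("RIGHT OUTER JOIN not supported:", e)
```

Note the concatenation pitfall: `'a' + 'b'` silently returns 0 rather than raising an error.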
When to Use R/Python Scripts in Azure ML
Developing a new Model
Visualizations
Reading Data From Other Sources
Custom Math Operations
Splitting Columns
Advanced Time Series Algorithms
Inputs & Outputs for R/Python Scripts
Inputs:
Input cannot come directly from a CSV file; it must be a dataset
Can have up to 2 input datasets + 1 zip file
Outputs:
Output must be a data.frame
Can have only 1 output dataset
Default Python Imports:
math
numbers
datetime
re
pandas as pd
numpy as np
scipy as sp
R/Python Packages in Azure ML
Distributions in Azure ML:
Python : Anaconda 4.0
R : Microsoft R Open
Package Requirements:
Cannot have a Java dependency
Cannot require internet/networking
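The Execute Python Script module wraps your code in an entry-point function that receives up to two pandas data frames and must return a data frame wrapped in a tuple. A minimal sketch; the column names here are hypothetical:

```python
# Sketch of the azureml_main entry point used by Execute Python Script.
# Column names (qty, price, total) are invented for illustration.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    df["total"] = df["qty"] * df["price"]   # add a derived column
    return df,                              # AML expects a tuple of outputs

# Local test with a stand-in dataset
sample = pd.DataFrame({"qty": [2, 3], "price": [10.0, 4.0]})
out = azureml_main(sample)[0]
print(out)
```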
Data Types
Python R Azure ML
Integer Integer Integer
Float Double Double
Complex Complex Complex*
Boolean Logical Boolean
String Character String
TimeDelta DiffTime TimeSpan
- Factor Categorical
Data.Frame Data.Frame Dataset
DateTime POSIXct Vector DateTime
List List -
*Azure recognizes complex numbers; however, certain modules do not
Export:
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
Connectivity
Import:
Web URL
Hadoop (HiveQL)
Azure Blob Storage
Azure Table
Azure SQL Database/Data Warehouse
SQL Server on Azure VM
On-Premise SQL Server Database
– Data Management Gateway
Data Feed Provider (OData)
Azure Cosmos DB
Standard
$9.99 per seat per month
$1 per studio experimentation hour
Azure subscription required
No cap on module numbers
7 days per experiment w/24 hours per module
Can access on-premise SQL(preview)
Storage space is unlimited (BYO)
Multiple nodes
Includes production web API
Includes SLA
AML Studio Tiers and Pricing
Free
Azure subscription not required
100 module max per experiment
1 hour runtime max per experiment
Cannot access on-premise SQL server
10GB storage space
Single node performance
Does not include production web API
Does not include SLA
Consuming Via RESTful API
Request/Response or Batch Request
Connects to any programming language
that supports HTTP request response
Excel Online/Excel add-in
Web App Templates
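A request-response call can be built with only the standard library. In this sketch the URL, API key, and input schema are all placeholders that must match your own deployed service; the actual send is commented out so the snippet runs without a live endpoint:

```python
# Sketch of a request-response call to an AML web service. The URL,
# API key, and column names below are placeholders, not real values.
import json
import urllib.request

url = "https://example.azureml.net/workspaces/<id>/services/<id>/execute"  # placeholder
api_key = "<your-api-key>"                                                 # placeholder

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["feature1", "feature2"],   # hypothetical schema
            "Values": [[1.0, 2.0]],
        }
    },
    "GlobalParameters": {},
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + api_key},
)
# Sending is commented out so the sketch runs without a live service:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
print(req.get_method())   # urllib infers POST when a body is attached
```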
Web Tiers and Pricing
DEV: $0 per month; 1,000 transactions/month; 2 compute hours/month; 2 web services; no overage allowed
Standard S1: $100.13 per month; 100,000 transactions/month; 25 compute hours/month; 10 web services; overage: $0.50 per 1,000 transactions, $2 per API compute hour
Standard S2: $1,000.06 per month; 2,000,000 transactions/month; 500 compute hours/month; 100 web services; overage: $0.25 per 1,000 transactions, $1.50 per API compute hour
Standard S3: $9,999.98 per month; 50,000,000 transactions/month; 12,500 compute hours/month; 500 web services; overage: $0.10 per 1,000 transactions, $1 per API compute hour
Closed Source
Can have performance/speed issues
Fewer built-in algorithms
Limited customization (OOTB / without coding)
Poor model visibility
Pros & Cons
Free base level with many paid tiers
“Drag and drop” graphical user interface
Runs off the cloud
Supports custom R and Python code
Easy to create web service
Easy to retrain
To review, we’ll compare the tools we’ve discussed and when each is
appropriate for a machine learning problem.
Machine Learning Introduction
Python: A general purpose language
R: A statistical analysis language
Azure Machine Learning: A graphical approach
Wrap Up
Pros/Cons (● = strong, ◐ = partial, ○ = weak)
Python R Azure ML
Free ● ● ◐
Open Source ● ● ○
Ability to customize ● ◐ ○
Ease of use ◐ ◐ ●
Online community (documentation) ● ● ◐
Popularity ● ● ◐
Graphical user interface ○ ◐ ●
Ability to create graphics ◐ ● ○
Ease to deploy model ◐ ○ ●
Ability to connect to database ◐ ◐ ●
Memory Management ● ◐ ●
When to Use
Python R Azure ML
Consuming Big Data ✓ ✓
Basic Descriptive Analysis ✓
Data Profiling ✓
Data Visualization ✓ ✓
Data Wrangling ✓ ✓
Production ✓ ✓
Easy Web Service ✓
Model Development ✓ ✓ ✓
THANK YOU
www.ccganalytics.com | (813) 968-3238
Editor's Notes
  1. Developed by Dutch programmer Guido van Rossum and released in 1991. At the time he was in a big Monty Python phase, so he named it Python
  2. General purpose: can be used for almost anything. Interpreted: is not compiled beforehand, but interpreted as it goes. Open source: source code is available to the public for free, with new packages every day from around the world. Multiple versions: not fully backwards compatible, some syntax differences. Python 3 is not a drastically new language, but Python 2 code may not run in a Python 3 environment
  3. Easy to read/learn: focused on simplicity; the syntax tries to mimic natural language (the Python motto). Dynamic typing: no need to specify variable types; they are defined based on the data assigned. Everything is an object: great for object-oriented programming. Small/good memory management: doesn't take up much space or use much memory. Open source: tons of custom packages and tons of resources online. Parallelism/concurrency: able to perform more than one task at a time. Existing Python backends: if being implemented into an existing system, many environments already have Python backends. Speed: it's an interpreted language, so it can be a tad slow; there are faster open-source implementations such as PyPy. White-space: the fact that Python relies on indentation rather than parentheses can limit customization. Dynamic typing errors: variables can assume the wrong type and cause errors when pushed to production. Underdeveloped DB access: there are packages for it, but still a lot of room for improvement. Switching between languages: Python's simplicity can make it more difficult to understand other languages. Challenger in data science: still catching up in popularity and data science functionality
  4. - Reading data: for when CSVs are too large to be handled effectively in Excel, or data in other forms such as audio or pictures - Data cleansing/preparation/munging: handling null values, transforming and assigning logic around features, to make data more usable - Creating algorithms: many algorithms are available through packages, but it is very easy to create your own - Production solutions: great for models being introduced into existing systems, especially on-prem - Internet of things applications: tons of new "smart" devices these days, and Python has become a popular tool for handling their operation - Big data solutions: some may have heard of Apache Spark, a big-data cluster platform for processing huge volumes and streams of data; it has the built-in ability to run PySpark, its implementation of Python
  5. Module: a collection of functions and definitions stored in a Python script file. Package: a group of modules and/or other packages. Packages are then imported into Python scripts, allowing them to use functions defined in the package
  6. Package manager: used for easy installation as well as managing dependencies Pip is the default package manager that is installed with Python 3 Conda is another package manager that is installed with the Anaconda distribution of python Pip installs packages from the Python Package Index, the official and largest repository Conda installs packages included with the Anaconda distribution, with packages focusing on data science Pip can only manage python dependencies Conda can also handle dependencies from other languages such as java or R Virtual environment: sometimes required to separate certain packages based on conflicting dependencies or to minimize memory usage With pip, you need an additional tool to use virtual environments Conda has this capability built in
  7. NumPy: defines the numpy array object, develops arithmetic around these arrays, as well as other numerical analysis methods Pandas: most popular package for data science, built on top of numpy, defines the pandas data frame object and operations for manipulating the data within it Matplotlib: the most popular package for visualizations, huge variety of capabilities while being relatively easy to use
  8. SciPy: many functions around linear algebra and optimization, which are the cornerstone of almost all machine learning; good to use if creating custom algorithms. Scikit-Learn: almost any machine learning algorithm you could need can be found in this package, as well as many functions for customizing and evaluating your model or splitting your data into train and test sets. TensorFlow: a package that is getting very popular these days, developed by Google; it implements deep learning, a specific and complex type of machine learning, and is popular for image and video analytics as well as many other things
  9. Rather than focus heavily on coding, AML lets you build models by dragging and dropping modules; we'll walk through how to build a simple regression model in our workshop
  10. Since I introduced you to the select columns module, I wanted to show you what else is available. Not a comprehensive list. AML has modules for importing/exporting, modules for different models, and transforms. One of the best ways to learn AML is to read through the modules and see what's available in the toolkit. A few standout modules I want to bring to your attention
  11. As you start working with AML some things may come across as a bit quirky. A huge thing is you have to work one step at a time. If you have a table of orders with a name, date, qty, and all the fields are strings, you will need one module to turn date into a datetime and another one to turn qty into an integer; you can't do that in one step like you could with, say, a SQL select statement. While AML is a Microsoft tool, it does not use T-SQL; the engine runs SQLite. Certain functions will not work and may not throw back any errors. An example is string concatenation with "+", which will throw back a 0
  12. While AML is a Microsoft tool, the SQL is not T-SQL, it is SQLite. Certain functions will not work and may not throw back any errors. An example is string concatenation with "+". Uses dynamic typing of variables; implicit type conversion is allowed. Left outer join is valid; right outer and full outer joins are not. You can rename/add a column with ALTER TABLE but not drop a column, alter a column, or add a constraint. Views are read-only
  13. Another big set of modules are the execute-code scripts
  14. When working with different scripts it's important to keep in mind the data types used, especially when you are trying to bring in data from other data sources
  15. Now moving into connectivity. Great connectivity with other services in the Azure stack. You can import and export to/from Blob storage, Table storage, and Azure SQL databases. Hadoop works great as well. If you work on-prem, it is currently a preview feature, meaning it's not 100% fully supported (a prerelease/beta). One cool thing is the web URL option: you can pull data from HTTP websites and run machine learning on that information
  16. Pros: Free to try, and there are multiple tiers designed to prevent you from overpaying (pay for what you need). Easy to use and get started with. Leverages the cloud; work from anywhere. Web services are easy to deploy and manage. Easy to retrain. Cons: Generally a little slower than traditional model training. The next two go hand in hand: Microsoft develops the premade modules, so there's a limited selection unless you can code. Closed source: you know generally what modules do, but they are a black box. Poor model visibility: unable to see training progress for models; 7 days max on a model with no progress bar
  16. Pros Free to try, and there are multiple levels designed to prevent you from over paying/pay for what you need Easy to use and get started with Leverages the cloud, work from anywhere Web services are easy to deploy and manage Easy to retrain  Cons Generally is a little slower than traditional model training Next 2 go hand in hand Microsoft develops the premade modules, so there's a limited selection unless you can code. Closed source – you know generally what modules do, but they are black box Poor model visibility – unable to see training progress for models, 7 days max on model with not progress bar.