Study Notes - Lesson 1 - 7 PDF
Scholarship Program for Microsoft Azure
LESSON 2
The number of channels required to represent the color is known as the color depth or
simply depth. For an RGB image, depth = 3, while a grayscale image has depth = 1
Text analysis pipeline ; Text preprocessing (removing stop words, stemming & lemmatizing),
text extraction (vectorization)
Linear Regression
Linear Regression simplifies the target function Y to a line
Assumptions
Assumes that the relationship between Input & Output is linear
Assumes that input & output aren't noisy, so it is expected that you remove outliers, which
act as noise
Remove collinearity ; Linear regression can overfit when there are highly correlated features,
hence it is important to calculate the pairwise correlation for the input data and drop the most
correlated features (see the sketch after this list)
Gaussian distribution ; Input and output should be Gaussian distributed
Rescale inputs
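A minimal sketch of the collinearity check mentioned above, using pandas and NumPy (assumed libraries; the feature names and the 0.9 threshold are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical input features
df = pd.DataFrame({
    "sqft": [850, 900, 1200, 1500, 2000],
    "rooms": [2, 2, 3, 4, 5],
    "age": [30, 25, 10, 5, 1],
})

# Absolute pairwise correlation between input features
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any feature that is very highly correlated with another one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)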
Learning function
The core goal of machine learning prediction is to learn a useful transformation F that maps
inputs to predictions as close as possible to the actual output values.
Y = F(x) + e
Where e is the irreducible error, which is independent of the input data (x): no matter
how good our model gets, we can't completely remove this error
Limitations
Highly constrained
Limited complexity
Poor fit
Non Parametric
They do not make assumptions regarding the form of the mapping between input
data and output; they are free to learn any form from the data (e.g. KNN)
Benefits
High flexibility, capable of fitting a large number of functional forms
High performance
Limitations
More training data required
Slower to train due to having more parameters to tune.
Prone to overfitting and difficult to explain
DL advantage
Used to learn from large, complex data when it is available
DL disadvantage
Black box due to difficulty in explanation
Large computation resource required
Classical ML advantages:
More suitable for small data
Easier to interpret outcomes
Cheaper to perform
Can run on low-end machines
Does not require large computational power
Classical ML disadvantages:
Difficult to learn large datasets
Require feature engineering
Difficult to learn complex functions
Unsupervised ; the algorithm learns from data that contains only inputs and finds hidden
structures in the data
o Clustering ; assigns clusters or groups. The goal is to find inherent groups or
clusters within the data while maximizing intra-cluster similarity and inter-cluster
dissimilarity, e.g. customer segmentation
o Feature learning ; features learned from unlabeled data
o Anomaly detection ; learns from unlabeled data using the assumption that most of the data is normal
Reinforcement learning (RL) ; an agent learns by taking actions in an environment so as to
maximize a reward function
What differentiates between Supervised, Unsupervised and RL is that the first 2 are passive
while RL is active. Passive in the sense that learning is performed without any action that
could influence what data could be observed in the future while RL is an active approach
where the action of the agent influences the environment.
Bias ; Simplifying assumptions made by a model to make the target function easier to learn
Variance ; Amount that the estimate of the target function will change if different training
data was used
As a general rule of thumb, parametric and linear algorithms often have high bias and low
variance, while non-parametric algorithms have low bias and high variance.
Overfitting ; memorizing the training data so the model doesn't generalize well to new data
Underfitting ; neither modeling the training data nor generalizing to new data, caused by
over-simplifying the assumed learned function
Any machine learning algorithm and solution aims to reduce the prediction error, which
consists of;
Prediction error = Bias Error (BE) + Variance Error (VE) + Irreducible Error (IE)
IE = This is independent of the algorithm used and can be caused by variables not captured in
the data used to train the model
BE = This is induced by bias; low bias corresponds to fewer simplifying assumptions about the
target function, e.g. decision trees, KNN
VE = This is influenced by how much the estimate of the target function changes with the
training data used; algorithms that make more assumptions about the target function, e.g.
linear and logistic regression, have lower variance
The goal of a machine learning problem is to have low bias and low variance. The optimal
model complexity is where the bias error crosses the variance error.
Overfitting ; Good performance on training data and poor performance on new/test data,
because the model has memorized all the details of the training data.
Limiting Overfitting
Resampling techniques like K-fold cross-validation (see the sketch below)
Hold back a validation dataset from the initial training data
Simplify the model
Use more data if available
Reduce dimensionality in the training dataset
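A minimal K-fold cross-validation sketch with scikit-learn (assumed library), using a built-in dataset for illustration:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold is held out once as validation data
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Mean R^2:", scores.mean(), "+/-", scores.std())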
Underfitting ; Can neither model the training data nor generalize on the new/test data
In summary
High variance causes overfitting
High bias causes underfitting
LESSON 3
Model training is the process through which data is transformed into a trained machine learning
model. It is one of the most important processes in the machine learning pipeline; it allows you to
build, train and check the quality of a machine learning model.
Data store ; offers a layer of abstraction over the supported Azure storage services; it stores all
the information required to connect to a particular storage service. Data stores can be shared and
accessed simultaneously by different instances of the process. They answer the question "how
do I securely connect to the data in my Azure storage?"
Specific data files in the data stores are then accessed through datasets.
Data set ; defines how to get the specific data files that contain either the training or validation data
for the machine learning task. The process of data preparation is one of the most important in a
machine learning pipeline. Datasets are created from public datasets, from a URL, or by uploading
from local files. Datasets are therefore references that point to the data in the storage service
Data stores & datasets offer a secure, scalable and reproducible way to deliver data to all
machine learning tasks
Data Sets
They are used to interact with your data in the data stores and to package it into a consumable
object for other machine learning tasks. They are not copies of your data; rather, they only reference
the original data kept in the storage service, hence no copy or duplicate of the data (which might
lead to extra billing) is made when a new dataset is created.
Azure ML datasets are required when you need to access data for your local or remote ML
experiments, and are also used as input and outputs for ML pipelines
One key benefit is the ability to take data in the data store and reference it
multiple times for use
Types
Tabular Datasets
File Datasets (e.g. from a web URL)
Dataset versioning plays a key role in bookmarking a data state for retraining and tracing of the
data. It's very useful ;
o When new data is available for retraining
o When you are applying different data preparation or feature engineering
approaches (a dataset registration sketch follows this list).
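A minimal sketch of creating and registering a versioned dataset with the Azure ML Python SDK (v1, azureml-core); the datastore name, file path and dataset name are hypothetical:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                    # connect to the workspace
datastore = Datastore.get(ws, "my_datastore")   # hypothetical datastore name

# A TabularDataset only references the CSV in the datastore, it does not copy it
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))

# Registering with create_new_version=True bookmarks a new data state (versioning)
dataset = dataset.register(workspace=ws, name="training-data", create_new_version=True)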
Feature Engineering
One of the challenges in ML is selecting the best features, those most appropriate and
suitable for the type of algorithm you are trying to model. Also, many times the existing features
aren't enough, which leads to engineering new features to train the machine learning model;
this is called Feature Engineering (FE). Similar to FE is Feature Selection (FS), which is used to
select the best features for training a machine learning model when the dataset has a large
number of features the model could learn from, which wouldn't be optimal in many cases.
Hence the need to reduce such features by selecting the most useful ones using different
selection techniques, one of which is called Dimensionality Reduction
FE helps increase the performance of a machine learning model by leveraging existing features to
generate new features that might be useful in improving model performance. FE isn't always
necessary, because there are cases where the existing data is rich and significant enough to
train the model.
Classical ML is much more reliant on FE than Deep Learning (DL)
Examples of FE tasks
Aggregation
Part-of (Extract a part of a certain data structure, e.g. the month part of a date)
Binning
Flagging
Frequency-based
Embedding
Deriving by example
FE processes are applied differently depending on the data types you are working with.
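A small illustration of some of the FE tasks above (part-of, binning, flagging) with pandas (assumed library; column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-05", "2021-02-17", "2021-03-02"]),
    "amount": [12.5, 230.0, 87.0],
})

# Part-of: extract the month part of the date
df["order_month"] = df["order_date"].dt.month

# Binning: group a continuous amount into discrete ranges
df["amount_bin"] = pd.cut(df["amount"], bins=[0, 50, 150, 500],
                          labels=["low", "medium", "high"])

# Flagging: derive a boolean indicator feature
df["is_large_order"] = df["amount"] > 100
print(df)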
Feature Selection
FE is about creating new features in a dataset; FS, however, helps in selecting the best features
to model the data so that the feature space doesn't explode
Reasons for FS
Elimination of irrelevant, redundant, or highly correlated features
Reduce dimensionality
Techniques
The curse of dimensionality means that many algorithms do not perform well when the data has a
high number of dimensions. Dimensionality reduction algorithms (see the PCA sketch after this list) include;
PCA (Principal Component Analysis)
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Feature embedding
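A minimal PCA sketch with scikit-learn (assumed library), reducing a built-in dataset to two components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)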
Data Drift ; It is the change in the input data for a model. It causes degradation of the model's
performance.
Causes
Changes in the upstream process
Data Quality issues
Natural data drift due to the intrinsic nature of the data
Changes in the relationship between different features
Data drift is one of the top reasons model accuracy degrades over time. A model training process
doesn't finish after the first training; rather, it's an iterative process, and there is a need to
constantly monitor the performance of the model. One way to do that is to
continuously monitor the data drift.
Data drift is monitored using datasets. Scenarios for setting up data drift monitoring;
Monitoring a model's input data for drift from the model's training data
Monitoring a time series dataset for drift from a previous time period
Performing analysis on past data
Model Training
The goal of training in ML process is to produce a learned model that can later be used. The basic
training process or pre include;
Understanding the data
Preparing/transforming the data
Creating new features (FE)
Feature selection
Data splitting ; Train, Validation & Test (where Train & Validation are used in the training
process while Test is used for evaluating and validating the outcome of the model; see the split sketch below)
Model training involves the iterative process of tuning the parameters and hyperparameters to
improve the performance of the model.
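A minimal train/validation/test split sketch with scikit-learn (assumed library); the 60/20/20 proportions are just an example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set, then carve 20% of the total out as validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%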
Taxonomy of Azure ML
Azure Workspace ; the interface that must be created before starting anything on Azure
ML
Compute Instances ; They provide access to environments such as Jupyter notebooks; there is
also a designer to perform codeless tasks
Datasets ; They make data available for the ML process
Experiments ; Tasks performed in the Azure ML studio; an experiment is a container that
groups related artifacts
o Run ; e.g. training, validation, etc. Every run produces outputs such as:
Snapshot
Output files
Metrics
Logs
Possible compute targets ; Azure ML can run on various compute environments such as:
local machine, DSVM, Databricks, Kubernetes, etc.
Model Registry ; A service that enables versioning of trained models for easy bookmarking or
tracing
Deployments ;
Algorithms used (Classification)
Logistic Regression
Support Vector Machine (SVM)
Algorithms used (Regression)
Linear Regressor
Decision Forest Regressor
Confusion Matrix
Accuracy ; The proportion of correct predictions ; (TP + TN)/(TP + FP+ FN+TN)
Precision ; It is the proportion of positive cases that were correctly identified ; TP / (TP + FP)
Recall ; The proportion of actual positive cases that were correctly identified ; TP /
(TP + FN)
F1 Score ; Measures the balance between the Recall and Precision; 2 * (Precision * Recall)/
(Precision + Recall)
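A quick check of these formulas with scikit-learn (assumed library), on made-up labels:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))      # (TP + TN) / (TP + FP + FN + TN)
print(precision_score(y_true, y_pred))     # TP / (TP + FP)
print(recall_score(y_true, y_pred))        # TP / (TP + FN)
print(f1_score(y_true, y_pred))            # 2 * (Precision * Recall) / (Precision + Recall)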
Model Evaluation Chart (ROC ; Receiver Operating Characteristic, for classification) - The graph of
True Positive Rate against False Positive Rate
The Area Under the Curve (AUC), typically between 0.5 and 1
Gain and Lift charts
Regression Evaluation Metrics
1. RMSE
2. MAE
3. R^2 (Coefficient of Determination); measures how good the model is at explaining the
variability of the data
4. Spearman correlation
Strength in Numbers
No matter how well-trained an individual model is, there is still a significant chance that it could
perform poorly. Automated ML helps scale up the process of training models by combining the
results/strengths of individual models
Ensemble Learning
It is used to combine multiple ML algorithms to produce one powerful predictive model with
higher accuracy. There are 3 types (a bagging vs. boosting sketch follows this list);
1. Bagging or Bootstrap aggregation ; This is used to reduce overfitting for tree-based
algorithms such as decision trees. It involves using random subsampling of the training data
to produce a bag of trained models
2. Boosting ; combines weak learners sequentially, where the final prediction is a weighted
average of the individual models
3. Stacking ; trains a large number of completely different models and combines their outputs.
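A minimal bagging vs. boosting sketch with scikit-learn (assumed library), on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: random subsamples of the training data produce a "bag" of trained models
# (the default base learner of BaggingClassifier is a decision tree)
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: weak learners are added sequentially and combined into a weighted ensemble
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())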
Unsupervised Learning
Clustering
Representation Learning
Supervised
Classification ; when the predicted output is discrete and belongs to a class e.g.
o Classification on tabular data
o Classification on image or sound data
o Classification on text data
Example;
o Computer Vision
o Speech recognition
o Sentiment analysis
o Anomaly detection
o Credit risk scoring
There are 3 types of classification;
Two -Class Classification
o Two categories (Yes/No, True/False)
Multi Class Single Label Classification
o Multiple categories, output belongs to single category e.g. Red, Blue, Green
Multi-Class Multi-Label Classification; Output can belong to one or more categories
Introduction to Regression
Categories of Algorithms
Linear Regression
o Linear relationship between independent variables and a numeric outcome
o Approaches
Ordinary Least Square Method
Gradient Descent
Decision Tree Regression
o Ensemble Learning method using multiple decision trees
o Each tree outputs a decision
Neural Network Regression
o Supervised Learning method, therefore requires a labeled target
Unsupervised learning groups those algorithms that rely only on the input in the training process.
The name comes from the fact that there is no expected output; the algorithm learns from the input
data to discover hidden behavioral patterns and produce an output. Unsupervised learning helps
discover useful information from unlabeled data. Applications of unsupervised ML include;
Anomaly detection
Customer segmentation
Types of Algorithm;
Clustering : Occurs when entities from the input data must be assigned into a finite number
of subsets called clusters
Feature learning ; Learning is used to transform a set of inputs into other inputs that are
potentially more useful in solving a given problem
Anomaly Detection ; To identify 2 major groups of entities;
o Normal
o Abnormal (Anomaly)
Application of Anomaly detection can be seen in Spam detection & Credit Card
Fraud detection
Dimensionality Reduction
The underlying motivation for unsupervised ML problems is that the labeled data used for supervised
ML is often hard to acquire and expensive, whereas acquiring unlabeled data for unsupervised
ML tasks is usually inexpensive.
Semi-Supervised ML
It leverages the fraction of the training data that is labeled to label the remaining
unlabeled data, using;
Self-training ; The model is trained using the labeled data and then used to make prediction
for the unlabeled data where the end result is a fully labeled data
Multi-view training ; This means training multiple models on different views of the data,
which can mean different feature selections, different parts of the training data, or different
model architectures
Self-ensemble training ; This is similar to multi-view training except that a single base model
is used with different hyperparameters on different views of the data
Clustering
Clustering is a problem of organizing entities from the input data into a finite number of subsets
or clusters with the goal of maximizing;
Intra cluster similarity
Inter cluster dissimilarity
K-Means Clustering
It is a centroid-based unsupervised algorithm which creates up to a target (K) number of clusters
by grouping similar members; the objective is to minimize the squared error between each member
and its cluster centroid, that is, maximizing intra-cluster similarity (see the sketch below)
Initialize Centroids >> Cluster Assignment >> Move Centroids >> Check for Convergence
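A minimal K-Means sketch with scikit-learn (assumed library), on synthetic blob data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# fit() runs: initialize centroids -> assign clusters -> move centroids -> check convergence
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.inertia_)           # sum of squared distances of members to their centroid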
LESSON 5
Classical Machine Learning vs. Deep Learning
One of the factors that separates DL from classical ML is its inherent capability of learning new
features without the explicit Feature Engineering that is common with classical ML.
Artificial Neural Networks, though inspired by the human brain, do not copy the
human brain.
Similarity Learning
This is closely related to classification and regression; however, it uses a different type of objective
function. It is mostly applied in recommender systems and also used in solving verification
problems such as speech, face, etc.
Similarity learning as a supervised learning approach is treated as a classification task where a
similarity function maps pairs of entities to a finite number of similarity levels (0/1), similarly it
can be treated as a regression approach where the similarity function maps pairs of entities to
numerical values.
The main aim of a recommendation system is to recommend one or more items to users of the
system. The approach to Recommendation engine include;
1. Content-Based ; makes use of features for both users and items
2. Collaborative Filtering ; uses only identifiers for users and items and gets information from a
matrix of ratings.
Text Classification
Translating text into numerical formats is referred to as text embedding (word embedding &
scoring). In word embedding we try to transform every text in the dataset into some form of
numeric feature, whereas in scoring we aim to calculate some kind of score related to the
importance of each word in a text. The resulting numerical representation, usually in the form of
vectors, is then used as input to the classification algorithm.
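A minimal sketch of score-based text vectorization (TF-IDF) feeding a classifier, with scikit-learn (assumed library; the texts and labels are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, works well", "terrible, broke after a day",
         "works as expected", "awful quality, do not buy"]
labels = [1, 0, 1, 0]   # hypothetical sentiment labels

# TF-IDF scoring turns each text into a numeric vector weighted by word importance
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["product broke, awful"]))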
Text preprocessing
Feature Learning
Feature engineering is one of the core techniques that can be used to increase the chances of
success in solving machine learning problems. As a part of feature engineering, feature learning
(also called representation learning) is a technique that you can use to learn or derive new
features in your dataset. This is because having the right input features is a vital prerequisite for
training an ML model.
Feature learning is used to transform a set of inputs into another set of inputs.
Approaches;
Supervised Feature Learning ; New features are learned using data that has already been
labelled e.g.
o Datasets that have high cardinality (one-hot encoding blows up the feature space,
that is, the dimensions of the data)
o Image classification
Unsupervised Feature Learning ; Based on learning the new features without having
labeled input data.
o Clustering = a form of feature learning where the cluster identifier can serve as a new feature
Anomaly Detection
It is a machine learning technique concerned with finding data points of interest that deviate
significantly from the norm. It can be approached as both a supervised and an unsupervised ML task.
Usually the number of abnormal entities is much smaller than the number of normal entities, which
means anomaly detection often involves an imbalanced dataset that makes the problem quite difficult to
solve.
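A minimal unsupervised anomaly-detection sketch using Isolation Forest from scikit-learn (assumed library; this is one possible technique, not necessarily the one used in the course):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # many normal entities
outliers = rng.uniform(low=-6, high=6, size=(6, 2))      # a few abnormal entities
X = np.vstack([normal, outliers])

# contamination reflects the (small) expected share of anomalies in the data
model = IsolationForest(contamination=0.03, random_state=42).fit(X)
labels = model.predict(X)   # +1 = normal, -1 = anomaly
print((labels == -1).sum(), "points flagged as anomalies")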
Forecasting
Deals with sets of events that can be ordered in time. This ordering usually implies a connection
with dates or timestamps.
Types of Forecasting Algorithms
ARIMA - AutoRegressive Integrated Moving Average (evolution of ARMA - AutoRegressive
Moving Average); see the sketch after this list
Multi-Variate Regression - Time-series forecasting can be pictured as a form of regression
problem.
Prophet ; works best with time series that have strong seasonal effects.
Temporal Convolutional Network (TCN) (based on 1D convolutions) - It is capable of exhibiting a
longer memory than other types of forecasting algorithms
RNNs (Recurrent Neural Networks)
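A minimal ARIMA forecasting sketch with statsmodels (assumed library; the series values and the (1, 1, 1) order are hypothetical):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series; in practice load your own time-ordered data
series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
                   index=pd.date_range("2020-01-01", periods=12, freq="MS"))

model = ARIMA(series, order=(1, 1, 1)).fit()   # (p, d, q): AR terms, differencing, MA terms
print(model.forecast(steps=3))                 # forecast the next 3 periods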
LESSON 6
Conventional ML requires several tools to prepare data and deploy models, and can be tedious due to;
Lengthy installation and setup process
Expertise to configure hardware
A fair amount of troubleshooting
Managed services save the day with very little setup and easy configuration for any hardware
since they are cloud based; that is, they provide a ready-made environment that is pre-optimized for
machine learning development and deployment
Compute Resources
Training clusters are compute resources used for training models; the local machine (local
compute) can also be used for small tasks. A training cluster is used for training and batch
inferencing. Training clusters provide:
o Single or multi-node clusters
o Auto-scaling each time you submit a run, thanks to their elastic capacity
o Automatic cluster management and job scheduling
o Support for both CPU & GPU resources to allow for various large tasks
Inferencing Compute
Inferencing is also called scoring. When you have a trained model, you will want to be able to
deploy it or use it when needed
Real time & Batch Inferencing
o Real time - To make inferences for each new row of data in real time; for this,
packaging the model as a web service is the desired approach.
Azure has an Inferencing Cluster for real time inferencing
Azure Kubernetes Service (AKS)
o Batch Inferencing - To make inferences on multiple rows of data, called batches. Any
computing resource that can load the model and the data can be used.
Any compute resource can be used for batch inferencing
Azure ML training cluster
Azure functions
Azure IoT edge
Azure Data Box Edge
Web App >> invokes the ML model via a web service >> model deployed in Azure Kubernetes
Service (server cluster, for high scalability) >> Azure ML used to train the model and store it in
Azure Container Registry
Azure Container Instances and Compute Clusters are managed by Azure Machine Learning
The training of an ML model is the process through which a mathematical model is built from data
that contains inputs and expected outputs, or only inputs in the case of unsupervised learning
Basic Modeling
The steps involved in a generic Model training are as follows;
Experiment : A generic context for handling runs; it is a folder that organizes the artifacts
used in the training process
Runs : They are used to build a trained model; a run contains all artifacts associated with the
training, such as the logs and the script for that model. Azure ML records and takes a snapshot
of all artifacts associated with training the model, such as logs, metadata, etc.
Model registry ; It keeps track of all the models in an Azure ML workspace. A run is used to
produce a model; a model is a piece of code that takes an input and produces an output,
and it can either be produced by a run or originate from outside of Azure ML
o Model = algorithm + data + hyperparameters
When training a model in Azure ML, the following steps are performed (see the sketch after this list);
Create a new experiment for the run; An experiment is a generic context for handling and
organizing runs.
You'll want to create one or more runs within the experiment once it is created
You then register the final model in a model registry once you identify the best model.
After registration, you can then download or deploy the registered model and receive all
the files that were registered.
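A minimal sketch of these steps with the Azure ML Python SDK (v1, azureml-core); the experiment, metric and model names are hypothetical:

from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                                  # connect to the workspace
experiment = Experiment(workspace=ws, name="my-experiment")   # create/reuse an experiment

run = experiment.start_logging()                              # start a run within the experiment
run.log("accuracy", 0.91)                                     # log a metric to the run
run.complete()

# Register the best model in the model registry (the file path is hypothetical)
model = run.register_model(model_name="my-model", model_path="outputs/model.pkl")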
Advanced Modelling
Process involved in end-to-end ML model building:
Data ingestion
Data preparation (such as Normalization, Transformation, Validation & Featurization)
Model building and training (Process such as Hyper-parameter tuning, Automatic Model
selection, Model testing, Model validation)
Model deployment (Deployment & Batch scoring)
These steps are organized into machine learning pipelines, which are used to create and manage
workflows. ML pipelines are made up of distinct steps as highlighted above. Another strength of ML
pipelines is that they are modular, allowing different data scientists / ML engineers to collaborate
while working on separate areas of the workflow
MLOps = applying DevOps principles to machine learning pipelines. MLOps
enables;
o Model reproducibility
o Model Validation
o Model deployment
o Model retraining
It is DevOps for AI which includes;
Automating the end-to-end ML lifecycle
Monitor ML processes
Capture traceability data
Operationalizing of Model
This term means deploying the model somewhere for use outside the testing or
development environment. A typical model deployment flow is as follows;
Get the model file (any format)
Create a scoring script (.py)
Optionally create a schema file describing the web service input (.json)
Create a real-time scoring web service
Call the web service from your applications
Repeat the process each time you want to re-train the model
Azure ML service simplifies these steps. Model scoring and inferencing can be done in real time (on
demand, that is, as the service receives the data) or in batch (where the model is run on large
quantities of existing data, typically on a recurring schedule). A minimal scoring-script sketch follows.
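A minimal scoring-script sketch following the init()/run() pattern that Azure ML real-time web services expect (the registered model name and input format are hypothetical):

# score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

def init():
    # Called once when the web service starts: load the registered model
    global model
    model_path = Model.get_model_path("my-model")   # hypothetical registered model name
    model = joblib.load(model_path)

def run(raw_data):
    # Called for every request: parse the input, predict, return JSON-serializable output
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return predictions.tolist()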
Responsible AI
Approaches to Responsible AI
Model Explainability
o Understand Global behavior/explanation
o Understand specific predictions generated by the model
Evaluate the model for fairness
o Who is neglected?
o Who is misrepresented?
Microsoft AI Principles
1. Fairness : All systems should treat people fairly; build the system with a
diverse group so that bias can be detected and eliminated
2. Reliability and Safety ; End users need to be certain the product will be reliable and
behave in an expected way in a given scenario
3. Privacy and Security ; The system should be secure, data collection should be transparent, and it
should comply with necessary regulations
4. Inclusiveness ; It should benefit everyone and should eliminate unintentional bias.
5. Transparency ; When AI systems help make decisions that impact people's lives, it is important
that there is transparency
6. Accountability ; The developers and AI experts should be accountable for the
solutions they develop
Explainability in Azure ML
Direct Explainers ; They are chosen based on the model type and applied directly to explain the
model. They are used when one knows the best explainer for the job. Examples of model-specific
direct explainers include;
o SHAP Tree Explainer, used for tree-based or ensemble models
o SHAP Deep Explainer for DL
o Mimic explainer
o SHAP Kernel Explainer
Meta Explainers ; They select the right explainer to use for a given type of task
o Tabular Explainer
o Text Explainer
o Image Explainer
Model Fairness
Fairlearn is a toolkit to identify and mitigate unfairness in machine learning models