Machine Learning
1. Supervised ML
2. Unsupervised ML
3. Reinforcement ML
1. Business Problem:
Identify the problem that could benefit from ML
2. Business Formulation: Preparing the problem
Identify ML model type (Classification, Regression)
Frame the simplest solution without losing important information
Choosing the data:
How much is it?
Where is it?
Do I have access to it?
Get a domain expert:
An expert for the business case.
Can identify the important features
Decide whether the data is representative for the real world
Evaluate the data quality
Identifying the features and the label
Does the problem need a lot of labeled data?
Identify the metrics:
Model performance metric
Used during model training and evaluation
Business goal metric
Used after model deployment
Measures how well the model is achieving the business goal
3. Data preparation and preprocessing
Data collection and integration
Determine where data comes from
Data Preprocessing and visualization
Design data for the model
4. Model Training and Tuning
5. Model Evaluation:
Testing the model on new data and assess the results
6. Optimization
Data Augmentation:
Modify the data
Feature Engineering:
Create new features
7. Model Deployment
Getting Started
Data Collection
Data Preparation
Reformat the collected data (CSV, JSON, Pickle, etc.) into a tabular format.
Make sure that the data is imported properly by inspecting the first few rows
Understand the data dimensions, column names
Checking for missing data, duplicates, wrong data types
Data Cleaning
Missing Data:
Sources:
Undefined values, collection errors, left joins, etc..
Issues:
Many learning algorithms can't handle missing values.
Makes it hard to interpret a target relationship
Identifying the cause can determine whether to delete or impute them.
What were the mechanisms that caused the missing values?
Is it random and which kind of values are missing?
Are there rows or columns missing that we are not aware of?
Dropping:
Risk of dropping rows:
o Not enough training samples (overfitting)
o May bias sample
Risk of dropping columns:
o May lose information in features (underfitting)
Imputation:
Unit Non-Response:
o Refers to entire rows of missing data
Machine Learning
Page 3 of 28
o Imputation Methods include: Weight-Class Adjustments
Item Non-Response
o Specific cells of a column are missing
o Types:
o MCAR: stands for Missing Completely at Random.
o This happens when missing values are missing
independently from all the features as well as the target.
o MAR: stands for Missing at Random.
o This occurs when the missingness depends on another observed
variable, but is independent of the missing value itself.
o MNAR: stands for Missing Not at Random.
o This is the case where the missingness of a value is
dependent on the value itself.
o Methods:
o Weight-Class Adjustments
o Deductive Imputation
o Mean/Median/Mode Imputation
o Hot-Deck Imputation
o Model-Based Imputation (Regression, Bayesian, etc)
o Proper Multiple Stochastic Regression
o Pattern Submodel Approach
Python libraries:
o SciKit learn:
o sklearn.impute.SimpleImputer
o Advanced methods for Imputation:
o MICE (Multiple Imputation by Chained Equations); in scikit-learn
this functionality is provided by the experimental sklearn.impute.IterativeImputer
o Python (not sklearn) fancyimpute package (KNN impute,
SoftImpute, MICE, etc..)
References:
o A Comprehensive Guide To Data Imputation
o Matt Brems Repo
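A minimal sketch of mean imputation with scikit-learn's SimpleImputer (the small DataFrame below is made up):
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 58000]})

imputer = SimpleImputer(strategy="mean")   # replace each NaN with the column mean
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```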
Inconsistency
Column values with different units
Wrong or unrelated column values
Outliers:
It can:
Add richness to the data
Make accurate predictions more difficult
Indicate that the data point belongs to another column
Types:
Artificial: when the outlier doesn't belong to the real world
o eg: Age is 150
o It needs to be deleted
Natural: when the outlier can be actually genuine
o eg: Salary of CEO vs other employees
o Transform the outlier
o Ex: Use the natural log of each value in the column to reduce the
extreme variation between the values.
o Impute a new value for the outlier
o Ex: Use the mean of that column
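A minimal sketch of the log-transform idea for a natural outlier (the salary column is made up):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [45000, 52000, 48000, 2500000]})  # one extreme but genuine value
df["log_salary"] = np.log(df["salary"])   # compresses the extreme variation between values
```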
Data Preprocessing
Descriptive Statistics:
Categorical vs Numerical stats
Understanding Mean and Median
Encoding Labels for categorical features:
It is converting a categorical variable into a numerical variable.
Ordinal:
The SciKit Learn library's LabelEncoder converts a categorical variable to a
numerical variable that starts at 0 and increases by 1. If applied to a
nominal (unordered) categorical type, this may lead to wrong
computations or wrong usage. So, it is recommended to be used with
the Ordinal type, whose values have an order relationship with each other.
from sklearn.preprocessing import LabelEncoder
loan_enc = LabelEncoder()
y = loan_enc.fit_transform(df['loan_approved'])
Nominal:
While library OneHotEncoder is more likely to be used for Nominal
variables.
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"Fruits": ["Apple","Banana","Banana","Mango","Banana"]})
group_enc = LabelEncoder()  # encode the strings to integers first
num_type = group_enc.fit_transform(df['Fruits'])
type_enc = OneHotEncoder()
type_enc.fit(num_type.reshape(-1,1))
Pandas library has get_dummies() function to do the same as
OneHotEncoder
import pandas as pd
df = pd.DataFrame({"Fruits":
["Apple","Banana","Banana","Mango","Banana"]})
pd.get_dummies(df)
Encoding with many classes:
Define a hierarchy structure.
Try to group the levels by similarity to reduce the overall number of
groups.
Data Visualization
Benefits:
Identify correlation between features.
High correlation between features can sometimes lead to poor model
performance
Visualization techniques:
Scatterplot
Scatterplot with labels identification
Scatterplot matrix
Correlation matrix with Heatmap
Model Training
The goal of training is to create an accurate model that answers the business
question accurately as often as you need it or more.
Algorithms
1. Supervised Learning
1. Classification
1. Binary
1. Linear learner
2. XGBoost
2. Multi-class
1. XGBoost
2. KNN
2. Regression
1. XGBoost
2. KNN
3. Linear learner
4. Factorization machines
3. Recommendation
1. Factorization machines
2. Unsupervised Learning
1. Clustering
1. K-means
2. LDA
2. Topic modeling
1. LDA
3. Embeddings
1. Object2Vec
4. Anomaly detection
1. Random cut forest
2. IP insights
5. Dimensionality reduction
1. PCA
Formatting data
Splitting data
We split the data to avoid overfitting and get generalized performance. The
data is split into three sets:
Training set:
It is used in the model training phase for the model to learn patterns.
It ranges around 80% of the full data
Validation/Evaluation set:
It is used also in the training phase but used to give an estimate of
model performance and/or compare performance across different
models.
It ranges around 10% of the full data.
Testing set:
It is used to evaluate the predictive quality of the model.
It ranges around 10% of the full data.
Python SciKit learn sklearn.model_selection.train_test_split can be used for
splitting
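A minimal sketch of the 80/10/10 split using two calls to train_test_split (the X and y arrays are assumed to exist):
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, then split that portion in half
# to get a 10% validation set and a 10% test set.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)
```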
Cross-validation
Stratified splitting distributes the label classes proportionally across the
training and testing datasets, which is especially useful for imbalanced data.
SciKit Learn library sklearn.model_selection.cross_val_score can be used for
cross validation:
In classification problems, SciKit Learn does "Stratified K-fold Cross-
validation". Stratified cross-validation means that when splitting
the data, the proportions of classes in each fold are made as close as
possible to the actual proportions of the classes in the overall data set.
In regression, SciKit Learn uses regular k-fold cross-validation.
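A minimal sketch with cross_val_score (the X and y data are assumed; for a classifier, scikit-learn picks stratified k-fold automatically):
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold (stratified) accuracy scores
print(scores.mean(), scores.std())
```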
Gradient Descent Variations:
To find the minimum, the derivative of the cost function is set equal to
zero, which is the point where the slope neither increases nor decreases.
Gradient Descent:
o Uses the whole dataset to compute the gradient before each update.
o Can get stuck at a local minimum or fail to reach the global minimum.
Stochastic Gradient Descent (SGD):
o It is the same as gradient descent except that the weights are
updated at every data point.
o It is very fast to converge.
o The drawback is that it is very noisy, in that the steps might
go in several directions.
Mini-Batch Gradient Descent:
o It uses a mini-batch of records, after which the parameters are updated.
o It is slower than SGD but faster than batch Gradient Descent.
o It doesn't consume much memory, similar to SGD.
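A rough NumPy sketch of the mini-batch update for a linear model (purely illustrative; the learning rate and batch size are made up, and batch_size=1 reduces to SGD while batch_size=n gives plain gradient descent):
```python
import numpy as np

def mini_batch_gd(X, y, lr=0.01, epochs=100, batch_size=32):
    # Fit y ≈ X @ w + b by taking one gradient step per mini-batch
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            err = X[batch] @ w + b - y[batch]
            w -= lr * X[batch].T @ err / len(batch)   # gradient step for squared error
            b -= lr * err.mean()
    return w, b
```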
Model Evaluation
Model Metrics
Classification Metrics
Confusion Matrix:
True Positive (TP): When model predicts Positive outcome as Positive
True Negative (TN): When model predicts Negative Outcome as
Negative
False Positive (FP): When model predicts a Negative outcome as Positive
False Negative (FN): When model predicts a Positive outcome as Negative
To determine how well the model performs, there are some metrics:
Accuracy:
Accuracy (also called Score) is the number of correctly labeled rows
divided by the total number of rows in the data set. There are cases
where Accuracy won't work well, such as when there are large class
imbalances in the dataset.
$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$
Precision:
Out of all the items labeled as positive, how many truly belong to the
positive class.
$$\text{Precision} = \frac{TP}{TP+FP}$$
Recall:
Out of all the items that are truly positive, how many were correctly
classified as positive. Or simply, how many positive items were
'recalled' from the dataset.
$$\text{Recall} = \frac{TP}{TP+FN}$$
F1 score:
It helps express precision and recall with a single value:
$$F_1 = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$
AUC - ROC:
AUC: Area-under-curve (degree or measure of separability).
ROC: Receiver-operator characteristic curve (probability curve).
AUC - ROC curve:
A performance measurement for a classification problem at various
threshold settings
An optimal model has an AUC value of 1
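A minimal sketch computing these metrics with scikit-learn (binary classification; a fitted classifier clf and a test split are assumed):
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]] for binary 0/1 labels
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
# AUC needs scores/probabilities rather than hard class labels
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```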
Regression Metrics
$R^2$ (coefficient of determination):
Basically, a standardized version of MSE.
What counts as a good $R^2$ is determined by the actual problem.
$R^2$ always increases when more variables are added to the model, so the
highest $R^2$ may not be the best model.
Adjusted $R^2$
Adjusted $R^2 = 1-(1-R^2)\frac{n-1}{n-k-1}$, where $n$ is the number of
data points and $k$ is the number of variables.
Takes into account of the effect of adding more variables such that it
only increases when the added variables have significant effect in
prediction.
It is a better metric for multiple variates regression.
SciKit-learn: sklearn.metrics.r2_score
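A minimal sketch computing $R^2$ and adjusted $R^2$ (y_test, y_pred, and a 2-D X_test are assumed to exist):
```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
n, k = X_test.shape              # number of data points, number of variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)
```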
Confidence Interval
An average computed on a sample is merely an estimate of the true
population mean.
Confidence interval: Quantifies margin-of-error between sample metric
and true metric due to sampling randomness
Informal interpretation: with x% confidence, true metric lies within the
interval.
Precisely: If the true distribution is as stated, then with x% probability
the observed value is in the interval.
Z-score: Quantifies how much the value is above or below the mean in
terms of its standard deviation
Validation Curve
Learning Curve
It is used to detect whether the model is underfitting or overfitting, and the
impact of the training data size on the error.
It plots the training dataset and validation dataset error or accuracy against
training set size.
scikit-learn: sklearn.learning_curve.learning_curve
Uses stratified k-fold cross-validation by default if output is binary or
multiclass (preserves percentage of samples in each class)
Note: sklearn.model_selection.learning_curve in v0.18
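A minimal sketch of computing learning-curve data (the estimator choice and the X, y data are assumptions):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
# Compare mean training vs validation accuracy at each training-set size
print(train_scores.mean(axis=1), val_scores.mean(axis=1))
```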
Model Debugging
Feature Engineering
Feature Extraction
Principal component analysis (PCA)
It is an unsupervised linear approach to feature extraction
Finds pattern based on correlations between features
Constructs principal components: orthogonal axes in directions of
maximum variance.
scikit-learn: sklearn.decomposition.PCA
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)
Linear discriminant analysis (LDA)
A supervised linear approach to feature extraction
Transforms to subspace that maximizes class separability
Assumes data is normally distributed
Used for dimensionality reduction of features
Can reduce to at most #classes-1 components
scikit-learn: sklearn.discriminant_analysis.LinearDiscriminantAnalysis
Kernel versions of these for fundamentally non-linear data
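A minimal sketch of LDA as a feature extractor (the standardized splits X_train_std, X_test_std and the labels y_train are assumed to exist, as in the PCA example above):
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Supervised: fit_transform needs the class labels; at most (#classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
X_test_lda = lda.transform(X_test_std)
```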
Feature Selection
o Standardization:
o Centering the values around mean $\mu_j = 0$ with standard
deviation $\sigma_j = 1$ for each column.
o This can be achieved by removing the mean from the variable
and dividing it by the standard deviation.
o $$x_{i,j}^* = \frac{x_{i,j} - \mu_j}{\sigma_j}$$
o Advantages:
o Many algorithms behave better with smaller values
o Keeps outlier information, but reduces impact.
o SciKit-learn: sklearn.preprocessing.StandardScaler
o MinMax Scaling:
o It is the transformation of all the features, so they're all on the
same scale between zero and one.
o $$x_{i,j}^* = \frac{x_{i,j} - \min x_j}{\max x_j - \min x_j}$$
o Advantages:
o Robust to small standard deviations
o SciKit-learn: sklearn.preprocessing.MinMaxScaler
o MaxAbs scaling:
o Divides each element by the maximum absolute value in the
feature. $$x_{i,j}^* = \frac{x_{i,j}}{\max (|x_j|)}$$
o Advantages:
o It doesn't destroy sparsity, because the observations are
not centered around any value
o SciKit-learn: sklearn.preprocessing.MaxAbsScaler
o Robust scaling
o It is applied to particular features. $$x_{i}^* = \frac{x_{i} -
Q_{50}(x)}{Q_{75}(x) - Q_{25}(x)}$$
o Advantages:
o Minimizes the impact of large marginal outliers.
o After transformation, it is robust to outliers
o SciKit-learn: sklearn.preprocessing.RobustScaler
o Normalizer:
o It is applied to rows.
o Each row is scaled to have unit norm:
$$x_{i,j}^* = \frac{x_{i,j}}{\lVert x_i \rVert}$$
o SciKit-learn: sklearn.preprocessing.Normalizer
o Rescales $x_j$ to unit norm based on:
o L1 norm
o L2 norm
o Max norm
o It is widely used in text analysis.
Critical aspects to Feature Engineering:
o The scaler is fit to the training data only, then used to transform both
the training and validation data.
o Apply the same scaler object to both training and testing data.
o Train the scaler object on the training data and not on the test
data. If it is trained on the test data, it causes a phenomenon
called Data Leakage, where the training phase has information
that is leaked from the test set (see the sketch after this list).
o The scaling should be applied to real-world numbers, not only to
those available in the dataset. For example: if the dataset has an Age
column that ranges from 20 to 50, this doesn't change the fact
that ages in the real world can range from 0 to 80 or 100. This
should be taken into consideration while scaling, in order to
generalize the model to the real world.
o Scaling is applied differently to each column
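A minimal sketch of the leakage-safe pattern described above (StandardScaler is just one possible choice of scaler):
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same fitted scaler
```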
Polynomial Features
o Generate new features consisting of all polynomial combinations
of the original two features $𝑥_0,𝑥_1$.
o The degree of the polynomial specifies how many variables
participate at a time in each new feature (above example: degree
2).
o This is still a weighted linear combination of features, so it's still a
linear model, and can use same least-squares estimation method
for $w$ and $b$.
o Adding these extra polynomial features allows us a much richer
set of complex functions that we can use to fit to the data.
o Think of this intuitively as allowing polynomials to be fit to the training
data instead of simply a straight line, but using the same least-
squares criterion that minimizes mean squared error.
o We want to transform the data this way to capture interactions
between the original features by adding them as features to the
linear model.
o Polynomial feature expansion with a high degree can lead to
complex models that overfit.
o Polynomial feature expansion is often combined with a
regularized learning method like ridge regression.
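A minimal sketch of polynomial expansion combined with ridge regression (the X_train/X_test splits are assumed to exist):
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)   # adds x0^2, x0*x1, x1^2, ... terms
X_test_poly = poly.transform(X_test)

# Regularization helps keep the richer feature set from overfitting
ridge = Ridge(alpha=1.0).fit(X_train_poly, y_train)
print(ridge.score(X_test_poly, y_test))
```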
Techniques for categorical data:
Ordinal categories:
Convert binary classifications to 0 and 1
Mapping multi categorical features to numerics with the assistance of
the domain expert. For example: mapping Small, Medium, Large to 5,
10, and 20.
Nominal categories:
One Hot Encoding:
o It is creating a binary column for each of the classes in the
feature.
o Pandas: pandas.get_dummies()
Grouping:
o Create a binary column for a group of features together.
Other techniques:
Radial Basis Function
Transform: $f(x) = f(||x - c||)$
Widely used in Support Vector Machines as a kernel and in Radial Basis
Function Networks (RBFNs)
Gaussian RBF is the most common RBF used.
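For reference, one common parameterization of the Gaussian RBF mentioned above is (gamma controls the width of the kernel):
$$\phi(x) = \exp\left(-\gamma \, \lVert x - c \rVert^2\right)$$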
Text-based Features
Bag-of-words model
o Represent document as vector of numbers, one for each word
(tokenize, count, and normalize)
o Note:
o Sparse matrix implementation is typically used, ignores relative
position of words.
o Can be extended to bag of n-grams of words or of characters
Count Vectorizer
o Per-word value is count (also called term frequency)
o Note: Includes lowercasing and tokenization on white space and
punctuation
o scikit-learn: sklearn.feature_extraction.text.CountVectorizer
TfidfVectorizer
o Term-Frequency times Inverse Document-Frequency
o Per-word value is downweighted for terms common across
documents (eg. "the")
o scikit-learn: sklearn.feature_extraction.text.TfidfVectorizer
Hashing Vectorizer
o Stateless mapper from text to term index
o scikit-learn: sklearn.feature_extraction.text.HashingVectorizer
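A minimal sketch of the count and TF-IDF representations (the two-document corpus is made up):
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # made-up corpus

counts = CountVectorizer().fit_transform(docs)   # sparse matrix of term counts
tfidf = TfidfVectorizer().fit_transform(docs)    # common terms like "the" are downweighted
print(counts.toarray())
print(tfidf.toarray())
```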
Bagging/Boosting
Feature extraction and selection are relatively manual processes. Bagging and
boosting are automated or semi-automated approaches to determining which
features to include.
Model Training/Tuning
If training set too small, then Sample and Label more data if possible
If training set biased against or missing some important scenarios, then
Sample and Label more data for those scenarios if possible.
If it is not easy to sample or label more, then consider creating synthetic data
(duplication or techniques like SMOTE)
IMPORTANT: Training data doesn't need to be exactly representative, but your
test set does.
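If synthetic oversampling is an option, a minimal sketch with the third-party imbalanced-learn package might look like this (package availability and the X_train/y_train names are assumptions):
```python
from imblearn.over_sampling import SMOTE  # third-party package: imbalanced-learn

# Create synthetic minority-class samples so the classes are balanced
smote = SMOTE(random_state=0)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```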
Regularization
$\text{cost}_{reg} = \text{cost} + \frac{\alpha}{2}\text{penalty}$
Idea: Large weights corresponds to higher complexity.
Two standard types:
L1 regularization, Lasso
L2 regularization, Ridge
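A minimal sketch of the two penalties in scikit-learn (alpha plays the role of the regularization strength above; scaled training data is assumed):
```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X_train_scaled, y_train)   # L2 penalty on the weights
lasso = Lasso(alpha=0.1).fit(X_train_scaled, y_train)   # L1 penalty, drives some weights to zero
print(ridge.coef_, lasso.coef_)
```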
Hyperparameter tuning
Hyperparameter types:
Model Hyperparameter:
It helps to define the model itself.
Ex: Filter size, pooling, stride, padding
Optimizer Hyperparameter:
It is how the model learns patterns on data
Ex: Gradient Descent, Stochastic Gradient Descent
Data Hyperparameter:
It defines attributes for the data itself
Useful for small/homogenous datasets
Hyperparameters must be optimized separately
Methods for tuning hyperparameters:
Manually:
Manually select Hyperparameters based on one's intuition and
experience.
Often too shallow and inefficient of an approach
Grid Search
Random Search
Bayesian Search
Hyperparameter tuning doesn't always improve the model.
Best practices:
Don't adjust every hyperparameter
Limit range of values to what's most effective.
Run one training job at a time rather than in parallel.
Grid Search
Once all the combinations are evaluated, the model with the set of
parameters which give the top metric is considered to be the best.
GridSearchCV returns the best combination of the hyperparameters, the
best estimator equipped with these best hyperparameters, and will also
report the performance metric of this best estimator.
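A minimal sketch with GridSearchCV (the SVC estimator and the parameter grid are illustrative choices):
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
# grid.best_estimator_ holds the estimator refit with the best combination
```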
Randomized Search
Bayesian Search
The supervised aspect refers to the need for each training example to have a label
in order for the algorithm to learn how to make accurate predictions on future
examples.
This is in contrast to unsupervised machine learning where we don't have labels for
the training data examples, and we'll cover unsupervised learning in a later part of
this course.
Neural Networks
Perceptron
It consists of an input layer, one or more hidden layers, and an output layer.
Each node's output feeds into the inputs of the nodes in the next layer.
K-nearest neighbor
Space complexity and prediction-time complexity grow with size of training
data.
Suffers from the curse of dimensionality: points become increasingly isolated with
more dimensions, for a fixed-size training dataset.
It can be sensitive to small changes in the training data.
It can be used in Python as below:
1. Initiate the classifier
```python
from sklearn.neighbors import KNeighborsClassifier
n_neighbors = 5  # the k hyperparameter
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
```
2. Train the model to memorize all its features and labels
```python
knn.fit(X_train, y_train)
```
3. To predict a label, pass data that has the same number of features as the trained data
```python
knn.predict(X_new)  # X_new: new samples with the same number of features
```
4. The accuracy can be tested by passing testing data and testing labels
```python
knn.score(X_test, y_test)
```
Linear Model
Linear models make strong assumptions about the structure of the data.
The target value can be predicted just using a weighted sum of the input
variables, a linear function.
This can give stable, but potentially inaccurate, predictions.
1. Linear Regression
1. Least Squares:
- The most popular way to estimate w and b parameters is using what's called
least-squares linear regression or ordinary least-squares. Least-squares finds the
values of w and b that minimize the total sum of squared differences between the
predicted y value and the actual y value in the training set. Or equivalently it
minimizes the mean squared error of the model.
- This technique is designed to find the slope, the w value, and the b value of the
y intercept, that minimize this squared error, this mean squared error.
- The mean squared error is the squared difference between predicted and actual
values, summed over all the training points and divided by the number of training
points (that is, their average); that is the mean squared error of the model.
- One thing to note about this linear regression model is that there are no
parameters to control the model complexity. No matter what the value of w and b,
the result is always going to be a straight line. This is both a strength and a
weakness of the model.
```python
from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(X_train,y_train)
# w_i: coefficients
linreg.coef_
# b: the intercept term
linreg.intercept_
# Feature scaling (assumes a scaler object, e.g. MinMaxScaler, was created earlier):
# fit the scaler on the training data only, then transform both splits
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
A smaller value of the regularization parameter C encourages the
classifier to find a large margin on the decision boundary, even if that
decision boundary leads to more points being misclassified.
- **Linear Model Pros**:
- Simple and easy to train.
- Fast prediction.
- Scales well to very large dataset.
- Works well with sparse data.
- Reasons for prediction are relatively easy to interpret.
- **Linear Model Cons**:
- For lower-dimensional data, other models may have superior generalization
performance.
- For classification, data may not be linearly separable.
1. Logistic Regression
It is a kind of generalized linear model.
In spite of being called a regression measure, it is actually used for
classification
Like ordinary least squares and other regression methods, logistic
regression takes a set of input variables, the features, and estimates a
target value.
Unlike ordinary linear regression, in its most basic form logistic
regression's target value is a binary variable instead of a continuous
value.
There are flavors of logistic regression that can also be used in cases
where the target value to be predicted is a multi class categorical
variable, not just binary.
Logistic regression is similar to linear regression, but with one critical
addition. The logistic regression model still computes a weighted sum of
the input features xi and the intercept term b (like in linear regression),
but it runs this result through a special non-linear function f, the logistic
function represented by this new box in the middle of the diagram to
produce the output y. The effect of applying the logistic function is to
compress the output of the linear function so that it's limited to a range
between 0 and 1. Below the diagram, you can see the formula for the
predicted output y hat which first computes the same linear
combination of the inputs xi, model coefficient weights wi hat and
intercept b hat, but runs it through the additional step of applying the
logistic function to produce y hat.
If we pick different values for b hat and the w hat coefficients, we'll get
different variants of this s shaped logistic function, which again is
always between 0 and 1.
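For reference, the formula described above can be written as:
$$\hat{y} = \text{logistic}\left(\hat{b} + \sum_i \hat{w}_i x_i\right) = \frac{1}{1 + e^{-\left(\hat{b} + \sum_i \hat{w}_i x_i\right)}}$$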
To perform logistic regression in Scikit-Learn, you import the LogisticRegression
class from the sklearn.linear_model module, then create the
object and call the fit method using the training data, just as you did for
other classifiers like k-nearest neighbors.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_C2,
y_C2, random_state = 0)
clf = LogisticRegression(C=1).fit(X_train, y_train)
- L2 regularization is 'on' by default (like ridge regression)
- Parameter C controls amount of regularization (default 1.0)
- As with regularized linear regression, it can be important to normalize all features
so that they are on the same scale.
1. Kernelized Support Vector Machines (SVMs)
It is a very powerful extension of linear support vector machines, it can
provide more complex models that can go beyond linear decision
boundaries.
SVMs can be used for both classification and regression.
One way to think about what kernelized SVMs do is that they take the
original input data space and transform it to a new, higher-dimensional
feature space, where it becomes much easier to classify the transformed
data using a linear classifier (e.g. instead of y(x) it becomes y(x, x^2),
like a polynomial feature). The figure above shows, on the right, that the
points can be separated by a straight line after converting to a two-
dimensional space, while on the left are the original one-dimensional
points, for which that straight line corresponds to a parabola.
An example of how it can be done using scikit-learn in Python.
from sklearn.svm import SVC
from adspy_shared_utilities import
plot_class_regions_for_classifier
X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2,
random_state = 0)
# The default SVC kernel is radial basis function (RBF)
#! SVC() = SVC(kernel = 'rbf', gamma=1, C=1)
plot_class_regions_for_classifier(SVC().fit(X_train, y_train), X_train,
y_train, None, None, 'Support Vector Classifier: RBF kernel')
# Compare decision boundaries with polynomial kernel, degree = 3
plot_class_regions_for_classifier(SVC(kernel = 'poly', degree =
3).fit(X_train, y_train), X_train, y_train, None, None, 'Support Vector
Classifier: Polynomial kernel, degree = 3')
Calling the fit method with the training data to train the model.
There is an SVC parameter called kernel, that allows us to set the
kernel function used by the SVM. The polynomial kernel takes
additional parameter degree that controls the model complexity
and the computational cost of this transformation.
Small gamma means a larger similarity radius (giving broader,
smoother decision regions), so that points farther apart are
considered similar, which results in more points being grouped
together. Larger values of gamma give smaller, more complex
decision regions and tightly constrained decision boundaries.
SVMs also have a regularization parameter, C, that controls the
tradeoff between satisfying the maximum margin criterion to find
the simple decision boundary, and avoiding misclassification
errors on the training set. The C parameter is also an important
one for kernelized SVMs, and it interacts with the gamma
parameter.
Pros:
Can perform well on a range of datasets.
Versatile: different kernel functions can be specified, or custom kernels
can be defined for specific data types.
Works well for both low-and high-dimensional data.
Cons:
Efficiency (runtime speed and memory usage) decreases as training set
size increases (e.g. over 50000 samples).
Needs careful normalization of input data and parameter tuning.
Does not provide direct probability estimates (but can be estimated
using e.g. Platt scaling).
Difficult to interpret why a prediction was made.
1. Decision tree
- It can be used for both regression and classification.
- It learns a series of explicit `if then` rules on feature values that result in a
decision that predicts the target value. In this way any given object can be
categorized as either matching the target object the first person is thinking of or
not, according to its features as determined by asking the series of yes or no
questions. We can form these questions into a tree with a node representing one
question and the yes or no possible answers as the left and right branches from
that node that connect the node to the next level of the tree. One question being
answered at each level. At the bottom of the tree are nodes called leaf nodes that
represent actual objects as the possible answers. For any object there's a path from
the root of the tree to that object that is determined by the answers to the specific
yes or no questions at each level.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # max_depth limits growth (pre-pruning)
```
Over Fitting
There is a problem with decision tree which is overfitting, due to its
complexity and essentially memorizing the training data.
One strategy to prevent overfitting is to prevent the tree from
becoming really detailed and complex by stopping its growth early. This
is called pre-pruning.
Another strategy is to build a complete tree with pure leaves but then
to prune back the tree into a simpler form. This is called post-pruning or
sometimes just pruning.
Feature Importance
Another way of analyzing the tree, instead of looking at the whole tree
at once, is to do what's called a feature importance calculation.
It is one of the most useful and widely used types of summary analysis you
can perform on a supervised learning model.
typically a number between 0 and 1 that's assigned to an individual
feature.
It indicates how important that feature is to the overall prediction
accuracy.
A feature importance of zero means that the feature is not used at all in
the prediction. A feature importance of one, means the feature
perfectly predicts the target.
Typically, feature importance numbers are always positive and they're
normalized so they sum to one.
In scikit-learn, feature importance values are stored as an array in an
estimator property called feature_importances_.
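As a small illustration, reusing the iris decision tree fitted above (feature names come from the dataset):
```python
# Feature importances for the decision tree fitted on the iris data above
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(name, round(importance, 3))
```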
Pros:
Easily visualized and interpreted.
No feature normalization or scaling typically needed.
Work well with datasets using a mixture of feature types (continuous,
categorical, binary)
Cons:
Even after tuning, decision trees can often still overfit.
Usually need an ensemble of trees for better generalization
performance.
ML Data Readiness
Productizing a ML Model
Aspects to consider
Model Hosting
Model deployment
Pipeline to provide features vectors
Code to provide low-latency and/or high-volume predictions
Model and data updating and versioning
Quality monitoring and alarming
Data and model security and encryption
Customer privacy, fairness, and trust
Data provider contractual constraints (eg., attribution, cross-fertilization)
Overfitting
The model is too complex and fits noise in the training data. It does well on
the training data but does not generalize well to new test data.
Underfitting
The model is too simple for the actual trends that are present in the data. It
doesn't even do well on the training data and thus, is not at all likely to
generalize well to test data.
Types of Production environments
Batch predictions
Useful if all possible inputs known a priori (eg., all product categories for
which demand is to be forecast, all keywords to bid)
Predictions can still be served real-time, simply read from pre-computed
values
Online Predictions
Online training
Sometimes training data patterns change often, so there is a need to train online (e.g.,
fraud detection)
Business metrics may not be the same as the performance metrics that are
optimized during training. Why? For example, the business metric might be
click-through rate, which is not what the model directly optimizes during training.
Ideally, performance metrics are highly correlated with business metrics.
Storage
Row-oriented formats:
comma/tab-separated values (CSV/TSV)
Read-only DB (RODB): Internal read-only file-based store with fast key-
based access
Avro: allows schema evolution for Hadoop
Column-oriented formats:
Parquet: Type-aware and indexed for Hadoop
Optimized row columnar (ORC): Type-aware, indexed, and with
statistics for Hadoop
User-defined formats:
JavaScript object notation (JSON): For key-value objects
Hierarchical data format 5 (HDF5): Flexible data model with chunks
Compression can be applied to all formats
Usual trade-offs: read/write speeds, size, platform dependency, ability for the
schema to evolve, schema/data separability, type richness
Scikit-learn: Uses the Python pickle method to serialize/deserialize
Python objects
Spark MLlib: Transformers and Estimators implement MLWritable
TensorFlow (deep learning library): Allows saving of MetaGraph
MxNet (deep learning library): Saves into JSON
Model Deployment
Monitoring
Maintenance
Common Mistakes