Introduction To Statistical Machine Learning
Unit – I
• Statistical modeling is the application of statistics to data to find underlying hidden relationships by analyzing the significance of the variables.
• Zero-one loss is L0-1 = 1(m <= 0); in zero-one loss, the value of the loss is 0 for m > 0 whereas it is 1 for m <= 0.
• The difficult part of this loss is that it is non-differentiable and non-convex, and minimizing it directly is NP-hard.
• Surrogate losses are used in machine learning in place of the zero-one loss.
• The types are:
• Squared loss (for regression)
• Hinge loss (SVM)
• Logistic/log loss (logistic regression)
Types of Losses
• Squared loss is a loss function that can be used in the learning setting in which we are predicting a real-valued variable y given an input variable x.
• The hinge loss is a loss function used for training classifiers, most notably the SVM. A negative distance from the boundary incurs a high hinge loss; this essentially means that we are on the wrong side of the boundary and that the instance will be classified incorrectly.
• Log loss is the most important classification metric based on probabilities. For any given problem, a lower log loss value means better predictions. Mathematical interpretation: log loss is the negative average of the log of the corrected predicted probabilities for each instance. Log loss indicates how close the prediction probability is to the corresponding actual/true value (0 or 1 in the case of binary classification); the more the predicted probability diverges from the actual value, the higher the log loss value.
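A minimal NumPy sketch of these losses, assuming a margin m = y * f(x) for the classification losses (the function and variable names are illustrative):

import numpy as np

def zero_one_loss(m):
    # 1 when the margin m = y * f(x) is <= 0, else 0
    return np.where(m <= 0, 1.0, 0.0)

def squared_loss(y, y_hat):
    # (y - y_hat)^2 for a real-valued prediction y_hat
    return (y - y_hat) ** 2

def hinge_loss(m):
    # max(0, 1 - m); grows linearly on the wrong side of the boundary
    return np.maximum(0.0, 1.0 - m)

def log_loss(y, p):
    # negative log of the corrected probability: p if y = 1, 1 - p if y = 0
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

margins = np.array([-1.0, 0.5, 2.0])
print(zero_one_loss(margins))   # [1. 0. 0.]
print(hinge_loss(margins))      # [2.  0.5 0. ]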
The following Python code splits the data into training and the remaining
data. The remaining data will be further split into validation and test datasets:

from sklearn.model_selection import train_test_split

def data_split(dat, trnr, vlnr):   # 'data_split' is an assumed name; only the body appears in the slides
    # First split: training data versus the remaining data
    tr_data, rmng = train_test_split(dat, train_size=trnr, random_state=42)
    # Second split: remaining data into validation and test datasets
    vl_data, ts_data = train_test_split(rmng, train_size=vlnr, random_state=45)
    return (tr_data, vl_data, ts_data)
Train, validation, and test data
Implementation of the split function on the original data to create three datasets (with 50 percent, 25 percent, and 25 percent splits) is as follows:
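A minimal sketch of such a call, assuming the data_split function above and a DataFrame named original_data (a placeholder name); taking 50 percent for training and then half of the remainder for validation yields the 50/25/25 proportions:

# 50% train; half (0.5) of the remaining 50% goes to validation -> 25%/25%
train_data, validation_data, test_data = data_split(original_data, trnr=0.5, vlnr=0.5)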
Predict using the best parameters of grid search:
• >>> y_pred = grid_search.predict(X_test)
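For context, a grid_search object of this kind can be built with scikit-learn's GridSearchCV; the estimator, parameter grid, and data names below are illustrative assumptions rather than the ones used in the slides:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative estimator and parameter grid
estimator = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10]}
grid_search = GridSearchCV(estimator, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)   # X_train, y_train are assumed training arrays
print(grid_search.best_params_)     # best parameters found by the search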
• Supervised learning: This is where an instructor provides feedback to a student on whether they have performed well in an examination or not. Here a target variable is present, and models are tuned to predict it. Many machine learning methods fall into this category:
• Classification problems
• Logistic regression
• Lasso and ridge regression
• Decision trees (classification trees)
• Bagging classifier
• Random forest classifier
• Boosting classifier (adaboost, gradient boost, and xgboost)
• SVM classifier
• Recommendation engine
• Regression problems
• Linear regression (lasso and ridge regression)
• Decision trees (regression trees)
• Bagging regressor
• Random forest regressor
• Boosting regressor - (adaboost, gradient boost, and xgboost)
• SVM regressor
• Unsupervised learning: Similar to the teacher-student analogy, except that here the instructor is not present to provide feedback, and the student needs to prepare on his/her own. Unsupervised learning does not have as many methods as supervised learning:
• Principal component analysis (PCA)
• K-means clustering
• Reinforcement learning: This is the scenario in which multiple decisions need to be taken by an agent prior to reaching the target, and the environment provides a reward, either +1 or -1, rather than notifying how well or how badly the agent performed across the path:
• Markov decision process
• Monte Carlo methods
• Temporal difference learning
Logistic regression:
• This is used for problems in which outcomes are discrete classes rather than continuous values.
• For example, whether a customer will arrive or not, whether he will purchase the product or not, and so on.
• In statistical methodology, it uses the maximum likelihood method to estimate the parameters of the individual variables.
• In contrast, in machine learning methodology, log loss is minimized with respect to the β coefficients (also known as weights).
• Logistic regression has a high bias and a low variance error
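A minimal scikit-learn sketch of this idea (the data names are assumptions); LogisticRegression fits the β coefficients by minimizing a regularized log loss:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

clf = LogisticRegression()          # minimizes (regularized) log loss over the weights
clf.fit(X_train, y_train)           # assumed feature matrix and 0/1 labels
probs = clf.predict_proba(X_test)   # predicted class probabilities
print(log_loss(y_test, probs))      # lower log loss means better predictions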
Linear regression:
• This is used for the prediction of continuous variables such as customer income and so on.
• In statistical methodology, it utilizes error minimization to fit the best possible line.
• However, in machine learning methodology, squared loss will be minimized with respect to β coefficients.
• Linear regression also has a high bias and a low variance error
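The squared-loss view can be written out directly; this NumPy sketch (with made-up toy data) recovers the β coefficients, here an intercept and a slope, via least squares:

import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.random(100)
y = 2 * x + 1 + 0.1 * rng.standard_normal(100)

# Design matrix with an intercept column; beta minimizes ||y - X @ beta||^2
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [1, 2]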
Bagging:
• This is an ensemble technique applied on decision trees in order to minimize the variance error and
at the same time not increase the error component due to bias.
• In bagging, various samples are selected, each with a subsample of observations and all variables (columns); individual decision trees are subsequently fit independently on each sample, and the results are later ensembled by taking the maximum vote (in regression cases, the mean of the outcomes is calculated).
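A minimal scikit-learn sketch (the data names are assumptions); BaggingClassifier's default base learner is a decision tree, and max_features=1.0 keeps all columns for every tree, as described above:

from sklearn.ensemble import BaggingClassifier

# 100 trees, each fit on a bootstrap sample of rows but all columns
bag = BaggingClassifier(n_estimators=100, max_features=1.0, random_state=42)
bag.fit(X_train, y_train)           # assumed training arrays
print(bag.score(X_test, y_test))    # majority-vote accuracy on assumed test data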
Random forest:
• In bagging, all the variables/columns are selected for each sample, whereas in random forest only a random subset of the columns is selected.
• The reason for selecting a few variables rather than all is that, in each independently sampled tree, the significant variables would always come first in the top layer of splitting, which makes all the trees look more or less similar and defeats the sole purpose of the ensemble.
• Random forest has both low bias and variance errors
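An illustrative scikit-learn sketch (the data names are assumptions); max_features='sqrt' is the column subsampling that distinguishes random forest from plain bagging:

from sklearn.ensemble import RandomForestClassifier

# Each split considers only sqrt(n_features) candidate columns
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
rf.fit(X_train, y_train)            # assumed training arrays
print(rf.score(X_test, y_test))     # accuracy on assumed test data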
Boosting:
• This is a sequential algorithm applied to weak classifiers, such as a decision stump (a one-level decision tree, or a tree with one root node and two terminal nodes), to create a strong classifier by ensembling the results.
• The algorithm starts with equal weights assigned to all the observations, followed by subsequent iterations in which more focus is given to misclassified observations by increasing their weight and decreasing the weight of properly classified observations.
• In the end, all the individual classifiers are combined to create a strong classifier. Boosting might have an overfitting problem, but by carefully tuning the parameters, we can obtain the best possible machine learning model.
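An illustrative AdaBoost sketch in scikit-learn (the data names are assumptions); the default weak learner is exactly the decision stump described above, and each iteration reweights the observations:

from sklearn.ensemble import AdaBoostClassifier

# Default base learner: a depth-1 decision tree (a decision stump)
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)           # assumed training arrays
print(ada.score(X_test, y_test))    # accuracy on assumed test data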
Support vector machines (SVMs):
• This maximizes the margin between classes by fitting the widest possible hyperplane between them.
• In the case of non-linearly separable classes, it uses kernels to move observations into a higher-dimensional space and then separates them linearly with a hyperplane there.
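An illustrative scikit-learn sketch (the data names are assumptions); the RBF kernel handles the non-linearly separable case by implicitly mapping observations into a higher-dimensional space:

from sklearn.svm import SVC

# C trades off margin width against training errors; 'rbf' is the non-linear kernel
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)           # assumed training arrays
print(svm.score(X_test, y_test))    # accuracy on assumed test data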
Recommendation engine:
• This utilizes a collaborative filtering algorithm to identify high-probability items for its respective users, who have not used them in the past, by considering the tastes of similar users who have used those particular items.
• It uses the alternating least squares (ALS) methodology to solve this problem.
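A toy NumPy sketch of ALS (entirely illustrative, with a made-up ratings matrix); the user and item factor matrices are solved alternately, each by regularized least squares, and their product predicts the missing ratings:

import numpy as np

# Toy user-item ratings; 0 marks an unrated (missing) entry
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
mask = R > 0                       # observed entries only
k, lam = 2, 0.1                    # latent factors and regularization strength
rng = np.random.default_rng(42)
U = rng.random((R.shape[0], k))    # user factors
V = rng.random((R.shape[1], k))    # item factors

for _ in range(20):                # alternate the two least-squares solves
    for i in range(R.shape[0]):    # fix V, solve for each user's factors
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k), Vi.T @ R[i, mask[i]])
    for j in range(R.shape[1]):    # fix U, solve for each item's factors
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k), Uj.T @ R[mask[:, j], j])

print(np.round(U @ V.T, 1))        # predicted ratings, including the unrated cells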
K-means clustering:
• This is an unsupervised algorithm that is mainly utilized for segmentation exercises.
• K-means clustering classifies the given data into k clusters in such a way that within-cluster variation is minimal and across-cluster variation is maximal.
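An illustrative scikit-learn sketch with synthetic data; inertia_ is the within-cluster variation that the algorithm minimizes:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)          # cluster assignment for each observation
print(km.inertia_)                  # total within-cluster sum of squares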