ML Answers Updated
Suppose you are implementing a Polynomial Regression algorithm for a task and
identify that the learning curves show a large gap between the training error and
validation error. What is happening in this case? [100 Words]
Polynomial regression is employed for datasets whose underlying relationship is non-linear. If we use plain linear regression on non-linear data points, the loss of the predictive model increases and the accuracy decreases significantly. In polynomial regression, we model the expected value of the dependent variable as a polynomial function of the independent variable:
\hat{y} = b + w_1 x + w_2 x^2 + \dots + w_P x^P
The polynomial terms are themselves strongly correlated; for instance, when x is uniformly distributed on [0, 1], the correlation between x and x^2 is roughly 0.97, which is why multicollinearity is a concern in such models.
If, after applying polynomial regression, we observe a large gap between the training error and the validation error, we may infer that the model is overfitting the training data: it has high variance, fitting noise in the training set rather than the underlying pattern, and therefore fails to generalise to the validation set. To mitigate this risk and reduce the gap between the training and validation errors, any one or a combination of the following steps may be employed (a minimal sketch of plotting such learning curves follows this list):
If there is a considerable gap between the training and validation errors, the training sample may not be free from sampling bias. We may draw a stratified random sample of the data set and then re-check the learning curves.
We may smooth the model by using regularization, thereby reducing its variance. This makes the model behave more consistently on unseen, real-world data and reduces the overfit that has developed.
The degree of the polynomial could be increased if underfitting of the data persists.
Employing regularization methods such as dropout (in neural-network models).
Testing the data in smaller batches.
Increasing the number of iterations used when evaluating the accuracy on the training data.
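The diagnosis itself is usually made by plotting learning curves. The sketch below is a minimal illustration, assuming scikit-learn, matplotlib and a synthetic noisy sine dataset (all assumptions, not taken from the question): a training curve that stays far below the validation curve as the training set grows indicates high variance.

```python
# Minimal sketch: learning curves for a polynomial-regression pipeline.
# The synthetic data, degree=10 and CV settings are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.randn(200)      # noisy non-linear target

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(train_sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(train_sizes, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("training set size")
plt.ylabel("mean squared error")
plt.legend()
plt.show()   # a persistent wide gap between the two curves suggests overfitting
```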
The following regularization methods can be employed to reduce the gap between the training and the validation errors. In the cost functions below:
y = dependent (target) variable
M = number of training instances
P = number of features
x_ij = value of feature j for instance i
w = coefficients (w_0 being the slope term)
b = intercept
Ridge regression:- Ridge regression shrinks the coefficients, which in turn reduces the effect of collinearity in the model. It does so by adding to the cost function a penalty proportional to the sum of the squared magnitudes of the coefficients (an L2 penalty). Thus, our cost function becomes:
J(w, b) = \sum_{i=1}^{M} \Big( y_i - \sum_{j=0}^{P} w_j x_{ij} - b \Big)^2 + \lambda \sum_{j=0}^{P} w_j^2
As λ approaches zero, the cost function above reduces to that of ordinary (unregularized) linear regression.
Lasso regression:- Lasso regression is similar to ridge regression and differs only in the way the penalty is imposed on the cost function. In lasso regression, instead of squaring the magnitudes of the coefficients, their absolute values are used (an L1 penalty). Therefore, our cost function looks like:
J(w, b) = \sum_{i=1}^{M} \Big( y_i - \sum_{j=0}^{P} w_j x_{ij} - b \Big)^2 + \lambda \sum_{j=0}^{P} |w_j|
The practical difference between lasso and ridge regression is that ridge regression reduces overfitting by shrinking all the coefficients towards (but not exactly to) zero, whereas lasso can drive some coefficients exactly to zero, giving the additional flexibility of reducing or controlling the number of features used.
Elastic Net Regularization:- Elastic net regularization refers to applying both the lasso and the ridge penalties to the linear cost function. This lets us shrink the coefficients towards zero while retaining groups of correlated features that lasso alone might arbitrarily discard.
Thus, elastic net regularization = Ridge + Lasso
The cost function would now be:-
J(w, b) = \sum_{i=1}^{M} \Big( y_i - \sum_{j=0}^{P} w_j x_{ij} - b \Big)^2 + \lambda_1 \sum_{j=0}^{P} |w_j| + \lambda_2 \sum_{j=0}^{P} w_j^2
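As a hedged illustration of how these three penalties behave in practice, the sketch below fits all three regularized models on a synthetic polynomial-regression problem using scikit-learn; the data, polynomial degree and alpha values are assumptions chosen only for demonstration.

```python
# Minimal sketch comparing the three regularization methods discussed above.
# Synthetic data, degree=10 and the alpha values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

for reg in (Ridge(alpha=1.0),                        # L2 penalty: shrinks coefficients
            Lasso(alpha=0.01, max_iter=50_000),      # L1 penalty: can zero some out
            ElasticNet(alpha=0.01, l1_ratio=0.5,     # mix of L1 and L2
                       max_iter=50_000)):
    model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          StandardScaler(), reg)
    model.fit(X, y)
    coefs = model[-1].coef_                          # coefficients of the final step
    n_zero = int(np.sum(np.isclose(coefs, 0.0)))
    print(f"{type(reg).__name__:10s} zeroed coefficients: {n_zero}/{coefs.size}")
```

In practice the regularization strength (alpha and, for elastic net, l1_ratio) would be tuned by cross-validation rather than fixed by hand.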
2. Review and reflect on your understanding of kernel tricks in SVM. [200 Words]
Use a literature review to demonstrate where and how the kernel tricks improved or
Before understanding kernel tricks, we need to understand the basics of support vector machines. Support vector machines are supervised machine learning approaches used mainly for classification. They work by finding a separating hyperplane such that the margin, i.e. the distance between the hyperplane and the nearest data points, is maximized. This may sound simple, but it becomes tricky when the data are not separable in their original space and must be mapped into a higher-dimensional space, and this is where kernel tricks prove useful.
For example, suppose two points x and y in 3 dimensions need to be mapped into a 9-dimensional space; carrying out the mapping explicitly makes the calculations tedious:
\Phi(x) = (x_1^2, x_1x_2, x_1x_3, x_2x_1, x_2^2, x_2x_3, x_3x_1, x_3x_2, x_3^2)^T
\Phi(y) = (y_1^2, y_1y_2, y_1y_3, y_2y_1, y_2^2, y_2y_3, y_3y_1, y_3y_2, y_3^2)^T
\Phi(x)^T \Phi(y) = \sum_{i,j=1}^{3} x_i x_j y_i y_j
These calculations become complex quickly, and this is where the kernel trick comes into play. Applied to the situation above, it gives the same result by simply squaring the dot product x^T y computed in the original 3-dimensional space. Thus, kernel tricks let us work as if the data had been transformed while avoiding most of the computational complexity:
K(x, y) = (x^T y)^2
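The equivalence can be checked numerically. The short sketch below (using NumPy; the particular vectors are arbitrary assumptions) confirms that the dot product of the explicit 9-dimensional feature maps equals (x^T y)^2 computed directly in 3 dimensions.

```python
# Numerical check of the kernel trick: explicit feature-map dot product
# versus the kernel evaluated directly in the original space.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map R^3 -> R^9: all pairwise products v_i * v_j."""
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

explicit = phi(x) @ phi(y)     # computed in the mapped 9-dimensional space
kernel = (x @ y) ** 2          # kernel trick: stays in the original 3-D space

print(explicit, kernel)        # both evaluate to 1024.0
```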
There are various types of kernel functions, such as the polynomial kernel,
K(x, y) = (x^T y + c)^d
where x and y are the input vectors and d is the degree of the polynomial, and the radial basis function (RBF) kernel,
K(x1, x2) = exp(-||x1 - x2||^2 / (2σ^2))
where σ^2 is the variance (bandwidth) of the kernel.
We have ample options when it comes to kernel functions for decreasing the complexity of mapping data points into higher dimensions. However, it is critical to choose the kernel function according to the type of data that needs to be mapped. Moreover, the kernel and its hyperparameters must be chosen with care so that we do not overfit the data, leaving the algorithm unable to respond accurately to new real-world data sets.
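One common, pragmatic way to make this choice is to compare candidate kernels by cross-validation before committing to one. The sketch below is a hedged illustration using scikit-learn's SVC on a synthetic two-moons dataset; the dataset and the parameter grid are assumptions, not taken from the cited studies.

```python
# Illustrative kernel selection by cross-validated grid search.
# Dataset and parameter grid are assumptions chosen for demonstration only.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV over kernels and parameters
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```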
Kernel functions are employed in a wide range of applications across diverse fields. One application is detecting termite-infested areas in forests and farms from acoustic signals, i.e. the energy and entropy of the signal derived within a stipulated time window (Achirul Nanda et al., 2018). The hyperplane (decision) function was found to be:
f(x_i) = \sum_{n=1}^{N} \alpha_n y_n K(x_n, x_i) + b
where \alpha_n are the Lagrange multipliers.
The kernel function was employed in this problem to separate the acoustic signals emitted by the termites from the signals emitted by the surrounding environment. Similarly, support vector machines combined with the radial basis function kernel are being used for the early detection of cancer by analysing gene expression data from patients (Huang et al., 2018).
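The decision function quoted above maps directly onto what fitted SVM implementations expose. As a hedged sketch (synthetic data and a gamma value chosen only for illustration), scikit-learn's SVC stores alpha_n * y_n in dual_coef_ and b in intercept_, so the sum over support vectors can be reproduced by hand:

```python
# Reconstructing f(x) = sum_n alpha_n * y_n * K(x_n, x) + b from a fitted SVC.
# The synthetic data and gamma=0.5 are assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

x_new = X[:5]
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)  # K(x_n, x_i) for each support vector
manual = clf.dual_coef_ @ K + clf.intercept_            # sum over support vectors plus bias
print(np.allclose(manual.ravel(), clf.decision_function(x_new)))  # True
```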
References:
Achirul Nanda, M., Boro Seminar, K., Nandika, D. and Maddu, A., 2018. A comparison study of kernel
functions in the support vector machine and its application for termite detection. Information, 9(1), p.5.
Huang, S., Cai, N., Pacheco, P.P., Narrandes, S., Wang, Y. and Xu, W., 2018. Applications of support
vector machine (SVM) learning in cancer genomics. Cancer genomics & proteomics, 15(1), pp.41-51.
3. Based on a literature study, discuss in detail the ensemble learning algorithm(s) for decision
tree modelling. [250 Words]
Show examples from literature how these techniques have improved performance in
different domains. [250 Words]
Decision tree models use a number of inputs to predict the value of a single desired output. They follow a structure similar to a flow chart: each internal node encodes a decision, and each branch an outcome of that decision, which gives the model its tree-like nature. Decision trees are widely used in data mining. They use measures such as information gain to check the homogeneity of the data points at a node and thereby split them into different classes. For a node m with data Q and a candidate split θ, the quality of the split is
G(Q, θ) = (N_left / N_m) H(Q_left(θ)) + (N_right / N_m) H(Q_right(θ))
where θ is the candidate threshold, N_m is the number of samples at node m, and H is an impurity measure such as entropy or the Gini index.
However, the computation becomes much more complex as the number of data points grows, and a single deep tree becomes liable to overfitting, so the model may fail to give accurate results on real-world data. To overcome these issues, various ensemble methods are used that apply bagging and boosting to decision tree models. Bagging (bootstrap aggregating) trains many trees on random subsets of the training data drawn with replacement and aggregates their predictions, with the aim of reducing variance and making the model more robust. In contrast, boosting trains a sequence of weak learners, each one concentrating on the examples that the previous ones misclassified, and combines them into a strong model, with the aim of reducing bias.
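A minimal sketch of both ideas, assuming scikit-learn and its built-in breast cancer dataset (the dataset, estimator counts and 5-fold cross-validation are assumptions for illustration), is given below; typically the bagged and boosted ensembles outperform a single tree.

```python
# Bagging and boosting of decision trees compared against a single tree.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "bagging of trees": BaggingClassifier(n_estimators=200, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)     # 5-fold cross-validated accuracy
    print(f"{name:22s} mean accuracy = {scores.mean():.3f}")
```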
These ensemble techniques have proved useful in diverse fields such as satellite imagery and medical research. One implementation of ensemble techniques such as boosting and bagging has been to classify tissue samples based on training data consisting of gene expression profiles. These techniques have also proved useful for identifying the risk of diabetes at a very early stage, by combining attributes such as age, weakness, obesity and muscle stiffness with classifier techniques, so that timely diagnosis becomes possible. They have likewise been helpful in distinguishing features in satellite imagery.
Ensemble methods have further proved useful in identifying disaster-prone areas, such as areas prone to landslides or floods. Hybrid ensemble models combining techniques such as frequency ratio, logistic regression, SVM, weights of evidence and firefly algorithms are being deployed to better identify flood-prone areas by studying the factors that influence susceptibility to such calamities together with historical records, in order to extract patterns and relationships (Pham et al., 2021).
Figure 1: Flash flood susceptibility maps in Vietnam produced using geospatial data.
Figure 2: Quantitative analysis of the above flood susceptibility maps.
References:
Pham, B.T., Jaafari, A., Van Phong, T., Yen, H.P.H., Tuyen, T.T., Van Luong, V., Nguyen, H.D., Van Le,
H. and Foong, L.K., 2021. Improved flood susceptibility mapping using a best first decision tree integrated
with ensemble learning techniques. Geoscience Frontiers, 12(3), p.101105.
Ghiasi, M.M. and Zendehboudi, S., 2021. Application of decision tree-based ensemble learning in the
classification of breast cancer. Computers in Biology and Medicine, 128, p.104089.
4. Prepare a case study example task where an unsupervised clustering algorithm need to be
used. Give all the details involved in this task including features, model parameters,
learning/training algorithm and other relevant details. [250 Words]*
*This is not a programming task. The idea is to demonstrate the theoretical understanding
of the algorithm not the coding skills.
Based on learnings from the literature, critically analyse your choice of the clustering
algorithm and parameters for the same. Justify how these choices are expected to work
better than others. [350 Words]
Suppose you own a delivery agency that has been operating at full capacity across major metropolitan cities in India. However, you have very little knowledge of who your customers really are. What are the attributes of the customers who use your services regularly, and of those who rarely use them? For which product category do customers employ your services the most? Which product category has been driving your business the least? What is the average basket size for which customers use your services, and what are the most and least common modes of payment? What channel do customers use to reach you, and how do they interact with you throughout the process? Are we able to correctly identify their met and unmet needs?
It is apparent from the questions above that each and every customer is unique in his or her own way. Therefore, there is no single, definite set of attributes with which we can describe all customers. So, how would you segment your customers based on their purchasing and interaction habits using machine learning?
Some data points are explicitly available to us, while other attributes remain unknown. For instance, if a customer orders healthcare products, we as a delivery company would not have access to the exact product ordered owing to patient confidentiality, i.e. some of the data is unlabelled. At this juncture, unsupervised machine learning, specifically k-means clustering, comes to our rescue. The approach would be as follows:
The first step would be to identify the most relevant metrics for our business and
available data about our customers. Some of the parameters can be as follows:-
1. Average number of orders
2. Basket size
3. Product category
4. Customer demographics
5. Channel used by customer
6. Mode of payment
7. Acceptance rate
8. Return rate
9. Cumulative spending per customer
10. Count of distinct product categories purchased, and so on.
The next step would be to choose the number of clusters k and divide the customers into initial groups (initial centroids).
Apply k-means clustering to the available data: each customer is assigned to the nearest centroid, and the centroids are updated iteratively so as to minimise the within-cluster sum of squared distances.
After visualising the resulting clusters, we would be able to segment the customer data much more efficiently, revealing anomalies and similarities in purchasing habits and what the customers really want.
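A hedged sketch of this segmentation step is shown below, assuming scikit-learn and pandas; the feature names, the synthetic data, the scaling step and the final choice of k = 4 are all assumptions used purely for illustration.

```python
# Illustrative k-means customer segmentation with an elbow check for k.
# Feature names, synthetic values and k=4 are assumptions, not real data.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
customers = pd.DataFrame({
    "avg_orders_per_month": rng.poisson(4, 500),
    "avg_basket_size": rng.gamma(2.0, 300.0, 500),
    "return_rate": rng.beta(1, 9, 500),
    "distinct_categories": rng.randint(1, 12, 500),
})

X = StandardScaler().fit_transform(customers)      # put features on a comparable scale

# Elbow method: inertia (within-cluster sum of squared distances) for each k
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(customers.assign(segment=segments).groupby("segment").mean())
```

Each segment's mean feature profile can then be read off to decide, for example, which group contains frequent high-value customers and which contains occasional users.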
K-means clustering is expected to work better here than other unsupervised learning approaches for the following reasons:
We can fix the number of clusters beforehand, which matters because it is difficult to design separate strategies for a large number of clusters with minimal differences between them.
It is relatively easy to implement compared with other unsupervised learning algorithms.
With appropriate feature scaling (and variants of the algorithm), it generalises reasonably well to clusters of different shapes and sizes.
K-means adapts easily to large datasets and still provides accurate results.
For a task such as classifying emails as spam or not spam with logistic regression, the initial step would be to cleanse the training data and remove irrelevant words from the emails that are to be provided as the training dataset.
The second step would be lemmatization, where different forms of the same word are treated as a single entity.
Next, employ the machine learning classification algorithm, which is logistic regression in our case.
Finally, assess the accuracy of the algorithm employed.
The output of the machine learning algorithm would be binary, i.e., either spam or not spam, while the predicted probability of an email being spam lies between 0 and 1.
The logit (log-odds) of the probability πi that email i is spam is logit(πi) = ln(πi / (1 − πi)), which the model expresses as a linear function of the input features.
The accuracy of the classifier can be assessed using different methods, the most popular of which is the confusion matrix, which tabulates predictions against true labels, where:
TP = True Positive
FP = False Positive
FN = False Negative
TN = True Negative
The following metrics could be employed to assess the performance of the classifier (a short sketch computing them follows this list):
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
Specificity: TN/(TN+FP)
F1 SCORE: 2*Precision*Recall/(Precision+Recall)
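The sketch below is a hedged illustration of this evaluation step, assuming scikit-learn and a synthetic feature matrix standing in for features extracted from emails (both assumptions, not part of the original task description).

```python
# Logistic-regression spam classifier evaluated with a confusion matrix
# and the metrics listed above. Synthetic data stands in for email features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in for extracted email features; label 1 = spam, 0 = not spam (assumed).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("precision:", round(precision_score(y_te, pred), 3),
      "recall:", round(recall_score(y_te, pred), 3),
      "F1:", round(f1_score(y_te, pred), 3))
```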
The appropriate performance metrics differ between classification algorithms and tasks. For this task, a high accuracy (for example, 95% or above) might be targeted from the logistic regression model; if the accuracy falls below the chosen threshold, the data should be checked for biases and the pipeline for functional errors.