ML Unit-2
Ans:
o Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules, and each leaf node represents the outcome.
o A decision tree has two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make a decision and have multiple branches, whereas leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with
the root node, which expands into further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer
(Yes/No), further splits the tree into subtrees.
Algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
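The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full CART implementation: it assumes categorical attributes, uses information gain as the ASM, and makes one branch per attribute value (in the style of ID3) rather than CART's binary splits. The function names and the toy job-offer data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a class label (leaf node) or a dict (decision node)."""
    # Leaf: every example has the same class, or no attribute is left to test.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: choose the best attribute according to the ASM.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Step 3: divide the data into subsets, one per value of the best attribute.
    subsets = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = subsets.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    # Steps 4 and 5: create the decision node and recurse on each subset.
    remaining = [a for a in attrs if a != best]
    branches = {value: build_tree(r, l, remaining) for value, (r, l) in subsets.items()}
    return {"attribute": best, "branches": branches}

# Hypothetical toy dataset (S in Step 1), invented for illustration only.
rows = [
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "far"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "far"},
    {"salary": "low", "distance": "near"},
]
labels = ["accept", "accept", "decline", "decline", "decline", "decline"]
tree = build_tree(rows, labels, ["salary", "distance"])
print(tree)
```

On this toy data the salary attribute has the larger information gain, so it becomes the root, and the recursion stops as soon as a subset contains a single class.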
➢ Example: Suppose there is a candidate who has a job offer and wants to decide whether to
accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node, based on the corresponding labels. The next
decision node further splits into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offer and Declined offer).
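The job-offer tree described above can be written down as a nested structure and walked from the root to a leaf. This is only a sketch: the yes/no question names and the choice of which branch leads to which leaf are assumptions made for illustration.

```python
# Hypothetical encoding of the job-offer tree: a dict is a decision node,
# a string is a leaf node. Branch outcomes are assumed, not taken from the notes.
job_offer_tree = {
    "question": "salary_ok",
    "no": "Declined offer",
    "yes": {
        "question": "office_near",
        "no": "Declined offer",
        "yes": {
            "question": "cab_facility",
            "no": "Declined offer",
            "yes": "Accepted offer",
        },
    },
}

def decide(tree, candidate):
    """Follow the answers of one candidate from the root node down to a leaf."""
    node = tree
    while isinstance(node, dict):          # still at a decision node
        answer = "yes" if candidate[node["question"]] else "no"
        node = node[answer]                # move along the chosen branch
    return node                            # reached a leaf node

candidate = {"salary_ok": True, "office_near": True, "cab_facility": False}
print(decide(job_offer_tree, candidate))
```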
Ans:
• An attribute selection measure is a heuristic for selecting the splitting criterion
that “best” separates a given data partition, D, of class-labelled training tuples into individual
classes.
• It determines how the tuples at a given node are to be split.
• The attribute selection measure provides a ranking for each attribute describing the given
training tuples.
➢ Three measures are commonly used for attribute selection:
1. Information Gain
2. Gain Ratio
3. Gini Index
Information Gain:
• Information gain is used to select the splitting attribute at each node of the decision tree.
• It is based on entropy and aims to reduce the level of entropy, starting from
the root node and moving towards the leaf nodes.
• The attribute with the highest information gain is chosen as the splitting attribute for the
current node.
• It is biased towards multi-valued attributes.
• The information gained on attribute A is the mutual information that exists between the
attribute Class and attribute A.
It is defined as follows:
❖ Information Gain (A) = H(Class) − H(Class|A)
Gain Ratio:
• It tends to prefer unbalanced splits.
• In such splits, one partition is much smaller than the other partitions.
• The gain ratio on attribute A is the ratio of the information gained on A over the expected
information of A, normalizing uncertainty across attributes.
It is defined as follows:
❖ Gain Ratio (A) = [H(Class) − H(Class|A)] / H(A)
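To see why the normalization by H(A) matters, the sketch below computes both measures for two attributes of a small dataset. The data is hypothetical: `row_id` takes a different value on every tuple (an extreme multi-valued attribute), while `weather` has only two values.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H (in bits) of a list of values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def information_gain(attribute, classes):
    """H(Class) - H(Class|A), with A given as one value per tuple."""
    groups = {}
    for a, c in zip(attribute, classes):
        groups.setdefault(a, []).append(c)
    conditional = sum(len(g) / len(classes) * entropy(g) for g in groups.values())
    return entropy(classes) - conditional

def gain_ratio(attribute, classes):
    """Information gain divided by H(A), the entropy of the attribute itself."""
    h_a = entropy(attribute)
    return information_gain(attribute, classes) / h_a if h_a > 0 else 0.0

# Hypothetical data, invented for illustration.
classes = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
row_id = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
weather = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy", "sunny", "sunny"]

for name, attr in (("row_id", row_id), ("weather", weather)):
    print(name, round(information_gain(attr, classes), 3), round(gain_ratio(attr, classes), 3))
```

Information gain prefers `row_id`, because splitting on a unique identifier makes every partition pure, while the gain ratio prefers `weather`, because the large H(A) of `row_id` penalizes it.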
Gini Index:
• The Gini index considers a binary split for each attribute.
• It tends to favour tests that result in equal-sized partitions.
• The attribute whose split gives the minimum Gini index (the lowest weighted impurity) is
selected as the splitting attribute.
• It is also biased toward multi-valued attributes.
• It has difficulty when the number of classes is large.
• The Gini function measures the impurity of an attribute with respect to the classes.
The Gini index of A is the difference between the impurity of Class and the average
impurity of A regarding the classes, representing the reduction of impurity achieved by choosing
attribute A:
❖ Gini (A) = Gini(Class) − Σ P(A = a) · Gini(Class | A = a), where Gini(S) = 1 − Σ p_i^2
Minimizing the weighted impurity of the split is the same as maximizing this reduction.
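A small sketch of the Gini computation for a binary split follows. It assumes the usual impurity definition, 1 minus the sum of squared class proportions; the function names and the two candidate splits are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_reduction(left, right):
    """Impurity of the parent minus the size-weighted impurity of a binary split."""
    parent = left + right
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    return gini(parent) - weighted

# Two hypothetical binary splits of the same 8 tuples (4 "yes", 4 "no").
mixed_split = (["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"])
pure_split = (["yes", "yes", "yes", "yes"], ["no", "no", "no", "no"])

print(gini_reduction(*mixed_split))  # smaller reduction: partitions are still impure
print(gini_reduction(*pure_split))   # larger reduction: both partitions are pure
```

The split with the larger reduction (equivalently, the lower weighted impurity) would be the one chosen.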
Ans:
Logistic Regression:
o Logistic regression is one of the most popular Machine Learning algorithms, and it comes under
the Supervised Learning technique.
o It is used for predicting a categorical dependent variable using a given set of independent
variables.
o The outcome must be a categorical or discrete value. It can be either Yes or No, 0
or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic
values that lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression
except in how they are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function,
whose output is bounded by the two limiting values 0 and 1.
o Logistic Regression is a significant machine
learning algorithm because it can provide probabilities and classify new data
using both continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification.
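• The logistic regression equation can be obtained from the linear regression equation. Start
from the equation of a straight line: y = a0 + a1x1 + a2x2 + ... + anxn
• In Logistic Regression y can be between 0 and 1 only, so we divide y by (1-y). The ratio
y/(1-y) is 0 when y = 0 and tends to infinity as y approaches 1.
• But we need a range from -infinity to +infinity, so we take the logarithm, and the equation
becomes: log[y/(1-y)] = a0 + a1x1 + a2x2 + ... + anxn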
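The relationship between the "S"-shaped function and the logarithm of y/(1-y) can be checked numerically. In this sketch the coefficients `a0` and `a1` are hypothetical values chosen only for illustration; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic ("S"-shaped) function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(y):
    """Logarithm of y / (1 - y): maps a probability back to the whole real line."""
    return math.log(y / (1.0 - y))

a0, a1 = -4.0, 2.0  # hypothetical intercept and coefficient

for x in (0.0, 2.0, 4.0):
    z = a0 + a1 * x          # linear part, can be any real number
    y = sigmoid(z)           # probability between 0 and 1
    print(x, round(y, 4), round(log_odds(y), 4))  # log-odds recovers the linear part z
```

Taking the log-odds of the predicted probability returns the linear expression, which is why the model is linear on the log-odds scale even though its output is a probability.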
On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic Regression, there can be only two possible values of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered
values of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered values of
the dependent variable, such as "Low", "Medium", or "High".
Ans:
Linear Regression:
➢ Linear regression is one of the easiest and most popular Machine Learning algorithms.
➢ It is a statistical method that is used for predictive analysis. Linear regression makes predictions
for real or numeric variables such as sales, salary, age, product price, etc.
➢ The linear regression algorithm shows a linear relationship between a dependent (y) variable
and one or more independent (x) variables, hence the name linear regression.
➢ Because linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.
➢ The linear regression model provides a sloped straight line representing the
relationship between the variables.
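➢ Mathematically, we can represent a linear regression as: y = a0 + a1x + ε
Here:
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line
a1 = linear regression coefficient (slope of the line)
ε = random error
Linear regression can be further divided into two types of algorithm:
• Simple Linear Regression: a single independent variable is used to predict the dependent variable.
• Multiple Linear Regression: more than one independent variable is used to predict the dependent variable.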
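For a single independent variable, a0 and a1 in y = a0 + a1x + ε can be estimated with the closed-form least-squares formulas. The sketch below assumes ordinary least squares and uses a small hypothetical dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1

# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a0, a1 = fit_line(xs, ys)
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]  # the error term for each point
print(round(a0, 2), round(a1, 2))
```

The residuals play the role of ε: they are what is left over after the fitted line has explained as much of y as it can.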
Generalized Linear Models (GLM):
• The first widely used software package for fitting these models was called GLIM. Because of
this program, "GLIM" became a well-accepted abbreviation for generalized linear models.
• Generalized Linear Models are a family of general machine learning models for supervised
learning problems (both regression and classification).
GLM Advantages
• Easy to interpret
• Can easily be deployed in spreadsheet format
• Handles different response/target distributions
• Is commonly used in insurance ratemaking
GLM Disadvantages
Ans:
K-Nearest Neighbour:
Ans:
Regression trees
➢ Regression trees are decision trees in which the target variables can take continuous values
instead of class labels in leaves. Regression trees use modified split selection criteria and
stopping criteria.
➢ By using a regression tree, you can explain the decisions, identify possible events that might
occur, and see potential outcomes. The analysis helps you determine what the best decision
would be.
To divide the data into subsets, regression tree models use nodes, branches, and leaves.
➢ Usage: The main advantage of regression trees is their human-readability. Regression trees
not only predict attribute values of targets, they also explain which attributes are used and
how the attributes are used to reach the predictions.
➢ Functions for regression trees: The regression tree algorithm is implemented in the REGTREE,
GROW_REGTREE, PRUNE_REGTREE, and PREDICT_REGTREE stored procedures. To print
regression trees, use the PRINT_MODEL stored procedure.
Example for regression tree:
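A minimal sketch of how one split of a regression tree might be chosen is given here. It assumes a single numeric feature and a split criterion that minimizes the total squared error around the mean of each side, which is one common choice; it is not the REGTREE stored procedure, and the data is hypothetical.

```python
def sse(values):
    """Sum of squared errors of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Try thresholds between consecutive sorted x values; keep the lowest total SSE.

    Returns (threshold, left_mean, right_mean); each mean is the value a leaf would predict.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Hypothetical data: the target jumps between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```

A full regression tree would apply the same search recursively to each side until a stopping criterion is met, and each leaf would predict the mean target value of its subset.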
Ans:
• In machine learning, binary classification is a supervised learning task that categorizes
new observations into one of two classes.
• The following are a few binary classification applications, where the 0 and 1 columns are two
possible classes for each observation:
Application              Observation       0            1
Medical Diagnosis        Patient           Healthy      Diseased
Financial Data Analysis  Transaction       Not Fraud    Fraud
Marketing                Website visitor   Won't Buy    Will Buy
Image Classification     Image             Not Hotdog   Hotdog
Quick example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input
features and predict whether the patient is healthy or has the disease. The possible outcomes of the
diagnosis are positive and negative.
• If the model correctly predicts a diseased patient as positive, this case is called True Positive (TP).
• If the model correctly predicts a healthy patient as negative, this is called True Negative (TN).
• If a diseased patient is classified as healthy by a negative test result, this error is called False
Negative (FN).
• Similarly, if a healthy patient is classified as diseased by a positive test result, this error is called
False Positive (FP).
In summary:
• True Positive (TP): The patient is diseased and the model predicts "diseased"
• False Positive (FP): The patient is healthy but the model predicts "diseased"
• True Negative (TN): The patient is healthy and the model predicts "healthy"
• False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
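The four counts and the accuracy score can be computed from lists of actual and predicted labels. This is a small sketch; the function names and the eight patients are hypothetical.

```python
def confusion_counts(actual, predicted, positive="diseased"):
    """Count TP, FP, TN and FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # diseased, predicted diseased
            else:
                fp += 1   # healthy, predicted diseased
        else:
            if a == positive:
                fn += 1   # diseased, predicted healthy
            else:
                tn += 1   # healthy, predicted healthy
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical diagnoses for eight patients.
actual = ["diseased", "diseased", "diseased", "healthy", "healthy", "healthy", "healthy", "healthy"]
predicted = ["diseased", "diseased", "healthy", "healthy", "healthy", "healthy", "diseased", "healthy"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn))
```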
In machine learning many methods utilize binary classification. The most common are
• Naive Bayes
• Nearest Neighbour
• Decision Trees
• Logistic Regression
• Neural Networks
11. Explain in detail about ranking and probability estimation tree.
Ans:
Ranking:
A deterministic ranking algorithm is one in which the order of the items in the ranked list is fixed and
does not change, regardless of the input data. An example of a deterministic ranking algorithm is the
rank-by-feature algorithm. In this algorithm, each item is assigned a rank based on its feature value.
In a probabilistic ranking algorithm, the order of the items in the ranked list may vary, depending on
the input data. An example of a probabilistic ranking algorithm is the rank-by-confidence algorithm.
In this algorithm, each item is assigned a rank based on its confidence value.
There are many different types of ranking algorithms, each with its own set of advantages and
disadvantages. Some of the most common types of ranking algorithms are:
Binary Ranking Algorithms: Binary ranking algorithms are the simplest type of ranking algorithm. A
binary ranking algorithm ranks items in a dataset according to their relative importance. The two most
common types of binary ranking algorithms are the rank-by-feature and the rank-by-frequency
algorithms. Rank-by-feature algorithms rank items by the number of features that they have in
common with the reference item. Rank-by-frequency algorithms rank items by the number of times
that they occur in the dataset.
Ranking by Similarity: Ranking by similarity is a type of probabilistic ranking algorithm that ranks items
in a dataset according to their similarity to a reference item.
Ranking by Distance: Ranking by distance algorithms are a type of probabilistic ranking algorithm that
rank items in a dataset according to their distance from a reference item.
Ranking by Preference: Preferential ranking algorithms are a type of probabilistic ranking algorithm
that rank items in a dataset according to their preference for a reference item.
Ranking by Probability: Ranking by probability is a type of probabilistic ranking algorithm that ranks
items in a dataset according to their probability of being a positive example.
Probability estimation trees (PETs):
• Error rate does not consider the probability of the prediction, so in a PET the leaves give a
class probability instead of predicting a class.
• Very useful when we do not want just the class, but the examples most likely to belong to a class (e.g.
direct marketing).
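The idea of a PET leaf returning a probability, and of ranking examples by that probability, can be sketched as follows. The leaf counts and customer names are hypothetical, and the Laplace correction (adding 1 to each class count) is an assumption often used to smooth estimates from small leaves; it is not stated in the notes above.

```python
def leaf_probability(n_pos, n_neg, laplace=True):
    """Estimated probability of the positive class in a leaf.

    With laplace=True the counts are smoothed: (n_pos + 1) / (n_pos + n_neg + 2).
    """
    if laplace:
        return (n_pos + 1) / (n_pos + n_neg + 2)
    return n_pos / (n_pos + n_neg)

# Hypothetical leaves: (positive, negative) training examples that reached each leaf.
leaf_counts = {"leaf_a": (8, 2), "leaf_b": (1, 9), "leaf_c": (3, 0)}

# Hypothetical customers and the leaf each one falls into.
customers = {"alice": "leaf_b", "bob": "leaf_a", "carol": "leaf_c"}

scores = {name: leaf_probability(*leaf_counts[leaf]) for name, leaf in customers.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most likely positive first
print(ranking, scores)
```

In a direct-marketing setting, the customers at the top of such a ranking would be contacted first, rather than everyone the tree labels as positive.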
➢ Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. It assumes that there is a linear relationship
between the dependent variable and the independent variables. The intuition behind linear
regression is that it tries to find the line of best fit that minimizes the distance between the
actual data points and the predicted values.
➢ One intuition of linear regression is that it can be used to make predictions. For example, if we
have data on the age and height of a group of individuals, we can use linear regression to
predict the height of an individual given their age.
➢ Another intuition of linear regression is that it can be used to identify the strength and direction
of the relationship between the dependent variable and the independent variables. For
example, if we have data on the amount of time spent studying and the grade received on a
test, linear regression can be used to determine whether there is a positive or negative
relationship between the two variables.
➢ Linear regression is sensitive to outliers, which are data points that are significantly different
from the other data points. Outliers can have a large effect on the line of best fit and can cause
the regression to be less accurate. There are several ways to handle outliers in linear
regression, including:
➢ Removing the outliers: One way to handle outliers is to remove them from the data set.
However, this should be done with caution, as removing too many outliers can cause the
regression to lose important information.
➢ Transforming the data: Another way to handle outliers is to transform the data so that the
outliers are less extreme. For example, we can take the logarithm of the data or use a square
root transformation.
➢ Using robust regression: Robust regression methods, such as the Huber loss function or the
least absolute deviation method, are less sensitive to outliers than ordinary least squares
regression.
➢ For example, let's say we are trying to predict the price of a house based on its size in square
feet. We have data on 10 houses, and their prices and sizes are as follows:
We can use linear regression to model the relationship between the price and the size of the house.
The regression equation is:
If we add an outlier to the data set, such as a house that is 8000 square feet and priced at 800000 USD,
the regression equation becomes:
The addition of the outlier has dramatically changed the slope of the line of best fit, making the
regression less accurate. To handle this outlier, we can remove it from the data set or use a robust
regression method.
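The effect of a single outlier on the slope can be reproduced with a short sketch. The ten house sizes and prices below are hypothetical (generated from an exact line so the clean slope is known); only the outlier of 8000 square feet at 800000 USD comes from the example above. Ordinary least squares is assumed.

```python
def fit_line(sizes, prices):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: 10 houses priced at exactly 50000 + 150 * size (USD).
sizes = [1000 + 200 * i for i in range(10)]
prices = [50000 + 150 * s for s in sizes]

_, slope_clean = fit_line(sizes, prices)
# Add the outlier: 8000 square feet priced at 800000 USD.
_, slope_outlier = fit_line(sizes + [8000], prices + [800000])

print(slope_clean, slope_outlier)
```

Because the outlier lies far from the other sizes, it has high leverage: one point out of eleven pulls the fitted slope well below the 150 USD per square foot that describes the other ten houses.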
13. Demonstrate the importance of scoring and ranking in assessing the performance of
classification tasks.
Ans:
➢ Scoring and ranking are crucial components in assessing the performance of classification
tasks, as they provide a quantitative and objective way of evaluating the effectiveness of a
classification model.
➢ When evaluating a classification model, it is important to distinguish between two types of
errors: false positives and false negatives. False positives occur when the model predicts a
positive result when the actual result is negative, while false negatives occur when the model
predicts a negative result when the actual result is positive.
➢ Scoring and ranking techniques provide a way to measure the performance of a classification
model by considering both false positives and false negatives. For example, one commonly
used scoring metric is accuracy, which measures the proportion of correct predictions made
by the model. However, accuracy alone does not provide enough information to fully evaluate
the performance of a model.
➢ To better assess the effectiveness of a classification model, various scoring and ranking
techniques can be used, such as precision, recall, F1 score, ROC curves, and AUC. Precision
measures the proportion of true positives among all positive predictions made by the model,
while recall measures the proportion of true positives among all actual positive cases. The F1
score is the harmonic mean of precision and recall, and provides a balance between these two
metrics.
➢ ROC curves and AUC are commonly used in binary classification tasks to evaluate the
performance of a model across different thresholds for the predicted probabilities. ROC curves
plot the true positive rate (TPR) against the false positive rate (FPR) for various threshold
values, while AUC measures the area under the ROC curve. A higher AUC value indicates better
performance of the model in distinguishing between positive and negative cases.
➢ In conclusion, scoring and ranking are critical in assessing the performance of classification
tasks as they provide a more complete evaluation of a model's effectiveness beyond just
accuracy. By using various scoring and ranking techniques, one can better understand the
strengths and weaknesses of a classification model and make informed decisions about its use
in real-world applications.
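The metrics discussed above can be computed directly from labels and predicted scores. This is a sketch on hypothetical data: the threshold of 0.5 is an assumption, and AUC is computed through its equivalent pairwise form, the fraction of (positive, negative) pairs in which the positive example receives the higher score. It assumes both classes are present and at least one positive prediction is made.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def auc(labels, scores):
    """Area under the ROC curve, as the share of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels (1 = positive) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

threshold = 0.5  # assumed decision threshold
predicted = [1 if s >= threshold else 0 for s in scores]
tp = sum(1 for l, p in zip(labels, predicted) if l == 1 and p == 1)
fp = sum(1 for l, p in zip(labels, predicted) if l == 0 and p == 1)
fn = sum(1 for l, p in zip(labels, predicted) if l == 1 and p == 0)

print(precision_recall_f1(tp, fp, fn), auc(labels, scores))
```

Precision, recall and F1 depend on the chosen threshold, while AUC depends only on how the scores rank the examples, which is why the two kinds of metric complement each other.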