
ML Unit-2


UNIT- 2

1. Define classification. Explain the general approach to solving a classification problem.


Ans:
Classification:
Classification is a form of data analysis that extracts models describing important data classes. Such
models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can
build a classification model to categorize bank loan applications as either safe or risky. Such analysis
can help provide us with a better understanding of the data at large. Many classification methods have
been proposed by researchers in machine learning, pattern recognition, and statistics.

General Approach for Classification:

• Data classification is a two-step process, consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels
for given data).
• In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier
by analysing or “learning from” a training set made up of database tuples and their associated
class labels.
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute.
• In the second step, the model is used for classification. First, the predictive accuracy of the
classifier is estimated. If we were to use the training set to measure the classifier’s accuracy,
this estimate would likely be optimistic, because the classifier tends to overfit the data. A test
set of tuples that were not used for training is therefore used instead.
• The accuracy rate is the percentage of test set samples that are correctly classified by the model.

Fig: Learning Step and Classification Step


2.Explain in detail about Decision Tree with an example?

Ans:

o Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.
o It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we can use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.

Algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets, one for each possible value of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
each such final node is called a leaf node.

➢ Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
chosen by ASM). The root node splits into the next decision node (distance from the office)
and one leaf node, based on the corresponding labels. That decision node in turn splits into
one decision node (Cab facility) and one leaf node. Finally, the last decision node splits into
two leaf nodes (Accepted offer and Declined offer).
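
The job-offer tree can be written out as nested tests, which makes the roles of decision nodes and leaf nodes concrete. This is only a sketch: the salary and distance thresholds, the function name and the labels on the two intermediate leaves are assumptions, since the example does not state them.

```python
def decide_offer(salary, distance_km, cab_facility):
    """Walk the example tree: Salary -> Distance from office -> Cab facility."""
    # Root node: Salary (the 50,000 threshold is hypothetical).
    if salary < 50_000:
        return "Declined"            # leaf node under the root
    # Decision node: distance from the office (the 10 km threshold is hypothetical).
    if distance_km <= 10:
        return "Accepted"            # leaf node under the distance test
    # Decision node: Cab facility, which ends in two leaf nodes.
    return "Accepted" if cab_facility else "Declined"


print(decide_offer(60_000, 25, True))   # Accepted
print(decide_offer(60_000, 25, False))  # Declined
```

Each `if` corresponds to a decision node and each `return` to a leaf node, so classifying a new candidate means following one path from the root to a leaf.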

3.Explain about attribute selection measures?

Ans:

Attribute selection measures:

• An attribute selection measure is a heuristic for selecting the splitting criterion
that “best” separates a given data partition, D, of class-labelled training tuples into individual
classes.
• It determines how the tuples at a given node are to be split.
• The attribute selection measure provides a ranking for each attribute describing the given
training tuples.
➢ Three popular attribute selection measures are:
1. Information Gain
2. Gain Ratio
3. Gini Index

Information Gain:
• The Information gain is used to select the splitting attribute in each node in the decision tree.
• It follows the method of entropy while aiming at reducing the level of entropy, starting from
the root node to the leaf nodes.
• The attribute with the highest information gain is chosen as the splitting attribute for the
current node.
• It is biased towards attributes with many distinct values (multi-valued attributes).
• The information gained on attribute A is the mutual information that exists between the
attribute Class and attribute A.
It is defined as follows:
❖ Information Gain (A) = H(Class) − H(Class|A)
Gain Ratio:
• It tends to prefer unbalanced splits.
• In such a split, one partition is much smaller than the other partitions.
• The gain ratio on attribute A is the ratio of the information gained on A over the expected
information of A, normalizing uncertainty across attributes.
It is defined as follows:
❖ Gain Ratio (A) = (H(Class) − H(Class|A)) / H(A)
Gini Index:
• The Gini index considers a binary split for each attribute.
• It tends to favour splits that produce partitions of equal size.
• The attribute whose split leaves the minimum impurity, that is, gives the largest reduction in
impurity, is selected as the splitting attribute.
• It is also biased toward multi-valued attributes.
• It has difficulty when the number of classes is large.
• The Gini function measures the impurity of an attribute with respect to classes.

The impurity function is defined as:

❖ Gini (Class) = 1 − Σi pi², where pi is the proportion of tuples that belong to class i

The Gini index of A, defined below, is the difference between the impurity of Class and the average
impurity of A regarding the classes, representing a reduction of impurity over the choice of attribute
A.

❖ Gini Index (A) = Gini (Class) − Σv P(A = v) Gini (Class|A = v)
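
A small sketch of the impurity calculation may help here. It assumes the standard Gini impurity, 1 minus the sum of squared class proportions, and measures the reduction obtained by splitting on an attribute. The toy data and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(values, labels):
    """Impurity of Class minus the weighted average impurity after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(group) / n * gini(group) for group in groups.values())
    return gini(labels) - weighted


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]

print(gini(play))                     # 0.5
print(gini_reduction(outlook, play))  # 0.5
print(gini_reduction(windy, play))    # 0.0
```

The attribute with the largest reduction (here `outlook`) leaves the purest partitions and would be chosen for the split.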

4.Explain information gain with example.


Ans:
Information Gain:
• The Information gain is used to select the splitting attribute in each node in the decision tree.
• It follows the method of entropy while aiming at reducing the level of entropy, starting from
the root node to the leaf nodes.
• The attribute with the highest information gain is chosen as the splitting attribute for the
current node.
• It is biased towards attributes with many distinct values (multi-valued attributes).
• The information gained on attribute A is the mutual information that exists between the
attribute Class and attribute A.
• It is defined as follows:

❖ Information Gain (A) = H(Class) − H(Class|A)

Information Gain = Entropy before splitting − Entropy after splitting


Example:
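
As a minimal worked sketch of the calculation, assume a hypothetical dataset of eight tuples with four "yes" and four "no" class labels, and an attribute A that sends three "yes" and one "no" to value "a" and the remaining tuples to value "b". The function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy before splitting minus the weighted entropy after splitting."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - after


# Hypothetical data: eight tuples, attribute A with values "a" and "b".
attribute_a = ["a", "a", "a", "a", "b", "b", "b", "b"]
class_label = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]

print(round(entropy(class_label), 3))                        # 1.0
print(round(information_gain(attribute_a, class_label), 3))  # 0.189
```

Before the split the entropy is 1 bit (the classes are balanced). Each partition has entropy of about 0.811 bits, so the gain is about 1 − 0.811 = 0.189 bits.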
5.Explain about logistic regression?

Ans:
Logistic Regression:
o Logistic regression is one of the most popular Machine Learning algorithms, and it comes under
the Supervised Learning technique.
o It is used for predicting a categorical dependent variable using a given set of independent
variables.
o The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True
or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values
which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how the two are used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by the two limiting values 0 and 1.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables used for the classification.

Logistic Regression Equation:


• The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
• We know the equation of the straight line can be written as:

y = a0 + a1x1 + a2x2 + ... + anxn

• In Logistic Regression y can be between 0 and 1 only, so let us divide y by (1 − y), which
gives 0 when y = 0 and grows without bound as y approaches 1:

y / (1 − y)

• But we need a range between −∞ and +∞, so we take the logarithm of this ratio, and the equation
becomes:

log[y / (1 − y)] = a0 + a1x1 + a2x2 + ... + anxn
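
The relationship between the straight-line value, the "S"-shaped curve and the log of y / (1 − y) can be checked numerically. The sketch below assumes a single input x and hypothetical coefficients `a0` and `a1`. `sigmoid` maps any real value into (0, 1), and `logit` is its inverse.

```python
import math


def sigmoid(z):
    """Logistic ("S"-shaped) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(y):
    """Log-odds log(y / (1 - y)): maps a probability in (0, 1) back to the real line."""
    return math.log(y / (1.0 - y))


a0, a1 = -4.0, 2.0  # hypothetical intercept and coefficient
for x in (0.0, 2.0, 4.0):
    z = a0 + a1 * x   # straight-line value, unbounded
    y = sigmoid(z)    # predicted probability, between 0 and 1
    print(x, round(y, 3), round(logit(y), 3))
```

For x = 2 the linear value is 0, so the predicted probability is exactly 0.5. Applying `logit` to each probability recovers the original linear value.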

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, the dependent variable can have only two possible
values, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, the dependent variable can have 3 or more
possible unordered values, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic regression, the dependent variable can have 3 or more possible
ordered values, such as "Low", "Medium", or "High".

6.Explain about linear regression?

Ans:

Linear Regression:

➢ Linear regression is one of the easiest and most popular Machine Learning algorithms.
➢ It is a statistical method that is used for predictive analysis. Linear regression makes predictions
for real or numeric variables such as sales, salary, age, product price, etc.
➢ The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression.
➢ Because linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.
➢ The linear regression model provides a sloped straight line representing the
relationship between the variables.
➢ Mathematically, we can represent a linear regression as: y = a0 + a1x + ε

Here:

y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value)
ε = random error

Types of Linear Regression:

Linear regression can be further divided into two types:

Simple Linear Regression:

If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear Regression:

If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line:


A straight line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:

Positive Linear Relationship:

If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then
such a relationship is termed a positive linear relationship.

Negative Linear Relationship:

If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis,
then such a relationship is called a negative linear relationship.

7.Explain about generalized linear models.


Ans:

• The first widely used software package for fitting these models was called GLIM. Because of
this program, "GLIM" became a well-accepted abbreviation for generalized linear models.
• Generalized Linear Models (GLMs) are a family of general machine learning models for supervised
learning problems (both regression and classification).

The following are three popular examples of GLMs:


GLM Advantages

• Easy to interpret
• Can easily be deployed in spreadsheet format
• Handles different response/target distributions
• Is commonly used in insurance ratemaking

GLM Disadvantages

• Does not select features (without stepwise selection)
• Strict assumptions around distribution shape and randomness of error terms
• Predictor variables need to be uncorrelated
• Unable to detect non-linearity directly (although this can manually be addressed through
feature engineering)
• Sensitive to outliers
• Low predictive power

8. Describe Nearest neighbour classification in detail.

Ans:
K-Nearest Neighbour:

➢ K-Nearest Neighbours (KNN) is a simple but very powerful classification algorithm.
➢ It classifies based on a similarity measure.
➢ It is non-parametric.
➢ It is a lazy learning method: it does not “learn” until a test example is given. Whenever we have
new data to classify, we find its K nearest neighbours from the training data.

KNN: Classification Approach


➢ A new data point is classified by a “MAJORITY VOTE” of its neighbours: it is assigned to the most
common class amongst its K nearest neighbours (found by measuring the “distance” between data points).

K Nearest Neighbours — Pseudocode


1. Load the training and test data
2. Choose the value of K
3. For each point in the test data:
   - find the Euclidean distance to all training data points
   - store the Euclidean distances in a list and sort it
   - choose the first K points
   - assign a class to the test point based on the majority class among the chosen points
4. End
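
The pseudocode can be turned into a short runnable sketch using only the Python standard library. The training points, labels and the function name `knn_predict` are hypothetical. Ties in the vote are resolved by whichever class `Counter` lists first, which is an arbitrary choice.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, sorted in ascending order.
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the first k points and take the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2, 2)))  # A
print(knn_predict(train_points, train_labels, (6, 5)))  # B
```

Nothing is computed until a query arrives, which is what "lazy learning" means here: the training data itself is the model.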

9.Write a detailed note on regression trees.

Ans:

Regression trees

➢ Regression trees are decision trees in which the target variables can take continuous values
instead of class labels in leaves. Regression trees use modified split selection criteria and
stopping criteria.
➢ By using a regression tree, you can explain the decisions, identify possible events that might
occur, and see potential outcomes. The analysis helps you determine what the best decision
would be. To divide the data into subsets, regression tree models use nodes, branches, and leaves.

➢ Usage: The main advantage of regression trees is their human-readability. Regression trees
not only predict attribute values of targets, they also explain which attributes are used and
how the attributes are used to reach the predictions.

➢ Functions for regression trees: The regression tree algorithm is implemented in the REGTREE,
GROW_REGTREE, PRUNE_REGTREE, and PREDICT_REGTREE stored procedures. To print
regression trees, use the PRINT_MODEL stored procedure.
Example for regression tree:
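
A minimal sketch of the idea, independent of the stored procedures named above: a regression tree with a single split chooses the threshold that minimises the total squared error, and each leaf predicts the mean target value of its subset. Squared error is a common split criterion but is an assumption here, as are the data and function names.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (the impurity of a leaf)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the total squared error.

    Assumes xs contains at least two distinct values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]


def predict(x, split):
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean


# Hypothetical data with two clear levels of the continuous target.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

split = best_split(xs, ys)
print(split)              # (6.5, 6.0, 21.0)
print(predict(2, split))  # 6.0
```

A full regression tree applies the same search recursively to each subset until a stopping criterion is met, then optionally prunes the result.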

10.Explain about binary classification and related tasks.

Ans:
• In machine learning, binary classification is a supervised learning task in which new
observations are categorized into one of two classes.
• The following are a few binary classification applications, where the 0 and 1 columns are the two
possible classes for each observation:

Application               Observation       0            1
Medical Diagnosis         Patient           Healthy      Diseased
Email Analysis            Email             Not Spam     Spam
Financial Data Analysis   Transaction       Not Fraud    Fraud
Marketing                 Website visitor   Won't Buy    Will Buy
Image Classification      Image             Not Hotdog   Hotdog

Quick example

In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input
features and predict whether the patient is healthy or has the disease. The possible outcomes of the
diagnosis are positive and negative.

Evaluation of binary classifiers

• If the model correctly predicts a diseased patient as positive, this case is called True Positive (TP).

• If the model correctly predicts a healthy patient as negative, this is called True Negative (TN).

• If a diseased patient is classified as healthy by a negative test result, this error is called False
Negative (FN).

• Similarly, if a healthy patient is classified as diseased by a positive test result, this error is called
False Positive (FP).

We can evaluate a binary classifier based on the following parameters:

• True Positive (TP): The patient is diseased and the model predicts "diseased"

• False Positive (FP): The patient is healthy but the model predicts "diseased"

• True Negative (TN): The patient is healthy and the model predicts "healthy"

• False Negative (FN): The patient is diseased and the model predicts "healthy"

After obtaining these values, we can compute the accuracy score of the binary classifier as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
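
A short sketch of how the four counts and the accuracy score can be computed from lists of actual and predicted labels. Accuracy is taken as the fraction of correct predictions, (TP + TN) divided by the total number of observations. The label lists and function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="diseased"):
    """Count TP, FP, TN and FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # diseased, predicted diseased
            else:
                fp += 1   # healthy, predicted diseased
        else:
            if a == positive:
                fn += 1   # diseased, predicted healthy
            else:
                tn += 1   # healthy, predicted healthy
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical diagnoses for eight patients.
actual = ["diseased", "diseased", "diseased", "healthy",
          "healthy", "healthy", "healthy", "diseased"]
predicted = ["diseased", "diseased", "healthy", "healthy",
             "healthy", "diseased", "healthy", "diseased"]

counts = confusion_counts(actual, predicted)
print(counts)             # (3, 1, 3, 1)
print(accuracy(*counts))  # 0.75
```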

In machine learning, many methods can be used for binary classification. The most common are:

• Support Vector Machines

• Naive Bayes

• Nearest Neighbour

• Decision Trees

• Logistic Regression

• Neural Networks
11. Explain in detail about ranking and probability estimation tree.

Ans:
Ranking:

➢ Ranking is a machine learning technique for ordering items. It is often framed as a regression
problem in which a score is predicted for each item.
➢ Ranking is useful for many applications in information retrieval such as ecommerce, social
networks, recommendation systems, and so on.
➢ The ranking technique directly ranks items by training a model to predict the ranking of one
item over another item.
➢ In the training data, items are ranked one over the other by giving each item a "score".
➢ Higher ranked items have higher scores and lower ranked items have lower scores. Using
these scores, a model is built to predict which item ranks higher than the other.

Deterministic ranking algorithms:

A deterministic ranking algorithm is one that always produces the same ordered list for the same
input data. An example of a deterministic ranking algorithm is the rank-by-feature algorithm. In this
algorithm, each item is assigned a rank based on its feature value.

Probabilistic ranking algorithms:

In a probabilistic ranking algorithm, the order of the items in the ranked list is not fixed and depends
on values estimated from the input data. An example of a probabilistic ranking algorithm is the
rank-by-confidence algorithm. In this algorithm, each item is assigned a rank based on its confidence value.

Types of Ranking Algorithms

There are many different types of ranking algorithms, each with its own set of advantages and
disadvantages. Some of the most common types of ranking algorithms are:

Binary Ranking Algorithms: Binary ranking algorithms are the simplest type of ranking algorithm. A
binary ranking algorithm ranks items in a dataset according to their relative importance. The two most
common types of binary ranking algorithms are the rank-by-feature and the rank-by-frequency
algorithms. Rank-by-feature algorithms rank items by the number of features that they have in
common with the reference item. Rank-by-frequency algorithms rank items by the number of times
that they occur in the dataset.

Ranking by Similarity: Ranking by similarity is a type of probabilistic ranking algorithm that ranks items
in a dataset according to their similarity to a reference item.

Ranking by Distance: Ranking by distance algorithms are a type of probabilistic ranking algorithm that
rank items in a dataset according to their distance from a reference item.

Ranking by Preference: Preferential ranking algorithms are a type of probabilistic ranking algorithm
that rank items in a dataset according to their preference for a reference item.

Ranking by Probability: Ranking by probability is a type of probabilistic ranking algorithm that ranks
items in a dataset according to their probability of being a positive example.

Probability estimation trees (PETs)

• Error rate does not consider the probability of the prediction, so in a PET the leaves give a
probability instead of predicting a class.
• Very useful when we do not want just the class, but the examples most likely to belong to a class
(e.g. direct marketing)
• No additional effort in learning a PET compared to DTs
• Requires different evaluation methods
Example:
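
A minimal sketch of how a PET can be used for ranking, assuming each leaf estimates the probability of the positive class as the fraction of positive training examples that reached it. The leaf counts, customer names and function name are hypothetical.

```python
def leaf_probability(positives, negatives):
    """Estimated probability of the positive class at a leaf."""
    return positives / (positives + negatives)


# Hypothetical leaves of a trained tree: name -> (positive count, negative count).
leaves = {"leaf_A": (18, 2), "leaf_B": (5, 5), "leaf_C": (1, 9)}

# Hypothetical customers and the leaf each one falls into.
customers = {"c1": "leaf_B", "c2": "leaf_A", "c3": "leaf_C", "c4": "leaf_A"}

# Score every customer with the probability stored at its leaf, then rank by score.
scores = {name: leaf_probability(*leaves[leaf]) for name, leaf in customers.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores["c2"], scores["c1"], scores["c3"])  # 0.9 0.5 0.1
print(ranking)                                   # ['c2', 'c4', 'c1', 'c3']
```

In a direct marketing setting, the customers at the top of `ranking` would be contacted first. Customers in the same leaf share the same score, so their relative order is arbitrary.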

12. Demonstrate the importance of scoring and ranking in assessing the performance of
classification tasks.
Ans:

➢ Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. It assumes that there is a linear relationship
between the dependent variable and the independent variables. The intuition behind linear
regression is that it tries to find the line of best fit that minimizes the distance between the
actual data points and the predicted values.
➢ One intuition of linear regression is that it can be used to make predictions. For example, if we
have data on the age and height of a group of individuals, we can use linear regression to
predict the height of an individual given their age.
➢ Another intuition of linear regression is that it can be used to identify the strength and direction
of the relationship between the dependent variable and the independent variables. For
example, if we have data on the amount of time spent studying and the grade received on a
test, linear regression can be used to determine whether there is a positive or negative
relationship between the two variables.
➢ Linear regression is sensitive to outliers, which are data points that are significantly different
from the other data points. Outliers can have a large effect on the line of best fit and can cause
the regression to be less accurate. There are several ways to handle outliers in linear
regression, including:
➢ Removing the outliers: One way to handle outliers is to remove them from the data set.
However, this should be done with caution, as removing too many outliers can cause the
regression to lose important information.
➢ Transforming the data: Another way to handle outliers is to transform the data so that the
outliers are less extreme. For example, we can take the logarithm of the data or use a square
root transformation.
➢ Using robust regression: Robust regression methods, such as the Huber loss function or the
least absolute deviation method, are less sensitive to outliers than ordinary least squares
regression.
➢ For example, let's say we are trying to predict the price of a house based on its size in square
feet. We have data on 10 houses, and their prices and sizes are as follows:

Size (square feet)   Price (USD)
1000                 150000
1500                 200000
2000                 250000
2500                 300000
3000                 350000
3500                 400000
4000                 450000
4500                 500000
5000                 550000
5500                 600000

We can use linear regression to model the relationship between the price and the size of the house.
The ten points lie exactly on a straight line, so the regression equation is:

Price = 50000 + 100 * Size

If we add an outlier to the data set, such as a house that is 8000 square feet and priced at 800000 USD,
the regression equation becomes approximately:

Price = 64779.006 + 94.751 * Size

The addition of a single outlier has changed both the slope and the intercept of the line of best fit,
making the regression less accurate for the original ten houses. To handle this outlier, we can remove
it from the data set or use a robust regression method.
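
The fit for the house table can be checked directly with ordinary least squares. The sketch below uses the closed-form formulas for a single predictor; the function name `fit_line` is hypothetical, and the data are the ten houses above plus the 8000 square feet outlier.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


sizes = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500]
prices = [150000, 200000, 250000, 300000, 350000,
          400000, 450000, 500000, 550000, 600000]

intercept, slope = fit_line(sizes, prices)
print(round(intercept, 3), round(slope, 3))            # 50000.0 100.0

# Add the outlier: 8000 square feet priced at 800000 USD.
intercept_out, slope_out = fit_line(sizes + [8000], prices + [800000])
print(round(intercept_out, 1), round(slope_out, 3))    # 64779.0 94.751
```

The outlier sits 50000 USD below the line through the other ten points, and its large size gives it high leverage, which is why a single point moves both coefficients.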

13. Demonstrate the importance of scoring and ranking in assessing the performance of
classification tasks.
Ans:
➢ Scoring and ranking are crucial components in assessing the performance of classification
tasks, as they provide a quantitative and objective way of evaluating the effectiveness of a
classification model.
➢ When evaluating a classification model, it is important to distinguish between two types of
errors: false positives and false negatives. False positives occur when the model predicts a
positive result when the actual result is negative, while false negatives occur when the model
predicts a negative result when the actual result is positive.
➢ Scoring and ranking techniques provide a way to measure the performance of a classification
model by considering both false positives and false negatives. For example, one commonly
used scoring metric is accuracy, which measures the proportion of correct predictions made
by the model. However, accuracy alone does not provide enough information to fully evaluate
the performance of a model.
➢ To better assess the effectiveness of a classification model, various scoring and ranking
techniques can be used, such as precision, recall, F1 score, ROC curves, and AUC. Precision
measures the proportion of true positives among all positive predictions made by the model,
while recall measures the proportion of true positives among all actual positive cases. The F1
score is the harmonic mean of precision and recall, and provides a balance between these two
metrics.
➢ ROC curves and AUC are commonly used in binary classification tasks to evaluate the
performance of a model across different thresholds for the predicted probabilities. ROC curves
plot the true positive rate (TPR) against the false positive rate (FPR) for various threshold
values, while AUC measures the area under the ROC curve. A higher AUC value indicates better
performance of the model in distinguishing between positive and negative cases.
➢ In conclusion, scoring and ranking are critical in assessing the performance of classification
tasks as they provide a more complete evaluation of a model's effectiveness beyond just
accuracy. By using various scoring and ranking techniques, one can better understand the
strengths and weaknesses of a classification model and make informed decisions about its use
in real-world applications.
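
The metrics discussed above can be sketched in a few lines. The scores, labels and the 0.5 threshold are hypothetical. AUC is computed here as the fraction of positive-negative pairs in which the positive example receives the higher score (ties count half), which equals the area under the ROC curve. The sketch assumes at least one predicted positive, one actual positive and one actual negative.

```python
def precision_recall_f1(actual, scores, threshold=0.5):
    """Precision, recall and F1 when scores >= threshold are predicted positive."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= threshold)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auc(actual, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels (1 = positive) and model scores, sorted by score.
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(precision_recall_f1(actual, scores))  # (0.75, 0.75, 0.75)
print(auc(actual, scores))                  # 0.9375
```

Precision, recall and F1 depend on the chosen threshold, while AUC depends only on how the scores rank the examples. This is why ranking quality can be assessed separately from any single cut-off.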
