Supervised Learning Algorithm

Supervised machine learning can be used for classification and prediction problems. Classification predicts discrete class labels while prediction predicts continuous values. Some typical applications include credit approval, target marketing, medical diagnosis, and treatment effectiveness analysis. The classification process involves constructing a model from a training dataset, and then using that model to classify new data. Accuracy is often used to evaluate classification methods.


Machine Learning

Supervised Learning
Classification vs. Prediction
• Supervised learning may be used for both classification and
  prediction

Classification and Prediction

• What is classification? What is prediction?

• Classification:
  – predicts categorical class labels (discrete or nominal),
    e.g. yes/no, treatment A/B/C, safe/risky, etc.
  – classifies data (constructs a model) based on the training set
    and the values (class labels) of a classifying attribute, and
    uses the model to classify new data
Classification vs. Prediction cont..
• Prediction
  – models continuous-valued functions, i.e., predicts unknown or
    missing values
  – e.g. the safe amount of a loan, or the amount a customer will
    spend on a purchase

• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Process Steps

• Model construction
  • describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class,
    as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees,
    or mathematical formulae
Example Dataset

Age          Income   Student  Credit_rating  Buys_Computer
youth        high     no       fair           no
youth        high     no       excellent      no
middle_aged  high     no       fair           yes
senior       medium   no       fair           yes
senior       low      yes      fair           yes
senior       low      yes      excellent      no
middle_aged  low      yes      excellent      yes
youth        medium   no       fair           no
youth        low      yes      fair           yes
senior       medium   yes      fair           yes
youth        medium   yes      excellent      yes
middle_aged  medium   no       excellent      yes
middle_aged  high     yes      fair           yes
senior       medium   no       excellent      no
Classification Process Steps cont..
• Model usage
  • for classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the
      classified result from the model
    • Accuracy rate is the percentage of test set samples that are
      correctly classified by the model
    • The test set is independent of the training set, otherwise
      over-fitting will occur
  – If the accuracy is acceptable, use the model to classify data
    tuples whose class labels are not known
Model Construction: Phase 1

Training phase: the classification algorithm takes the training data
and produces the classifier (model).

Training Data:

NAME      RANK                 experience  senior
VPS       Assistant Professor  17          yes
PK Goyal  Associate            7           yes
Shubhas   Professor            2           yes
Denendra  Associate            7           yes
Dinesh    Assistant            6           no
Kapil     Associate            3           no

Learned classifier (model):
IF rank = 'professor' OR experience > 6
THEN senior = 'yes'
Use the Model in Prediction: Phase 2
• Testing Phase: the classifier is applied to the testing data, then
  to unseen data.

Testing Data:

NAME    RANK                 experience  Senior
Rk      Assistant Professor  2           no
Dimple  Associate            7           no
KV      Professor            8           yes
Rajesh  Assistant            7           no

Unseen Data: (RP, Professor, 4) -> Senior?
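To make the two phases concrete, here is a minimal Python sketch (the
names and structure are my own, not from the slides) that encodes the
learned rule, estimates accuracy on the testing data above, and then
classifies the unseen tuple:

```python
test_data = [
    # (name, rank, experience, known_senior)
    ("Rk",     "assistant professor", 2, "no"),
    ("Dimple", "associate",           7, "no"),
    ("KV",     "professor",           8, "yes"),
    ("Rajesh", "assistant",           7, "no"),
]

def predict_senior(rank, experience):
    """The model learned in Phase 1: IF rank = 'professor' OR experience > 6."""
    return "yes" if rank == "professor" or experience > 6 else "no"

# Accuracy rate: percentage of test samples whose known label matches
# the model's prediction.
correct = sum(1 for _, rank, exp, label in test_data
              if predict_senior(rank, exp) == label)
print(f"accuracy = {correct / len(test_data):.0%}")   # 50% on this tiny test set

# Classify the unseen tuple (RP, Professor, 4):
print(predict_senior("professor", 4))                 # -> 'yes'
```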
Evaluating Classification Methods
• Accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Example Dataset
(The buys_computer training data shown earlier, repeated on the slide.)
Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• E.g., decision making: who buys a computer?
A Decision Tree for "buys_computer"

• Example

age?
├─ youth       → student?
│                ├─ no  → no
│                └─ yes → yes
├─ middle_aged → yes
└─ senior      → credit_rating?
                 ├─ excellent → no
                 └─ fair      → yes
Decision Trees cont..
Decision tree induction
– Many algorithms
  • Hunt's Algorithm – one of the earliest methods
  • ID3 and its successor C4.5; CART; SLIQ; SPRINT

• Greedy strategy
  • Split the records based on an attribute test that optimizes a
    certain criterion
• Issues
  • Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  • Determine when to stop splitting
Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down recursive
    divide-and-conquer manner
  • The algorithm is called with three parameters: D,
    attribute_list, and Attribute_selection_method
  – At the start, all the training examples are at the root node N
  • Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or
    statistical measure (e.g., information gain)
  – Node N is labeled with the splitting criterion and a branch is
    grown from node N for each outcome of the criterion
Algorithm for Decision Tree Induction cont..
Splitting: how to specify the attribute test condition?

 Depends on attribute types

• Discrete

• Continuous

Depends on number of ways to split

• Binary split

• Multi-way split
Algorithm for Decision Tree Induction cont..

Conditions for stopping partitioning

• All samples for a given node belong to the same class

• There are no remaining attributes for further partitioning

– majority voting is employed for classifying the leaf

• There are no samples left
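One way to sketch the basic algorithm and these stopping conditions in
Python (illustrative names; `select_attribute` stands in for the
Attribute_selection_method, e.g. the information gain defined next):

```python
from collections import Counter

def build_tree(D, attribute_list, select_attribute):
    """Greedy top-down decision-tree induction.
    D is a list of (attribute_dict, label) pairs."""
    labels = [label for _, label in D]

    # Stopping condition 1: all samples belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no remaining attributes -> majority voting.
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]

    # Label node N with the best splitting attribute; grow one branch
    # per attribute value, partitioning D recursively.
    best = select_attribute(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    node = {best: {}}
    for v in {x[best] for x, _ in D}:
        Dv = [(x, label) for x, label in D if x[best] == v]
        node[best][v] = build_tree(Dv, remaining, select_attribute)
    # (Stopping condition 3, an empty partition, would be labeled with
    # the majority class of D; it cannot arise here because the branch
    # values are taken from D itself.)
    return node
```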


Attribute Selection Measure

• Determine the best split?

Information gain
• Select the attribute with the highest information gain
• Constructs a simple tree
• The expected information needed to classify a tuple in D is given by

  Info(D) = -Σi=1..m pi log2(pi)

  where pi is the probability that an arbitrary tuple in D belongs to
  class Ci and is estimated by |Ci,D|/|D|.
Information gain
• After partitioning D on attribute A into v partitions, the
  information we still need to classify a tuple is

  InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)

• Information gain is defined as

  Gain(A) = Info(D) - InfoA(D)
Example
• Let's consider the buys_computer example
• There are two distinct classes (that is, m = 2). Let class C1
  correspond to yes and class C2 correspond to no
• The expected information needed to classify a tuple in D (9 yes,
  5 no) is

  Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Example cont..
• Now compute the expected information requirement for each attribute.
• Let's start with the attribute age (writing I(x, y) for the
  information of a partition with x yes and y no tuples):

  Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)
             = 0.694 bits

• Hence, the gain in information from such a partitioning would be

  Gain(age) = Info(D) - Infoage(D) = 0.940 - 0.694 = 0.246 bits
Example cont..

• Similarly, Gain(income) = 0.029 bits,
• Gain(student) = 0.151 bits, and
• Gain(credit_rating) = 0.048 bits.

• Because age has the highest information gain among the attributes,
  it is selected as the splitting attribute.
• Continue recursively to grow the tree until the stopping conditions
  are met.
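As a check on these numbers, a small sketch (helper names are my own)
that computes Info(D) and the gains directly from the Example Dataset:

```python
from math import log2

# The buys_computer training data from the Example Dataset slide;
# each row is (age, income, student, credit_rating, buys_computer).
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    i, n = attrs[attr], len(rows)
    info_a = 0.0
    for v in {r[i] for r in rows}:
        part = [r for r in rows if r[i] == v]
        info_a += len(part) / n * info(part)
    return info(rows) - info_a

print(f"Info(D) = {info(data):.3f}")          # 0.940
for a in attrs:
    print(f"Gain({a}) = {gain(data, a):.3f}")
# Gain(age) = 0.246, Gain(income) = 0.029,
# Gain(student) = 0.151, Gain(credit_rating) = 0.048
```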
Example cont..
• So the node is labeled age and branches are grown for each value of
  the attribute
Example cont..

• For continuous-valued attributes, the midpoint between each pair of
  adjacent (sorted) values is considered as a possible split point:

  (ai + ai+1) / 2

• For each possible split point of A we calculate InfoA(D)
• The point with the minimum expected information requirement for A
  is selected as the split_point of A
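A minimal sketch of generating the candidate split points (the
attribute values here are illustrative):

```python
# Candidate split points for a continuous attribute A are the
# midpoints between adjacent values in sorted order.
values = sorted([70, 75, 78, 80, 85, 90])   # illustrative values of A
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints)   # [72.5, 76.5, 79.0, 82.5, 87.5]
# Each midpoint is evaluated with InfoA(D); the one with the minimum
# expected information requirement becomes the split_point of A.
```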
Attribute Selection Measure cont..

Gini Index (CART)
• Considers a binary split for each attribute
• The Gini index measures the impurity of D as

  Gini(D) = 1 - Σi=1..m pi²

• To determine the best binary split on A, consider all possible
  subsets of A's values (excluding the full set and the empty set)
• For discrete-valued attributes, the subset that gives the minimum
  Gini index for that attribute is selected as its splitting subset.
Gini Index cont..
• Compute the weighted sum of the impurities of the resulting
  partitions as

  GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

• For continuous-valued attributes, the split point giving the
  minimum Gini index for a given attribute is taken as its
  split-point
• The reduction in impurity is

  ΔGini(A) = Gini(D) - GiniA(D)

• The attribute that maximizes the reduction in impurity (or,
  equivalently, has the minimum Gini index) is selected as the
  splitting attribute
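A sketch of both formulas, reusing `data` and `attrs` from the
information-gain example above (function names are my own):

```python
def gini(rows):
    """Gini(D) = 1 - sum over classes of p_i^2."""
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    n = len(rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def gini_split(rows, i, subset):
    """Weighted Gini index Gini_A(D) for the binary split
    'value of attribute i in subset' vs 'not in subset'."""
    d1 = [r for r in rows if r[i] in subset]
    d2 = [r for r in rows if r[i] not in subset]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(f"{gini(data):.3f}")                                   # 0.459
print(f"{gini_split(data, attrs['income'], {'low', 'medium'}):.3f}")  # 0.443
# Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459, and the binary income
# split {low, medium} vs {high} gives 0.443.
```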
Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (from the buys_computer tree; youth = "<=30",
  middle_aged = "31…40", senior = ">40")
IF age = "youth" AND student = "no" THEN buys_computer = "no"
IF age = "youth" AND student = "yes" THEN buys_computer = "yes"
IF age = "middle_aged" THEN buys_computer = "yes"
IF age = "senior" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = "senior" AND credit_rating = "fair" THEN buys_computer = "yes"
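A sketch of this path-to-rule traversal over the nested-dict tree
representation used in the induction sketch earlier (the
representation is my own, not from the slides):

```python
def extract_rules(tree, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: holds the class label
        conj = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        print(f'IF {conj} THEN buys_computer = "{tree}"')
        return
    attr = next(iter(tree))                 # attribute tested at this node
    for value, subtree in tree[attr].items():
        extract_rules(subtree, conditions + ((attr, value),))

# The buys_computer tree shown earlier:
tree = {"age": {"youth": {"student": {"no": "no", "yes": "yes"}},
                "middle_aged": "yes",
                "senior": {"credit_rating": {"excellent": "no",
                                             "fair": "yes"}}}}
extract_rules(tree)   # prints the five rules above
```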
Decision tree advantages

Advantages:
• Inexpensive to construct (does not require parameter setting or
  domain knowledge)
• Extremely fast at the learning step and also at classifying unknown
  records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many
  simple data sets
• Very good average performance over many (multidimensional) datasets
Overfitting in Classification
What is overfitting?

• Happens when a model learns the detail and noise in the training
  data to the extent that it negatively impacts the performance of
  the model on new data
• That is, noise and random fluctuations in the training data are
  picked up and learned as concepts by the model
• An induced tree may overfit the training data
  – Too many branches, some may reflect anomalies due to noise or
    outliers
  – Poor accuracy for unseen samples
Solutions for overfitting
Pruning: remove the least reliable branches.
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early, i.e. do not split a
    node if this would result in the goodness measure falling below a
    threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree to get a
    sequence of progressively pruned trees
    • E.g. the cost-complexity algorithm (compares the cost at each
      node of a subtree with and without pruning)
  • Use a set of data different from the training data to decide
    which is the "best pruned tree"
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training
  – but apply a statistical test (e.g., chi-square) to estimate
    whether expanding or pruning a node may improve the entire
    distribution
• Use the minimum description length (MDL) principle
  – halt growth of the tree when the encoding is minimized
Bayesian Classification

Introduction
• These are statistical classifiers
• They predict class membership probabilities, such as the
  probability that a given tuple belongs to a particular class
• Based on Bayes' theorem
• An example of a simple Bayesian classifier is the naïve Bayesian
  classifier, which assumes class conditional independence
• High accuracy and speed when applied to large databases
Bayes' Theorem
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that
  the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e., the initial
  probability before we observe any data)
• P(X): prior probability of X
• P(X|H): posterior probability of observing the sample X, given that
  the hypothesis holds (the likelihood)
Bayes' Theorem cont..

• Given training data X, the posterior probability of a hypothesis H,
  P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as

  posterior = likelihood × prior / evidence

• P(H), P(X|H), and P(X) may be estimated from the given data
Naïve Bayes Classifier
• Let D be a training set of tuples and their associated class labels
• Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X,
  the naïve Bayesian classifier predicts that X belongs to the class
  Ci if and only if

  P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

• Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is
  maximized is called the maximum a posteriori hypothesis. By Bayes'
  theorem,

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayes Classifier cont..
• As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be
  maximized
• If the classes are equally likely, that is, P(C1) = P(C2) = … =
  P(Cm), we maximize P(X|Ci); otherwise, we maximize P(X|Ci) P(Ci)
• P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples
  of class Ci in D
• A simplifying assumption: the attributes are conditionally
  independent, so

  P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• Here, xk refers to the value of attribute Ak for tuple X
Naïve Bayes Classifier cont..

• If Ak is categorical, then P(xk|Ci) is the number of tuples of
  class Ci in D having the value xk for Ak, divided by |Ci,D|, the
  number of tuples of class Ci in D
• If Ak is continuous-valued, it is typically assumed to have a
  Gaussian distribution with mean μ and standard deviation σ:

  P(xk|Ci) = g(xk, μCi, σCi), where
  g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x-μ)² / (2σ²))
Naïve Bayes Classifier cont..

• To predict the class label of X, P(X|Ci) P(Ci) is evaluated for
  each class Ci
• The classifier predicts class Ci if and only if

  P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
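A minimal sketch of the whole procedure on the categorical
buys_computer data (reusing `data` from the information-gain example;
function name is my own). For X = (youth, medium, yes, fair) the
products come out to about 0.028 for yes and 0.007 for no, so the
prediction is yes:

```python
def naive_bayes_predict(rows, x):
    n = len(rows)
    best_class, best_score = None, -1.0
    for c in {r[-1] for r in rows}:
        rows_c = [r for r in rows if r[-1] == c]
        score = len(rows_c) / n              # prior P(Ci) = |Ci,D|/|D|
        for i, xk in enumerate(x):           # conditional independence:
            count = sum(1 for r in rows_c if r[i] == xk)
            score *= count / len(rows_c)     # P(xk|Ci)
            # (A Laplacian correction would use
            # (count + 1) / (len(rows_c) + number_of_values_of_Ak)
            # to avoid the zero-probability problem discussed next.)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

print(naive_bayes_predict(data, ("youth", "medium", "yes", "fair")))
# -> ('yes', 0.0282...)
```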
Naïve Bayes Classifier cont..
• What happens if we end up with a probability value of zero for some
  P(xk|Ci)?
• Plugging this zero value into the product would return a zero
  probability for P(X|Ci)
• A zero probability cancels the effects of all the other
  probabilities (on Ci) involved in the product
• The standard remedy is the Laplacian correction: add 1 to each
  count so that no estimate is exactly zero
K-Nearest Neighbor classifier

• Eager learners, when given a set of training tuples, construct a
  generalization (i.e., classification) model before receiving new
  (e.g., test) tuples to classify
• When given a training tuple, a lazy learner simply stores it (or
  performs only minor processing)
  • and waits until it is given a test tuple
• Because lazy learners store the training tuples or "instances,"
  they are also referred to as instance-based learners
KNN cont..
• Labor intensive when given large training sets
• Based on learning by comparing a given test tuple with training
  tuples that are similar to it
• When given an unknown tuple, a k-nearest-neighbor classifier
  searches the pattern space for the k training tuples that are
  closest to the unknown tuple
• These k training tuples are the k "nearest neighbors" of the
  unknown tuple
• "Closeness" is defined in terms of a distance metric, such as
  Euclidean distance
KNN cont..
• The Euclidean distance between two points or tuples, say
  X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is

  dist(X1, X2) = √( Σi=1..n (x1i - x2i)² )

• Normalize the values of each attribute before using this equation
• The unknown tuple is assigned the most common class among its k
  nearest neighbors
Example

Height (cm)  Weight (kg)  T-shirt size
158          58           M
158          59           M
158          63           M
160          59           M
160          60           M
163          60           M
163          61           M
160          64           L
163          64           L
165          61           L
165          62           L
165          65           L
168          62           L
168          63           L
168          66           L

New customer Akshay: height 161 cm, weight 61 kg. T-shirt size = ?

Euclidean distance between the first observation and Akshay:
SQRT((161-158)^2 + (61-58)^2) = 4.2
Example cont..

Height (cm)  Weight (kg)  T-shirt size  Distance  Rank
158          58           M             4.2
158          59           M             3.6
158          63           M             3.6
160          59           M             2.2       4
160          60           M             1.4       1
163          60           M             2.2       3
163          61           M             2.0       2
160          64           L             3.2       5
163          64           L             3.6
165          61           L             4.0
165          62           L             4.1
165          65           L             5.7
168          62           L             7.1
168          63           L             7.3
168          66           L             8.6

Similarly, calculate the distance of all tuples and rank them; the
smallest distance is ranked 1.
Example cont..
• Let k = 5; the algorithm searches for the 5 customers closest to
  Akshay
• Look at their categories and assign the most common class to Akshay
• E.g. 4 of them have T-shirt size M and 1 has L, so the
  classification for Akshay is M

• k-NN can also be used for numeric or real-valued prediction for a
  given unknown tuple
• The classifier then returns the average value of the real-valued
  labels associated with the k nearest neighbors of the unknown tuple
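Putting the T-shirt example together, a minimal k-NN sketch (function
names are my own; the data are copied from the table above):

```python
from math import sqrt
from collections import Counter

train = [(158,58,"M"),(158,59,"M"),(158,63,"M"),(160,59,"M"),(160,60,"M"),
         (163,60,"M"),(163,61,"M"),(160,64,"L"),(163,64,"L"),(165,61,"L"),
         (165,62,"L"),(165,65,"L"),(168,62,"L"),(168,63,"L"),(168,66,"L")]

def knn_classify(train, height, weight, k=5):
    # Rank all training tuples by Euclidean distance to the query.
    ranked = sorted(train,
                    key=lambda t: sqrt((t[0]-height)**2 + (t[1]-weight)**2))
    neighbors = [label for _, _, label in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]   # majority vote

print(knn_classify(train, 161, 61))   # -> 'M' (4 of Akshay's 5 neighbors are M)
```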
KNN cont..
• If the variables in the training sample are measured in different
  units, standardize them before computing distances
• Otherwise one variable dominates the others in the distance
  calculation
• Min-max and z-score normalization, e.g.

  v' = (v - minA) / (maxA - minA)

• How can distance be computed for attributes that are not numeric
  but categorical, such as color?
• Many methods exist, e.g. distance zero for the same color and one
  for different colors
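A sketch of min-max normalization applied to the height attribute
from the T-shirt data:

```python
def min_max_normalize(values):
    """v' = (v - min_A) / (max_A - min_A): rescales an attribute to
    [0, 1] so that no attribute dominates the distance calculation."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [158, 160, 163, 165, 168]
print(min_max_normalize(heights))   # [0.0, 0.2, 0.5, 0.7, 1.0]
```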
KNN cont..
• If the value of either or both variables is missing, the distance
  is taken as 1
• A good value for k is found by trial and error
• Poor accuracy with noisy or irrelevant attributes
• Classification of test tuples is slow

Regression Model
Prediction?
• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict unknown values
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification predicts categorical class labels
  – Prediction models continuous-valued functions
Regression Model cont..
• Predictive modeling: predict data values or construct generalized
  linear models based on the database data

Introduction
• Regression analysis is a statistical technique used to describe
  relationships among variables
• The simplest case to examine is one in which a variable y, referred
  to as the dependent or target variable, may be related to one
  variable x, called an independent or explanatory variable, or
  simply a regressor
Linear regression

• Linear regression: if the relationship between y and x is believed
  to be linear, then the equation is

  y = w0 + w1 x

  – The two parameters, w0 and w1, are regression coefficients that
    specify the line and are to be estimated from the data at hand
• In simplest terms, the purpose of regression is to find the best
  line or equation that expresses the relationship between y and x
  – using the least squares criterion on the known values
    (x1, y1), (x2, y2), … to solve for the coefficients
• Least squares minimizes the error between the actual data and the
  estimate of the line
Linear regression cont..

• The coefficients are estimated by

  w1 = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²
  w0 = ȳ - w1 x̄

Linear regression

Consider the following data points:

x (Experience):  3   8   9  13   3   6  11  21   1  16
y (Salary):     30  57  64  72  36  43  59  90  20  83

For the given data we compute x̄ = 9.1 and ȳ = 55.4.
Linear regression cont..

• The equation of the least squares line is estimated as

  y = 23.6 + 3.5x

• Using this equation, we can predict that the salary of a college
  graduate with, say, 10 years of experience will be $58,600
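A sketch that reproduces the fit from the coefficient equations above
(note that the exact intercept is about 23.2; the slide's 23.6 comes
from rounding w1 to 3.5 before computing w0):

```python
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)        # 9.1, 55.4
w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)              # ~3.5
w0 = y_bar - w1 * x_bar                                # ~23.2
print(f"y = {w0:.1f} + {w1:.1f}x -> at x=10: {w0 + w1*10:.1f}")  # ~58.6
```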
Multiple Linear regression
• An extension of straight-line regression involving more than one
  predictor variable
• Example
  • Say 2 predictor variables or attributes, A1 and A2, describe a
    tuple X; then the multiple linear regression model is

  y = w0 + w1 x1 + w2 x2

• Least squares can be extended to solve for the coefficients, but
  the equations are tedious to solve by hand
• Statistical software packages such as SPSS are used to solve them
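In code, a general least-squares solver does the tedious work; a
sketch with NumPy (the data here are made up for illustration):

```python
import numpy as np

# Multiple linear regression y = w0 + w1*x1 + w2*x2 by least squares.
X = np.array([[1, 3, 20], [1, 8, 25], [1, 9, 31], [1, 13, 38], [1, 6, 27]],
             dtype=float)    # leading column of 1s absorbs the intercept w0
y = np.array([30, 57, 64, 72, 43], dtype=float)

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # w = [w0, w1, w2]
print(w)
```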
Nonlinear Regression
• Used to model data that does not show a linear dependence
• E.g. the relationship between the response variable and the
  predictor variable can be modeled by a polynomial function

  y = w0 + w1 x + w2 x² + w3 x³

• For a linear transformation of the above equation, define new
  variables
  x1 = x, x2 = x², x3 = x³
• We get y = w0 + w1 x1 + w2 x2 + w3 x3, which can be solved by least
  squares
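A sketch of this transformation: build the design matrix
[1, x, x², x³] and solve with ordinary least squares (the data are
illustrative, roughly y = x³ + 1 with noise):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 9.8, 28.9, 65.2, 125.7, 217.0])

# New variables x1 = x, x2 = x^2, x3 = x^3 make the model linear.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # [w0, w1, w2, w3], with w3 close to 1 for this data
```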
Logistic regression
• Linear regression is used to model continuous-valued functions
• To predict categorical labels, generalized linear models such as
  logistic regression are used
• Models the probability of some event occurring as a linear function
  of a set of predictor variables
• Used for classification problems, e.g. binary classification (0 or 1)
• Called a linear model, but it uses a more complex function, the
  sigmoid or logistic function
• Used for mapping predicted values to probabilities, that is, it
  maps any real value to a value between 0 and 1:

  σ(z) = 1 / (1 + e^(-z))
Logistic regression cont..
• As z tends to ∞, σ(z) tends to 1, because e^(-z) tends to 0
• As z tends to -∞, σ(z) tends to 0, because e^(-z) tends to ∞
• Exactly at z = 0, σ(z) equals 0.5
• So σ is a monotonic function
• The plot of the sigmoid is an S-shaped curve
• y will always lie between 0 and 1
Logistic regression cont..

• A linear function fails to represent a probability because its
  value may be greater than 1 or less than 0
• The linear regression formula is y = w · x
• For logistic regression the modified formula is

  y = σ(w · x)

  where w includes w0, w1, …, wn if you have n features, and x here
  is x1, x2, …, xn
• For a single feature:

  y = σ(z) = σ(w0 + w1 x) = 1 / (1 + e^(-(w0 + w1 x)))
Logistic regression cont..
Decision boundary
• The classifier gives us a set of outputs or classes based on a
  probability score between 0 and 1
• E.g. we have classes 1 = red and 0 = orange
• We decide a threshold value, above which we classify values into
  class red and below which into class orange
• So if we choose the threshold as 0.5, then a value of 0.7 returned
  by the prediction function is classified as red and 0.2 as orange
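A minimal sketch of the prediction function and the 0.5 decision
threshold described above (the coefficients are illustrative, not
fitted):

```python
from math import exp

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_class(w0, w1, x, threshold=0.5):
    """y = sigma(w0 + w1*x); classify as 'red' (1) at or above the
    threshold, 'orange' (0) below it."""
    p = sigmoid(w0 + w1 * x)
    return ("red" if p >= threshold else "orange"), p

print(sigmoid(0))                     # 0.5, as noted above
print(predict_class(-1.0, 0.8, 3))    # probability ~0.80 -> 'red'
```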
