Unit 3 (DWDM)
Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics.
Data Mining: Data mining, in general terms, means extracting useful patterns and knowledge from large volumes of data. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance.
There are two main types of classification:
1. Binary classification: classifying instances into two classes, such as “spam” or “not spam”.
2. Multi-class classification: classifying instances into more than two classes.
Classification: It is a data analysis task, i.e., the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and further approving it. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of the classification model. Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained to predict accurate results.
2. Classification Step: The model is used to predict class labels for new data; the constructed model is tested on test data to estimate the accuracy of the classification rules. Test data are used to estimate the accuracy of the classification rule. A small code sketch of these two steps is given below.
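The two steps can be illustrated with a short code sketch. This is a minimal example, assuming the scikit-learn library and a made-up loan data set; the attribute values and labels are hypothetical, not taken from the text.

# Minimal sketch of the two-step classification process (learning + classification),
# assuming scikit-learn; the loan data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical loan applications: [income_in_thousands, has_collateral (0/1)]
X = [[25, 0], [40, 1], [60, 1], [15, 0], [80, 1], [30, 0], [55, 1], [20, 0]]
y = ["risky", "safe", "safe", "risky", "safe", "risky", "safe", "risky"]

# Step 1 (Learning): build the classifier from the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (Classification): predict class labels for test data and estimate accuracy
y_pred = model.predict(X_test)
print("Predicted labels:", list(y_pred))
print("Accuracy on test data:", accuracy_score(y_test, y_pred))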
How Classification Works:
With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps, learning and classification, as described above.
Advantages:
Mining Based Methods are cost-effective and efficient
Helps in identifying criminal suspects
Helps in predicting the risk of diseases
Helps banks and financial institutions identify likely defaulters before approving credit cards, loans, etc.
Disadvantages:
Privacy: When data about customers is collected, there is a chance that a company may give some information about its customers to other vendors or use this information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.
APPLICATIONS: Typical applications include credit/loan approval, medical diagnosis (disease risk prediction), fraud detection and target marketing.
General approach to solve a classification problem:
--A classification technique is a systematic approach to build classification models based on a data set.
--Examples are decision tree classifiers, rule-based classifiers, neural networks, support vector machines and naïve Bayes classifiers.
--Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data.
--A training set, consisting of records whose class labels are known, must be provided. The training set is used to build a classification model, which is then applied to the test set. The test set consists of records whose class labels are unknown.
--Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f01 + f10).
--Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models.
--This can be done using a performance metric.
--Accuracy can be expressed as:
Accuracy = Number of correct predictions / Total number of predictions
         = (f11 + f00) / (f11 + f10 + f00 + f01)
--Equivalently, the error rate can be expressed as:
Error rate = Number of wrong predictions / Total number of predictions
           = (f10 + f01) / (f11 + f10 + f00 + f01)
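As a small illustration, the two metrics can be computed directly from the four confusion-matrix counts; the counts below are made-up numbers, not from the text.

# Accuracy and error rate from the confusion-matrix entries f11, f10, f01, f00.
# The counts used here are hypothetical.
f11, f00 = 40, 45   # correctly predicted records of class 1 and class 0
f10, f01 = 10, 5    # misclassified records

total = f11 + f10 + f00 + f01
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total

print("Accuracy   =", accuracy)    # 0.85
print("Error rate =", error_rate)  # 0.15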
Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
“How are decision trees used for classification?” Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.
During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
The tree has three types of nodes:
i) A root node has no incoming edges and zero or more outgoing edges.
ii) Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
iii) Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Fig: A decision tree for the mammal classification problem
Fig: Classifying an unlabelled vertebrate
Building of a decision tree:
i) Hunt’s algorithm
ii) ID3 (Iterative Dichotomiser 3)
iii) C4.5 (an extension of ID3)
iv) CART (Classification and Regression Trees)
--These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use for partitioning the data. One such algorithm is Hunt’s algorithm.
Hunt’s algorithm
--In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into subsets.
--Let Dt be the set of training records that are associated with node t and y = {y1, y2, …, yc} be the class labels.
--The recursive procedure for Hunt’s algorithm is as follows:
STEP 1
If all the records in Dt belong to the same class yt, then t is a leaf node labelled as yt.
STEP 2
If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome, and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
Fig: Training set for predicting borrowers who will default on loan payments
--In the above data set, the class labels of the 10 records are not all the same, so step 1 is not satisfied. We need to construct the decision tree using step 2.
--Select one of the attributes as the root node, say, Home Owner, since Home Owner with entry “yes” does not require any further splitting. There are 3 records with Home Owner = yes and 7 records with Home Owner = no.
--The records with Home Owner = yes are classified, and we now need to classify the other 7 records, i.e., Home Owner = no. The attribute test condition can be applied either on Marital Status or on Annual Income.
--Let us select Marital Status, where we apply a binary split. Here Marital Status = married does not require further splitting.
--The records with Marital Status = married are classified, and we now need to classify the other 4 records, i.e., Home Owner = no and Marital Status = single or divorced.
--The leftover attribute is Annual Income. Here we select a split range, since it is a continuous attribute.
--Now the other 4 records are also classified.
i) It is possible for some of the child nodes created in step 2 to be empty, i.e., there are no records associated with these nodes. In such cases, assign the same class label as the majority class of the training records associated with the parent node; in our example the majority class is “no”, so we assign ‘no’ to the new node.
ii) If all the records in Dt have identical attribute values but different class labels, further splitting is not possible; in such cases, assign the majority class label.
Methods for expressing attribute test conditions:
Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.
The following are the methods for expressing attribute test conditions:
i) Binary attributes: The test condition for a binary attribute generates two outcomes, as shown below.
ii) Nominal attributes: Since a nominal attribute can have many values, its test condition can be expressed in two ways, as shown below.
For a multi-way split, the number of outcomes depends on the number of distinct values of the corresponding attribute.
Some algorithms, such as CART, support only binary splits. In such cases we can partition the k attribute values into 2^(k-1) - 1 ways.
For example, Marital Status has k = 3 values, so we can split it in 2^(3-1) - 1 = 3 ways (see the sketch after this list).
iii) Ordinal attributes: These can also produce binary or multi-way splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
In the above example, condition (a) and condition (b) satisfy the order, but condition (c) violates the order property.
iv) Continuous attributes: The test condition can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or as a range query with outcomes of the form v_i <= A < v_(i+1), for i = 1, 2, …, k.
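As a small illustration of the 2^(k-1) - 1 binary groupings of a nominal attribute, the snippet below enumerates them for a hypothetical three-valued Marital Status attribute.

# Enumerate the binary partitions of a nominal attribute's values.
# For k distinct values there are 2**(k-1) - 1 such partitions.
from itertools import combinations

def binary_partitions(values):
    values = list(values)
    parts = []
    # (subset, complement) and (complement, subset) describe the same split,
    # so keep only subsets that contain the first value
    for r in range(1, len(values)):
        for left in combinations(values, r):
            if values[0] in left:
                right = tuple(v for v in values if v not in left)
                parts.append((left, right))
    return parts

splits = binary_partitions(["single", "married", "divorced"])
print(len(splits))               # 3  (= 2**(3-1) - 1)
for left, right in splits:
    print(left, "vs", right)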
Measures for selecting the best split:
There are many measures that can be used to determine the best way to split the records.
Let p(i|t) denote the fraction of records belonging to class i at a node t. The measures for selecting the best split are often based on the degree of impurity of the child nodes. The smaller the degree of impurity, the more skewed the class distribution. For example, a node with class distribution (0, 1) has zero impurity, whereas a node with a uniform class distribution (0.5, 0.5) has the highest impurity.
The three impurity measures (Entropy, Gini index and Classification error) attain their maximum values when the class distribution is uniform and their minimum when all the records belong to the same class.
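The text refers to the three impurity measures without writing them out; the standard definitions can be sketched as follows (the class distributions used in the demo are illustrative).

# Standard node-impurity measures, computed from the class fractions p(i|t).
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1 - max(p)

for dist in [(0.0, 1.0), (0.5, 0.5)]:          # pure node vs. uniform node
    print(dist, entropy(dist), gini(dist), classification_error(dist))
# (0.0, 1.0) -> 0.0, 0.0, 0.0   (zero impurity)
# (0.5, 0.5) -> 1.0, 0.5, 0.5   (maximum impurity)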
Compare the degree of impurity of the parent node with the degree of impurity of the child nodes. The larger their difference, the better the test condition. The gain, ∆, is a criterion that can be used to determine the goodness of a split:
∆ = I(parent) − Σ_(j=1..k) [ N(v_j) / N ] · I(v_j)
where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values (child nodes), and N(v_j) is the number of records associated with the child node v_j. When entropy is used as the impurity measure, the difference in entropy is known as the information gain, ∆info.
Splitting of binary attributes
Suppose there are two ways to split the data into smaller subsets, say, using attribute A or attribute B. Before splitting, the Gini index is 0.5 since there are equal numbers of records from both classes.
For attribute A, the Gini index of one child node is 1 − [(2/5)^2 + (3/5)^2] = 0.48 and of the other child node is 0.4898.
The average weighted Gini index is (7/12)(0.4898) + (5/12)(0.48) = 0.486.
For attribute B, the average weighted Gini index is 0.375. Since the subsets for attribute B have a smaller Gini index than those for A, attribute B is preferable.
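The weighted-average calculation above can be reproduced with a few lines of code. The child-node class counts used below, (4, 3) and (2, 3), are assumptions implied by the Gini values 0.4898 and 0.48, since the original figure is not shown.

# Weighted Gini index of a binary split; child class counts are assumed from the text's numbers.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

children_A = [(4, 3), (2, 3)]            # assumed counts giving Gini 0.4898 and 0.48
total = sum(sum(c) for c in children_A)  # 12 records in the parent node

weighted = sum(sum(c) / total * gini(c) for c in children_A)
print(round(gini((6, 6)), 3))   # 0.5   -> parent node impurity before splitting
print(round(weighted, 3))       # 0.486 -> average weighted Gini index for attribute A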
Splitting of nominal attributes
A nominal attribute can produce either a binary or a multi-way split.
The computation of the Gini index is the same as for binary attributes. The split with the smaller average Gini index is the better split. In our example, the multi-way split has the lowest Gini index, so it is the best split.
Splitting of continuous attributes
In order to split a continuous attribute, we select a split range.
In our example, the sorted values represent the ascending order of the distinct values of the continuous attribute. Split positions are the midpoints between two adjacent sorted values.
Algorithm for decision tree induction:
i) The createNode() function extends the decision tree by creating a new node. A node in the decision tree has either a test condition, denoted as node.test_cond, or a class label, denoted as node.label.
ii) The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records.
iii) The classify() function determines the class label to be assigned to a leaf node.
iv) The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have been classified or not.
A skeleton showing how these four routines fit together is sketched after this list.
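The following is a minimal sketch of the recursive tree-growing skeleton built from the four routines above, following the usual textbook outline. The record representation and the trivial find_best_split used here are assumptions for illustration only.

# Skeleton of the recursive tree-growing algorithm using the four routines above.
# Records are dicts with a "label" key; attributes are categorical (assumption).
from collections import Counter

def stopping_cond(records, attributes):
    labels = {r["label"] for r in records}
    return len(labels) <= 1 or not attributes

def classify(records):
    return Counter(r["label"] for r in records).most_common(1)[0][0]

def find_best_split(records, attributes):
    return attributes[0]          # placeholder: a real version would maximize the gain

def create_node():
    return {"test_cond": None, "label": None, "children": {}}

def tree_growth(records, attributes):
    if stopping_cond(records, attributes):
        leaf = create_node()
        leaf["label"] = classify(records)      # leaf node gets the majority class label
        return leaf
    root = create_node()
    root["test_cond"] = find_best_split(records, attributes)
    remaining = [a for a in attributes if a != root["test_cond"]]
    for value in {r[root["test_cond"]] for r in records}:
        subset = [r for r in records if r[root["test_cond"]] == value]
        root["children"][value] = tree_growth(subset, remaining)
    return root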
Model Overfitting:
The errors committed by a classification model are of two types:
i) Training errors
ii) Generalization errors
--Training errors are the misclassification errors committed on the training records.
--Generalization errors are the expected errors of the model on previously unseen records. For example, the class label of a record in the test data is known, but it is wrongly predicted by the model. This type of error is known as a generalization error.
--A good model should have low training errors as well as low testing errors.
--The training and test error rates are both large when the size of the tree is very small. This situation is known as model underfitting.
--When the tree becomes too large, the test error rate increases while the training error rate decreases. This situation is known as model overfitting.
--Consider the training and test sets for the mammal classification problem. Two of the ten training records are mislabeled: bats and whales are classified as non-mammals instead of mammals.
--The class label predicted for {name = ‘human’, body temperature = ‘warm-blooded’, gives birth = ‘yes’, four-legged = ‘no’, hibernates = ‘no’} by the above decision tree is non-mammal. But humans are mammals; the prediction is wrong due to the presence of noise in the data.
Overfitting due to lack of representative samples:
Fig: Training data
--The decision tree for the above training data is as follows:
Fig: Decision tree
Fig: Test set
--From the above decision tree, humans, elephants and dolphins are misclassified, since the tree is constructed from a small number of training records.
Evaluating the performance of a classifier:
--A classification algorithm should be evaluated before using it on real data. The accuracy and error rate are judged by predicting the class labels of test sets whose class labels are already known in advance.
--The following methods are used for evaluating the performance of a classifier:
--Holdout method
--Random subsampling
--Cross-validation
--Bootstrap
Holdout Method:
In this method the original data set is divided into two parts: 50% or 2/3 of the original data is used as the training set, and the remaining 50% or 1/3 as the test set, respectively. The classification model is trained on the training set and then applied to the test set.
The performance of the classification algorithm is based on the number of correct predictions made on the test set.
Limitations
1) Fewer samples are available for training (since the original samples are split).
2) The model is highly dependent on the composition of the training and test sets.
Random Subsampling:
Multiple repetitions of the holdout method are known as random subsampling. Here the original data is divided randomly into a training set and a test set, and the accuracy is calculated as in the holdout method. This random sampling is then repeated k times, and the accuracy is calculated each time. The overall accuracy is:
Overall accuracy = (acc_1 + acc_2 + … + acc_k) / k
--Here acc_i is the model accuracy during the i-th iteration.
Limitations
1) Fewer samples are available for training (since the original samples are split).
2) A record may be used more than once in the training and test sets.
Cross-Validation:
--There are three variations of the cross-validation approach:
a) Two-fold cross-validation
In this approach the data is partitioned into two parts. The first part is used as the training set and the second part as the test set. Then the roles are swapped: the first part is used as the test set and the second as the training set. The total error is the sum of both errors.
b) K-fold cross-validation
In this approach the data is partitioned into k subsets. One of the partitions is used as the test set and the remaining subsets are used as the training set. This process is repeated k times, each partition serving as the test set once, and the total error is the sum over all k runs (a code sketch is given after this list).
c) Leave-one-out approach
In this approach one record is used as the test set and the rest of the records are used as the training set. This process is repeated k times (k = number of records) and the total error is the sum over all k runs. However, this process is computationally very expensive.
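A minimal hand-rolled sketch of k-fold cross-validation on a hypothetical data set (the trivial majority-class "classifier" is a placeholder for illustration):

# Hand-rolled k-fold cross-validation; the data and the "classifier" are placeholders.
from collections import Counter

def k_fold_indices(n_records, k):
    folds = [[] for _ in range(k)]
    for i in range(n_records):
        folds[i % k].append(i)          # deal records into k roughly equal folds
    return folds

def cross_validate(records, labels, k, train_and_score):
    folds = k_fold_indices(len(records), k)
    scores = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [(records[j], labels[j]) for j in range(len(records)) if j not in test_idx]
        test = [(records[j], labels[j]) for j in folds[i]]
        scores.append(train_and_score(train, test))   # one accuracy per fold
    return sum(scores) / k                            # overall (average) accuracy

def majority_classifier(train, test):
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return sum(1 for _, lbl in test if lbl == majority) / len(test)

data = [[i] for i in range(10)]
labels = ["yes"] * 6 + ["no"] * 4
print(cross_validate(data, labels, k=5, train_and_score=majority_classifier))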
Bootstrap
In this approach a record may be sampled more than once. This means that when a record is sampled, it is placed back into the original data, so it is likely that the same record may be sampled again and again (sampling with replacement). Consider original data of size N. The probability of a record being chosen for a bootstrap sample is 1 − (1 − 1/N)^N. When N is very large, this probability approaches 1 − e^(−1) ≈ 0.632. The sampling is repeated b times to generate b bootstrap samples.
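The probability statement above can be checked with a short simulation of sampling with replacement; the data size used below is arbitrary.

# Bootstrap sampling: each bootstrap sample is drawn with replacement from the original data.
# On average, a fraction of about 1 - e^(-1) = 0.632 of the records appear in each sample.
import random

N = 1000
data = list(range(N))
b = 50                                  # number of bootstrap samples
fractions = []
for _ in range(b):
    sample = [random.choice(data) for _ in range(N)]   # sample N records with replacement
    fractions.append(len(set(sample)) / N)             # fraction of distinct records chosen

print("theoretical:", 1 - (1 - 1 / N) ** N)   # close to 0.632
print("simulated  :", sum(fractions) / b)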
Classification: Alternative Techniques:
Bayesian Classification:
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered “evidence” and it is described by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
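A tiny numeric illustration of the theorem; the probabilities below are invented for illustration and are not taken from the text.

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X), with made-up numbers.
p_h = 0.3            # prior probability of the hypothesis H (e.g., "buys computer")
p_x_given_h = 0.5    # likelihood of observing the evidence X when H holds
p_x = 0.25           # overall probability of the evidence X

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6 -> posterior probability of H given X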
Naïve Bayesian Classification:
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = Π_(k=1..n) P(xk|Ci)
        = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
5. We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples.
6. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:
-- If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
-- If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward; a sketch is given below.
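For continuous-valued attributes, the usual textbook approach, assumed here since the text does not spell it out, is to model the attribute with a Gaussian distribution whose mean and standard deviation are estimated from the tuples of class Ci, and to use the Gaussian density as P(xk|Ci). A minimal sketch with hypothetical values:

# Gaussian estimate of P(xk|Ci) for a continuous attribute (standard textbook approach).
import math

def gaussian(x, mean, std):
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Hypothetical ages of the training tuples that belong to class Ci:
ages_in_class = [25, 30, 35, 40, 45]
mean = sum(ages_in_class) / len(ages_in_class)
std = math.sqrt(sum((a - mean) ** 2 for a in ages_in_class) / len(ages_in_class))

print(gaussian(38, mean, std))   # estimate of P(age = 38 | Ci)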
Example:
We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
P(Ci), the prior probability of each class, can be computed based on the training tuples.
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities.
Using these probabilities, we obtain
P(X|buys_computer = yes) = P(age = youth | buys_computer = yes)
× P(income = medium | buys_computer = yes)
× P(student = yes | buys_computer = yes)
× P(credit_rating = fair | buys_computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly,
P(X|buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.
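The final step, multiplying in the class priors and choosing the larger product, is easy to reproduce in code. The conditional probabilities below are the ones listed above; the priors P(buys_computer = yes) and P(buys_computer = no) are left as placeholders to be computed from the training tuples.

# Naive Bayes decision for the tuple X, using the conditional probabilities given above.
cond_yes = [0.222, 0.444, 0.667, 0.667]   # P(xk | buys_computer = yes) for the 4 attributes
cond_no  = [0.600, 0.400, 0.200, 0.400]   # P(xk | buys_computer = no)

def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

p_x_given_yes = product(cond_yes)          # about 0.044
p_x_given_no  = product(cond_no)           # about 0.019

# Priors would be computed from the training tuples (placeholders here):
p_yes, p_no = 0.5, 0.5                     # replace with the actual class fractions

score_yes = p_x_given_yes * p_yes
score_no  = p_x_given_no * p_no
print("predicted class:", "yes" if score_yes > score_no else "no")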
Bayesian Belief Networks