
Introducing Data Science and

Data Analytics
Using Data mining algorithms for Data
Analytics
Objectives
• At the end of this unit, we plan to understand:
– What data analytics is
– classification algorithms
• Decision tree classification
– clustering algorithms
• K-means clustering
DATA MINING: a step in the
process of Data Analytics
• Data analytics deals with every step in the process of
a data-driven model, including data mining
• Data mining is therefore a step in the process of data
analytics
– Predictive modeling
– Descriptive modeling
What is Data Mining?
• DM is the process of discovery of useful and
hidden patterns in large quantities of data
using machine learning algorithms
– It is concerned with the non-trivial extraction of
implicit, previously unknown and potentially useful
information and knowledge from data
– It discovers meaningful patterns that are valid,
novel, useful and understandable.
• The major task of data mining includes:
– Classification
– Clustering
– Association rule discovery
Data Mining Main Tasks
• Prediction Methods: create a model to predict the class of unknown or new instances.
• Description Methods: construct a model that can describe the existing data.
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about
values of data using known results found from different
historical data
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable Y given a set of other variables X.
Here X could be an n-dimensional vector
– In effect this is a function approximation through learning the
relationship between Y and X
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, less
emphasis on understanding the model
Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data
Predictive Modeling: Customer Scoring
• Goal: To predict whether a customer is a high risk
customer or not.
– Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a
function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc
– Many, many applications of this nature
Classification

• Example: Credit scoring
– Differentiating between low-risk and high-risk customers from their income and savings
– Discriminant rule: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
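A minimal sketch of this rule in Python; the threshold values θ1 and θ2 below are made-up for illustration, not from the slides:

```python
# Assumed, illustrative thresholds (theta1 = income, theta2 = savings)
THETA1 = 30_000
THETA2 = 5_000

def credit_risk(income: float, savings: float) -> str:
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45_000, 8_000))   # low-risk
print(credit_risk(20_000, 12_000))  # high-risk
```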
Predictive Modeling: Fraud Detection
• Goal: Predict fraudulent cases in credit card
transactions.
– Credit card losses in the US are over 1 billion $ per year
– Roughly 1 in 50 transactions are fraudulent
• Approach:
– Use credit card transactions and the information on its
account-holder as attributes.
• When does a customer buy, what does he buy, how often
he pays on time, etc
– Label past transactions as fraud or fair transactions. This
forms the class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by observing credit card
transactions on an account.
DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the underlying
observation
– e.g., a model that could simulate the data if needed
• A descriptive model identifies patterns or relationships in data
– Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties
[Figure: EM clustering (iteration 25) of blood-cell data; axes: Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration]
• Description Methods find human-interpretable patterns that describe and find natural groupings of the data.
• Methods used in descriptive modeling are: clustering, summarization, association rule discovery, etc.
Example of Descriptive Modeling
• goal: learn directed relationships among p variables
• techniques: directed (causal) graphs
• challenge: distinguishing between correlation and causation
– Example: Do yellow fingers cause lung cancer?
[Diagram: a questioned direct link "yellow fingers →? cancer" versus a hidden common cause: smoking → yellow fingers and smoking → cancer]
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns
(sequential patterns) in the data rather than to
characterize the data globally
– Also called link analysis (uncovers relationships among data)

• Given market basket data we might discover that
– If customers buy wine and bread, then they buy cheese with probability 0.9
• Methods used in pattern discovery include:
– Association rules, sequence discovery, etc.
Basic Data Mining algorithms
• Classification: which is also called Supervised learning,
maps data into predefined groups or classes to enhance the
prediction process
• Clustering: which is also called Unsupervised learning,
groups similar data together into clusters.
– is used to find appropriate groupings of elements for a set of data.
– Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by bottom-up approach.
• Association Rule: is also known as market-basket
analysis
– It discovers interesting associations between attributes contained in
a database.
– Based on the frequency with which items co-occur in events, an association rule tells us: if item X is part of an event, what is the likelihood that item Y is also part of the event?
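A small illustration of support and confidence, as a sketch with made-up transactions; the wine/bread/cheese rule echoes the earlier market-basket example:

```python
# Hypothetical market-basket transactions (illustrative only)
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

def support(itemset):
    """Fraction of transactions containing all items in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Likelihood that rhs appears given that lhs appears."""
    return support(lhs | rhs) / support(lhs)

print(support({"wine", "bread"}))                  # 0.6
print(confidence({"wine", "bread"}, {"cheese"}))   # ~0.67
```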
Classification
Classification is a data mining (machine
learning) technique used to predict group
membership of new data instances.

OVERVIEW OF CLASSIFICATION
• Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the
class.
– construct a model for class attribute as a function of the
values of other attributes.
– Given data D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm},
the classification problem is to define a mapping f: D → C
where each ti is assigned to one class.
• Goal: previously unseen records should be assigned a class
as accurately as possible. A test set is used to determine
the accuracy of the model.
– Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
Classification Examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Predict whether the weather on a particular day will
be “sunny”, “rainy” or “cloudy”.
• Identify individuals with credit risks.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Document classification into the predefined classes,
such as politics, sport, social, economy, law, etc.

CLASSIFICATION: A TWO-STEP PROCESS
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set → Learning algorithm (Induction) → Learn Model → Model
Test Set → Apply Model (Deduction)
Classification Example
Attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Training Set → Learn Classifier → Model

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Test Set → Apply Classifier (using the Model)
Confusion matrix and performance evaluation
• The most widely-used metric is the Accuracy of the system.
• Compute the confusion matrix and the effectiveness of the model.

Outlook   Temp  Windy  Predicted class  Real class
overcast  mild  yes    YES              YES
rainy     mild  no     YES              YES
rainy     cool  yes    YES              NO
sunny     mild  no     NO               YES
sunny     cool  no     NO               NO
sunny     hot   no     NO               NO
sunny     hot   yes    NO               NO

Confusion matrix (TRUE CLASS in columns, PREDICTED CLASS in rows):
                 YES      NO
PREDICTED YES    2 (TP)   1 (FP)
PREDICTED NO     1 (FN)   3 (TN)
Total:           3 (Yes)  4 (No)

• Compute accuracy, recall and precision.
• Accuracy = (TP + TN) / total = (2 + 3) / 7 ≈ 71%
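A small sketch computing accuracy, precision and recall from this confusion matrix; the precision and recall formulas are the standard ones, not stated on the slide:

```python
TP, FP, FN, TN = 2, 1, 1, 3   # values from the confusion matrix above

accuracy = (TP + TN) / (TP + FP + FN + TN)   # 5/7 ≈ 0.71
precision = TP / (TP + FP)                   # 2/3 ≈ 0.67
recall = TP / (TP + FN)                      # 2/3 ≈ 0.67

print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```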
Classification methods
• Goal: Predict class Ci = f(x1, x2, .. xn)
• There are various classification methods. Popular
classification techniques include the following.
– K-nearest neighbor
– Decision tree classifier: divide decision space into
piecewise constant regions.
– Neural networks: partition by non-linear boundaries
– Bayesian network: a probabilistic model
– Support vector machine

Classification methods
Decision tree
• Decision Tree is a popular supervised learning technique in machine learning,
serving as a hierarchical if-else statement based on feature comparison operators.
• It is used for regression and classification problems, finding relationships between
predictor and response variables.
• The tree structure includes Root, Branch, and Leaf nodes, representing all
possible outcomes based on specific conditions or rules.
• The algorithm aims to create homogenous Leaf nodes containing records of a
single type in the outcome variable. However, sometimes restrictions may lead to
mixed outcomes in the Leaf nodes.
• To build the tree, the algorithm selects features and thresholds by optimizing a
loss function, aiming for the most accurate predictions.
• Decision Trees offer interpretable models and are widely used for various
applications, from simple binary classification to complex decision-making tasks
• Simply put, it takes the form of a tree with branches representing the potential
answers to a given question. There are metrics used to train decision trees. One
of them is information gain
Example of cost function in a decision tree
• Let's do one thing: I offer you beads and we perform an experiment. I have a box with an equal number of beads of two colors: blue and green. You may pick one, but with your eyes closed. The fun part: if you get a green bead you enjoy a bonus of 5 points, and if you get a blue bead you have to submit one more assignment.
• This situation, where your choice can lead to either result with equal probability, is the state of maximum uncertainty.
• If the box contained only green or only blue beads, we would know the outcome in advance, and hence the uncertainty (or surprise) would be zero.
• The probability of getting each outcome of a green or blue is:
– P(bead == green) = 0.50
– P(bead == blue) = 1 – 0.50 = 0.50
• When we have only one result either green or blue, then in the absence of uncertainty,
the probability of the event is:
– P(bead== green) = 1
– P(bead== blue) = 1 – 1 = 0
• There is a relationship between heterogeneity and uncertainty: the more heterogeneous the event, the more uncertainty. On the other hand, the less heterogeneous, i.e. the more homogeneous the event, the lesser the uncertainty. The uncertainty is expressed as Entropy.
Entropy
• In machine learning, entropy measures the impurity or randomness
present in a dataset.
• It is commonly used in decision tree algorithms to evaluate the
homogeneity of data at a particular node.
• A higher entropy value indicates a more heterogeneous dataset
with diverse classes, while a lower entropy signifies a more pure
and homogeneous subset of data.
• Decision tree models can use entropy to determine the best splits
to make informed decisions and build accurate predictive models.
• For a given data set, the entropy can be calculated by the following formula, where pi is the probability of randomly selecting an example in class i:
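The formula the slide points to is the standard Shannon entropy, reconstructed here because the equation image is not part of this text version:

$$H = -\sum_{i} p_i \log_2 p_i$$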
Example calculation of entropy
• We shall estimate the entropy for three different scenarios.
• The event Y is getting a green. The heterogeneity or the impurity formula for
two different classes is as follows:
• H(X) = – [(pi * log2 pi) + (qi * log2 qi)]
• where,
– pi = Probability of Y = 1 i.e. probability of success of the event
– qi = Probability of Y = 0 i.e. probability of failure of the event
• Case 1 Bead color quantities probability
Green 7 0.7
Blue 3 0.3
Total 10 1

• H(X) = – [(0.70 * log2 (0.70)) + (0.30 * log2 (0.30))] = 0.88129089


• This value 0.88129089 is the measurement of uncertainty when you are given the box of beads (seven green and three blue) and asked to pull one out.
Example calculation of entropy cont’d
• Case 2
Bead color quantities probability
Green 5 0.5
Blue 5 0.5
Total 10 1

H(X) = – [(0.50 * log2 (0.50)) + (0.50 * log2 (0.50))] = 1

• Case 3
Bead color quantities probability
Green 10 1
Blue 0 0
Total 10 1

H(X) = – [(1.0 * log2(1.0)) + (0 * log2(0))] = 0   (taking 0 * log2(0) = 0)


Example calculation of entropy cont’d
• In scenarios 2 and 3, we can see that the entropy is 1 and 0, respectively.
• In scenario 3, when we have only one color of the bead pouch, green, and have
removed all the pouches of blue color, then the uncertainty or the surprise is also
completely removed and the aforementioned entropy is zero.
– We can then conclude that the information is 100% present
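A minimal Python check of the three bead scenarios (the helper function name is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 are treated as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.7, 0.3]))  # Case 1 -> ~0.8813
print(entropy([0.5, 0.5]))  # Case 2 -> 1.0
print(entropy([1.0, 0.0]))  # Case 3 -> 0.0
```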
Use of Entropy in Decision Tree
• We can measure the uncertainty involved when choosing any one of the beads from the box.
• Now, how does the decision tree algorithm use this measurement of impurity to build the tree?
• In decision trees, the cost function is to minimize the heterogeneity in the leaf nodes. Therefore, the aim is to find the attributes, and within those attributes the thresholds, such that when the data is split in two we achieve the maximum possible homogeneity, in other words the maximum drop in entropy between the two tree levels.
• At the root level, the entropy of the target column is estimated via the formula proposed by Shannon. At every branch, the entropy computed for the target column is the weighted entropy.
• The weighted entropy takes a weight for each branch, namely the proportion of records that fall into it. The greater the decrease in entropy, the more information is gained.
• Information Gain is the reduction in entropy: the entropy of the parent node minus the weighted entropy of the child nodes. In the three scenarios above, where the parent entropy is 1, this works out to 1 – entropy. The entropy and information gain for the three scenarios are as follows:

         Entropy       Information gain
Case 1   0.88129089    0.11870911
Case 2   1             0
Case 3   0             1
Another Example
• Let's have a dataset made up of three colors: red, purple, and yellow.
• If we have one red, three purple, and four yellow observations in our set, the class probabilities are pr = 1/8, pp = 3/8 and py = 4/8, where pr, pp and py are the probabilities of choosing a red, purple and yellow example respectively, and our equation becomes the one shown below.
• The entropy works out to 1.41
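Reconstructing the computation the slide refers to (the equation image is not included in this text version):

$$E = -\left(\tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{3}{8}\log_2\tfrac{3}{8} + \tfrac{4}{8}\log_2\tfrac{4}{8}\right) \approx 1.41$$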


• Such data has an enormous impurity
• What if the data is composed of all purple? What is E? 0, why?
– Such a dataset has no impurity. This implies that such a dataset would not be useful for learning.
However, if we have a dataset with say, two classes, half made up of yellow and the other half
being purple, the entropy will be one, then learning is important (not trivial)
Information gain
• We can define information gain as a measure of how much
information a feature provides about a class.
• Information gain helps to determine the order of attributes in the
nodes of a decision tree.
• The main node is referred to as the parent node, whereas sub-nodes
are known as child nodes. We can use information gain to determine
how good the splitting of nodes in a decision tree
• It can help us determine the quality of splitting, as we shall soon see.
• The calculation of information gain should help us understand this
concept better
Gain = E_parent − E_children

• The term Gain represents information gain.
– E_parent is the entropy of the parent node and E_children is the weighted average entropy of the child nodes.
– Example (next slide)
Example
• Suppose we have a dataset with two classes. This dataset has 5 purple and 5 green
examples. The initial value of entropy will be given by the equation below. Since
the dataset is balanced, we expect the answer to be 1
E_initial = −((0.5 · log2 0.5) + (0.5 · log2 0.5)) = 1

• Say we split the dataset into two branches. One branch ends up having four values
while the other has six. The left branch has four purples while the right one has five
green and one purple.
• We mentioned that when all the observations belong to the same class, the entropy is zero since the dataset is pure. As such, the entropy of the left branch is zero.
• On the other hand, the right branch has five green and one purple, so its entropy is non-zero (see below).
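The branch entropies, reconstructed here since the equation images are not part of this text version:

$$E_{left} = 0, \qquad E_{right} = -\left(\tfrac{5}{6}\log_2\tfrac{5}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 0.65$$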
Example cont’d
• A perfect split would have five examples on each branch.
• This is clearly not a perfect split, but we can determine how good the split is.
• We know the entropy of each of the two branches. We weight the entropy of each
branch by the number of elements each contains.
• This helps us calculate the quality of the split.
• The one on the left has 4, while the other has 6 out of a total of 10. Therefore, the
weighting goes as shown below

• The entropy before the split was 1; after splitting, the weighted entropy is 0.39
• We can now get our information gain, which is the entropy we “lost” after splitting

• The more the entropy removed, the more the information gained, the better the
split
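Reconstructing the weighting and the resulting gain (the equation images are not included here):

$$E_{split} = \tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot 0.65 \approx 0.39, \qquad Gain = 1 - 0.39 \approx 0.61$$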
Using information gain to build decision tree
• Since we now understand entropy and information gain, building decision
trees becomes a simple process.
• Let’s list them
1. An attribute with the highest information gain from a set should be selected as the parent (root) node (in the slide's illustration, this is attribute A).

2. Build child nodes for every value of attribute A.

3. Repeat iteratively until you finish constructing the whole tree.


Information Gain
• The Information Gain measures the expected
reduction in entropy due to splitting on an attribute A
GAIN_split = Entropy(S) − Σ_{i=1..k} (n_i / n) · Entropy(i)

where the parent node S is split into k partitions and n_i is the number of records in partition i

• Information Gain measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
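A minimal Python sketch of this computation (helper names are my own; the labels below come from the sunburn example on the next slides, splitting on hair colour):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(S) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

parent = ["S", "N", "N", "S", "S", "N", "N", "N"]            # 3 sunburned, 5 none
blonde, brown, red = ["S", "N", "S", "N"], ["N", "N", "N"], ["S"]
print(round(information_gain(parent, [blonde, brown, red]), 3))  # ~0.454
```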
Example 1: The problem of “Sunburn”
• You want to predict whether another person is likely to get
sunburned if he is back to the beach. How can you do this?
• Data Collected: predict based on the observed properties of the
people who have been to a beach
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Kate Blonde Short Light Yes None
Attribute Selection by Information Gain to construct the
optimal decision tree

• Entropy: The Disorder of Sunburned
D({"Sarah", "Dana", "Alex", "Annie", "Emily", "Pete", "John", "Katie"})
= D(3+, 5−) = −(3/8) · log2(3/8) − (5/8) · log2(5/8) = 0.954
Calculate the Average Disorder related to Hair Colour
IG(HC) = E(beforeHC) – E(afterHC)
       = 0.954 – (4/8·E(2,2) + 3/8·E(0,3) + 1/8·E(1,0))
       = 0.954 – (4/8·1 + 3/8·0 + 1/8·0)
• So the average disorder created when splitting on 'hair colour' is 0.5 + 0 + 0 = 0.5
IG(HC) = 0.954 – 0.5 = 0.454

IG(lotion) = E(beforeLotion) – E(afterLotion)
           = 0.954 – (3/8·E(0,3) + 5/8·E(3,2))
           = 0.954 – (3/8·0 + 5/8·(–3/5·log2(3/5) – 2/5·log2(2/5)))
           = 0.954 – (0 + 0.61) = 0.344
IG(weight) = ??
IG(Height) = ??
Which attribute minimises the disorder?
Attribute   Average Disorder
hair        0.50
height      0.69
weight      0.94
lotion      0.61
• Which decision variable maximises the Info Gain then?
• Remember it’s the one which minimises the average disorder.
 Gain(hair) = 0.954 - 0.50 = 0.454
 Gain(height) = 0.954 - 0.69 =0.264
 Gain(weight) = 0.954 - 0.94 =0.014
 Gain (lotion) = 0.954 - 0.61 =0.344
The best decision tree?
is_sunburned
Hair colour?
  blonde → (2, 2)?  [Sunburned = Sarah, Annie; None = Dana, Katie]
  red    → Sunburned (Emily)
  brown  → None (Alex, Pete, John)
• Once we have finished with hair colour we then need to calculate the
remaining branches of the decision tree. Which attributes is better to classify
the remaining ?
• E(2,2) = 1
• IG(lotion) = E(beforeLotion) – E(afterLotion)
             = 1 – (2/4·E(0,2) + 2/4·E(2,0)) = 1 – (2/4·0 + 2/4·0) = 1
• IG(weight) = 1 – (0·E(0,0) + 2/4·E(1,1) + 2/4·E(1,1)) = 1 – (0 + 2/4·1 + 2/4·1) = 1 – 1 = 0
The best Decision Tree
• This is the simplest and optimal one possible and it makes a lot of
sense.
• It classifies 4 of the people on just the hair colour alone.

is_sunburned
Hair colour?
  red    → Sunburned (Emily)
  brown  → None (Alex, Pete, John)
  blonde → Lotion used?
             no  → Sunburned (Sarah, Annie)
             yes → None (Dana, Katie)
Sunburn sufferers are ...
• You can view the decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn:

If (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)
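A runnable Python version of the same rule, checked against the eight people in the data table (a sketch; the function and attribute names are my own):

```python
def is_sunburned(hair, lotion):
    """Learned rule: red hair, or blonde hair with no lotion."""
    if hair == "red":
        return True
    if hair == "blonde" and lotion == "no":
        return True
    return False

people = [  # (name, hair, lotion used, actually sunburned?)
    ("Sarah", "blonde", "no", True), ("Dana", "blonde", "yes", False),
    ("Alex", "brown", "yes", False), ("Annie", "blonde", "no", True),
    ("Emily", "red", "no", True),    ("Pete", "brown", "no", False),
    ("John", "brown", "no", False),  ("Kate", "blonde", "yes", False),
]
assert all(is_sunburned(h, l) == y for _, h, l, y in people)
```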
Takeaway problem
• Find out the calculation error (if any) in solving
the sunburn decision tree problem and share
your corrected version
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification
methods)
• Convertible to simple and easy to understand classification if-
then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution; works well on noisy data.

Pros:
+ Reasonable training time
+ Easy to understand & interpret
+ Easy to generate rules & implement
+ Can handle a large number of features

Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
Clustering
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster

• Given a set of data points, each having a set of attributes, and a similarity measure among them, group the points into some number of clusters, so that
– Data points in the same cluster are similar to one another.
– Data points in separate clusters are dissimilar to one another.
[Figure: scatter of points falling into a few well-separated groups]
Example: Clustering
• The example below demonstrates the clustering of padlocks of the same kind. There are a total of 10 padlocks, which vary in color, size, shape, etc.
• How many possible clusters of padlocks can be identified?
– There are three different kinds of padlocks, which can be grouped into three different clusters.
– The padlocks of the same kind are clustered into a group as shown below.
Clustering: Document Clustering
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach:
 Identify content-bearing terms in each document.
 Form a similarity measure based on the frequencies of different
terms and use it to cluster documents.
• Application:
 Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.
Quality: What Is Good Clustering?
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
– Key requirement of clustering: need a good measure of similarity between instances.
• A good clustering method will produce high quality clusters with
– high intra-class similarity (intra-cluster distances are minimized)
– low inter-class similarity (inter-cluster distances are maximized)
Evaluation based on internal information
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are still
the key.
Cluster Evaluation: Hard Problem
• The quality of a clustering is very hard to evaluate
because
– We do not know the correct clusters/classes
• Some methods are used:
– Direct evaluation (using either User inspection or Ground
Truth)
– Indirect Evaluation

• User inspection
– Study centroids of the cluster, and spreads of data items in
each cluster
– For text documents, one can read some documents in clusters
to evaluate the quality of clustering algorithms employed.

Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.

• After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.
Evaluation of Cluster Quality using Purity
• Quality measured by its ability to discover some or all of the
hidden patterns or latent classes in gold standard data
• Assesses a clustering with respect to ground truth … requires
labeled data
• Assume documents with C gold standard classes, while our
clustering algorithms produce K clusters, ω1, ω2, …, ωK with ni
members
• Simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi:

Purity(ωi) = (1/ni) · max_j (nij),  j ∈ C

• Others are entropy of classes in clusters (or mutual information between classes and clusters)
Purity example

[Figure: three clusters of items from three classes; the class counts per cluster are (5, 1, 0) for Cluster I, (1, 4, 1) for Cluster II, and (2, 0, 3) for Cluster III]

• Assume that we cluster three categories of data items (colored red, blue and green) into three clusters as shown in the figure above. Calculate purity to measure the quality of each cluster.
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 = 83%
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 = 67%
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 = 60%
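A minimal Python sketch of this purity calculation, using the class counts from the example above (function name is my own):

```python
def purity(cluster_counts):
    """Purity of one cluster: dominant class count / cluster size."""
    return max(cluster_counts) / sum(cluster_counts)

# Class counts per cluster, taken from the example above
for counts in ([5, 1, 0], [1, 4, 1], [2, 0, 3]):
    print(round(purity(counts), 2))   # 0.83, 0.67, 0.6
```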
Exercise: Cluster Quality Measure
• Assume we have a text collection D of 900 documents from
three topics (or three classes). Science, Sports, and Politics.
Each class has 300 documents. Each document in D is labeled
with one of the topics/classes. We use this collection to
perform clustering to find three clusters, and the result is
shown in the following table:
– Measure the effectiveness of the clustering algorithm using
purity and also entropy?
Cluster   Science   Sports   Politics   Purity   Entropy
1         250       20       10
2         20        180      80
3         30        100      210
Total     300       300      300
Indirect Evaluation
• In some applications, clustering is not the primary task, but
used to help perform another task.
• We can use the performance on the primary task to compare
clustering methods.
• For instance, consider designing a recommender system whose primary task is to provide recommendations on book purchasing to online shoppers.
– If we can cluster books according to their features, we might be able to provide better recommendations.
– We can evaluate different clustering algorithms based on how well they help with the recommendation task.
– Here, we assume that the recommendation can be reliably evaluated.
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of distance
“farness” or “nearness” measurement between data points.
– Distances are normally used to measure the similarity or dissimilarity
between two data objects
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
• A popular similarity measure is the Minkowski distance:

dis(X, Y) = ( Σ_{i=1..n} |xi – yi|^q )^(1/q)

where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-dimensional data objects; n is the size of the vector of attributes of the data object; q = 1, 2, 3, …
Similarity & Dissimilarity Between Objects
• If q = 1, dis is the Manhattan distance:

dis(X, Y) = Σ_{i=1..n} |xi – yi|

• If q = 2, dis is the Euclidean distance:

dis(X, Y) = sqrt( Σ_{i=1..n} (xi – yi)^2 )
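A small Python sketch of these distances (the function name is my own; the sample points reuse coordinates from the k-means example later in the slides):

```python
def minkowski(x, y, q):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

print(minkowski((2, 10), (8, 4), q=1))            # 12.0 (Manhattan)
print(round(minkowski((2, 10), (8, 4), q=2), 2))  # ~8.49 (Euclidean)
```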
Example: Similarity measure
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1= 25

||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)½ =(42)½ =


6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)½ = (17)½ =
4.12

cos(d1, d2 ) = 0.94

66
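A quick Python check of this cosine-similarity calculation (a sketch; helper names are my own):

```python
import math

def cosine(a, b):
    """Cosine similarity: (a · b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94
```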
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters; such that, sum of squared
distance is minimum
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the
cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Algorithm
 Given k (number of clusters), the k-means algorithm is
implemented as follows:
– Select K cluster points randomly as initial centroids
– Repeat until the centroids don't change
• Compute similarity between each instance and
each cluster
• Assign each instance to the cluster with the
nearest seed point
• Recompute the centroids of each K clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)

Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5)
A6(6, 4) A7(1, 2) A8(4, 9).
– Assume that the initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
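A compact Python sketch of k-means on this exact data, using the Manhattan distance and the given initial centers (variable names are my own):

```python
# Points and initial centroids from the example above
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = [(2, 10), (8, 4), (1, 2)]

def manhattan(a, c):
    return abs(a[0] - c[0]) + abs(a[1] - c[1])

while True:
    # Assignment step: each point goes to its nearest centroid
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        idx = min(range(len(centroids)), key=lambda i: manhattan(p, centroids[i]))
        clusters[idx].append(name)
    # Update step: recompute each centroid as the mean of its cluster
    new_centroids = [
        (sum(points[n][0] for n in c) / len(c), sum(points[n][1] for n in c) / len(c))
        for c in clusters]
    if new_centroids == centroids:   # stop when the centroids no longer change
        break
    centroids = new_centroids

print(clusters)    # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
print(centroids)   # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]
```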
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers - centroids, are (2, 10), (8,4) and (1, 2) - chosen
randomly.
Data Points Cluster 1 with Cluster 2 with Cluster 3 with Cluster
centroid (2,10) centroid (8, 4) centroid (1, 2)
A1 (2, 10) 0 12 9 1
A2 (2, 5) 5 7 4 3
A3 (8, 4) 12 0 9 2
A4 (5, 8) 5 7 10 1
A5 (7, 5) 10 2 9 2
A6 (6, 4) 10 2 7 2
A7 (1, 2) 9 9 0 3
A8 (4, 9) 3 9 10 1
Next, we calculate the distance from each point to each of the three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the
three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide which cluster the point (2, 10) should be placed in: the one where the point has the shortest distance to the mean, i.e. mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to cluster 3
since mean 3 is the shortest distance from A2.
• Analogically, we fill in the rest of the table, and place each
point in one of the clusters
Iteration 1
• Next, we need to re-compute the new cluster centers. We
do so, by taking the mean of all points in each cluster.
• For Cluster 1, we have three points and needs to take
average of them as new centroid, i.e.
((2+5+4)/3, (10+8+9)/3) = (3.67, 9)
• For Cluster 2, we have three points. The new centroid is:
((8+7+6)/3, (4+5+4)/3 ) = (7, 4.33)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• Since the centroids changed in Iteration 1 (epoch 1), we go to the next iteration (epoch 2) using the new means we computed.
– The iteration continues until the centroids do not change anymore.
Second epoch
• Using the new centroids, compute the cluster members again.
Data Points Cluster 1 Cluster 2 Cluster 3 Cluster
with centroid with centroid with centroid
(3.67, 9) (7, 4.33) (1.5, 3.5)
A1 (2, 10) 2.67 10.67 7 1
A2 (2, 5) 5.67 5.67 2 3
A3 (8, 4) 2
A4 (5, 8) 1
A5 (7, 5) 2
A6 (6, 4) 2
A7 (1, 2) 3
A8 (4, 9) 1
• After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 2nd epoch there is no change in the cluster memberships or the centroids, so the algorithm stops.
• The result of the clustering is shown in the accompanying figure (not included in this text version).
THANK YOU
