4 - Data Analytics Using DM and ML Algorithms - 1
Data Analytics
Using Data Mining Algorithms for Data Analytics
Objectives
• At the end of this unit, we plan to understand:
– What data analytics is
– classification algorithms
• Decision tree classification
– clustering algorithms
• K-means clustering
DATA MINING: a step in the process of Data Analytics
• Data analytics deals with every step in the process of
building a data-driven model, including data mining
• Data mining is therefore a step in the process of data
analytics
– Predictive modeling
– Descriptive modeling
What is Data Mining?
• DM is the process of discovery of useful and
hidden patterns in large quantities of data
using machine learning algorithms
– It is concerned with the non-trivial extraction of
implicit, previously unknown and potentially useful
information and knowledge from data
– It discovers meaningful patterns that are valid,
novel, useful and understandable.
• The major tasks of data mining include:
– Classification
– Clustering
– Association rule discovery
Data Mining Main Tasks
• Prediction Methods: create a model to predict the class of unknown or new instances.
• Description Methods: construct a model that can describe the existing data.
DM Task: Predictive Modeling
• A predictive model makes a prediction/forecast about
values of data using known results found from different
historical data
– Prediction Methods use existing variables to predict unknown
or future values of other variables.
• Predict one variable Y given a set of other variables X.
Here X could be an n-dimensional vector
– In effect this is a function approximation through learning the
relationship between Y and X
• Many, many algorithms for predictive modeling in
statistics and machine learning, including
– Classification, regression, etc.
• Often the emphasis is on predictive accuracy, less
emphasis on understanding the model
Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a classifying
attribute and uses it in classifying new data
Predictive Modeling: Customer Scoring
• Goal: To predict whether a customer is a high risk
customer or not.
– Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use machine learning to rank new customers as a
function of p(mortgage | customer data)
• Customer data
– History of transactions with the bank
– Other credit data (obtained from Experian, etc)
– Demographic data on the customer or where they live
• Techniques
– Binary classification: logistic regression, decision trees, etc
– Many, many applications of this nature
Classification
[Figure: scatter plot of cell properties against Red Blood Cell Volume]
Pattern (Association Rule) Discovery
• Goal is to discover interesting “local” patterns
(sequential patterns) in the data rather than to
characterize the data globally
– Also called link analysis (uncovers relationships among data)
Basic Data Mining algorithms
• Classification: which is also called Supervised learning,
maps data into predefined groups or classes to enhance the
prediction process
• Clustering: which is also called Unsupervised learning,
groups similar data together into clusters.
– is used to find appropriate groupings of elements for a set of data.
– Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by a bottom-up approach.
• Association Rule: is also known as market-basket
analysis
– It discovers interesting associations between attributes contained in
a database.
– Based on the frequency with which items co-occur in events, an association
rule tells us: if item X is part of an event, what is the likelihood that
item Y is also part of the event? (see the small example below)
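As an illustrative (hypothetical) example with made-up numbers: the rule
{bread} → {butter} with support = 20% and confidence = 75% means that 20% of all
transactions contain both bread and butter, and that 75% of the transactions
containing bread also contain butter.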
Classification
Classification is a data mining (machine
learning) technique used to predict group
membership of new data instances.
OVERVIEW OF CLASSIFICATION
• Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the
class.
– construct a model for class attribute as a function of the
values of other attributes.
– Given a data set D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm},
the Classification Problem is to define a mapping f: D → C
where each ti is assigned to one class.
• Goal: previously unseen records should be assigned a class
as accurately as possible. A test set is used to determine
the accuracy of the model.
– Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
Classification Examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Predict whether the weather on a particular day will
be “sunny”, “rainy” or “cloudy”.
• Identify individuals with credit risks.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Document classification into the predefined classes,
such as politics, sport, social, economy, law, etc.
CLASSIFICATION: A TWO-STEP PROCESS
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known (a small end-to-end sketch follows below)
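A minimal sketch of the two steps in Python, assuming scikit-learn is available;
the bundled Iris data is used purely as a stand-in for any labeled data set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction on the training set
X, y = load_iris(return_X_y=True)                 # stand-in labeled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)          # hold out an independent test set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage - estimate accuracy on the test set
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))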
Illustrating Classification Task
Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
15  | No      | Large   | 67K     | ?

Flow: Training Set → Learning algorithm (Induction) → Model → Apply Model → Test Set (assign the unknown class labels)
Classification Example

Training Set:
Tid | Refund | Marital Status | Taxable Income | Cheat
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Test Set:
Refund | Marital Status | Taxable Income | Cheat
No     | Single         | 75K            | ?

Flow: Training Set → Learn Classifier → Model → Apply Classifier to the Test Set (predict the unknown Cheat label)
Confusion matrix and performance evaluation
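As a minimal sketch of how a confusion matrix and accuracy can be computed
(assuming scikit-learn; the label vectors below are made up for illustration):

from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical true and predicted class labels for 10 test records
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# rows = actual class, columns = predicted class (for labels 0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)                                            # [[TN FP], [FN TP]]
print("accuracy:", accuracy_score(y_true, y_pred))   # fraction of correctly classified records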
Classification methods
Decision tree
• Decision Tree is a popular supervised learning technique in machine learning,
serving as a hierarchical if-else statement based on feature comparison operators.
• It is used for regression and classification problems, finding relationships between
predictor and response variables.
• The tree structure includes Root, Branch, and Leaf nodes, representing all
possible outcomes based on specific conditions or rules.
• The algorithm aims to create homogeneous Leaf nodes containing records of a
single type in the outcome variable. However, sometimes restrictions may lead to
mixed outcomes in the Leaf nodes.
• To build the tree, the algorithm selects features and thresholds by optimizing a
loss function, aiming for the most accurate predictions.
• Decision Trees offer interpretable models and are widely used for various
applications, from simple binary classification to complex decision-making tasks
• Simply put, it takes the form of a tree with branches representing the potential
answers to a given question. There are several metrics used to train decision
trees; one of them is information gain.
Example of cost function in a decision tree
• Let’s do one thing: I offer you beads and we perform an experiment. I have a box full of an
equal number of beads of two colors: blue and green. You may choose either color, but with
your eyes closed. The fun part is: if you draw a green bead you enjoy a bonus of 5 points,
and if you draw a blue bead you have to submit one more assignment.
• This dilemma, where your decision leads to outcomes with equal probability, is the state
of maximum uncertainty.
• If the box contained only green or only blue beads, we would know the outcome in advance,
and hence the uncertainty (or surprise) would be zero.
• The probability of getting each outcome of a green or blue is:
– P(bead == green) = 0.50
– P(bead == blue) = 1 – 0.50 = 0.50
• When we have only one result either green or blue, then in the absence of uncertainty,
the probability of the event is:
– P(bead== green) = 1
– P(bead== blue) = 1 – 1 = 0
• There is a relationship between heterogeneity and uncertainty: the more heterogeneous
the event, the greater the uncertainty; the more homogeneous the event, the smaller the
uncertainty. This uncertainty is expressed as entropy.
Entropy
• In machine learning, entropy measures the impurity or randomness
present in a dataset.
• It is commonly used in decision tree algorithms to evaluate the
homogeneity of data at a particular node.
• A higher entropy value indicates a more heterogeneous dataset
with diverse classes, while a lower entropy signifies a more pure
and homogeneous subset of data.
• Decision tree models can use entropy to determine the best splits
to make informed decisions and build accurate predictive models.
• For a data set with N classes, the entropy can be calculated by the
following formula, where pi is the probability of randomly selecting
an example in class i:
  H = – Σ i=1..N  pi · log2(pi)
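A minimal sketch of this formula in Python (assuming the class probabilities
are given directly):

import math

def entropy(probs):
    # entropy of a class distribution given as probabilities that sum to 1
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  -> maximum uncertainty for two classes
print(entropy([1.0, 0.0]))   # 0.0  -> pure set, no uncertainty
print(entropy([0.7, 0.3]))   # ~0.88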
Example calculation of entropy
• We shall estimate the entropy for three different scenarios.
• The event Y is getting a green. The heterogeneity or the impurity formula for
two different classes is as follows:
• H(X) = – [(pi * log2 pi) + (qi * log2 qi)]
• where,
– pi = Probability of Y = 1 i.e. probability of success of the event
– qi = Probability of Y = 0 i.e. probability of failure of the event
• Case 1
  Bead color | Quantity | Probability
  Green      | 7        | 0.7
  Blue       | 3        | 0.3
  Total      | 10       | 1
• Case 3
  Bead color | Quantity | Probability
  Green      | 10       | 1
  Blue       | 0        | 0
  Total      | 10       | 1
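Plugging these probabilities into the formula above gives the entropy of each case:
  Case 1: H = –(0.7·log2 0.7 + 0.3·log2 0.3) ≈ 0.88 (mixed colors, high uncertainty)
  Case 3: H = –(1·log2 1) = 0 (only green beads, no uncertainty)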
• Where pr, pp and py are the probabilities of choosing a red, purple and yellow
example respectively, and our equation becomes:
  E = – (pr·log2 pr + pp·log2 pp + py·log2 py)
• Say we split the dataset into two branches. One branch ends up having four values
while the other has six. The left branch has four purples while the right one has five
green and one purple.
• We mentioned that when all the observations belong to the same class, the
entropy is zero since the dataset is pure. As such, the entropy of the left branch is:
  E(left) = 0
• On the other hand, the right branch has five green and one purple. Thus:
  E(right) = – (5/6·log2(5/6) + 1/6·log2(1/6)) ≈ 0.65
Example cont’d
• A perfect split would have five examples on each branch.
• This is clearly not a perfect split, but we can determine how good the split is.
• We know the entropy of each of the two branches. We weight the entropy of each
branch by the number of elements each contains.
• This helps us calculate the quality of the split.
• The one on the left has 4 elements, while the other has 6, out of a total of 10. Therefore, the
weighting goes as shown below:
  E(after split) = (4/10)·0 + (6/10)·0.65 ≈ 0.39
• The entropy before the split was initially 1; after splitting, the current value is 0.39.
• We can now get our information gain, which is the entropy we “lost” after splitting:
  IG = 1 – 0.39 = 0.61
• The more entropy removed, the more information gained, and the better the split.
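The same arithmetic as a short Python sketch (values taken from this example;
the entropy helper mirrors the formula introduced earlier):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# left branch: 4 purple (pure); right branch: 5 green + 1 purple
e_left = entropy([1.0])                    # 0.0
e_right = entropy([5/6, 1/6])              # ~0.65

e_before = entropy([0.5, 0.5])             # 1.0, entropy before the split
e_after = 0.4 * e_left + 0.6 * e_right     # weighted by branch sizes, ~0.39

print("information gain:", round(e_before - e_after, 2))   # ~0.61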
Using information gain to build decision tree
• Since we now understand entropy and information gain, building decision
trees becomes a simple process.
• Let’s list them
1. An attribute with the highest information gain from a set should be
selected as the parent (root) node. From the image below, it is attribute A.
• For the sunburn example that follows (3 sunburned and 5 not-sunburned examples),
the entropy of the whole data set before any split is:
  E(3, 5) = – (3/8)·log2(3/8) – (5/8)·log2(5/8) = 0.954
Calculate the Average Disorder related to Hair Colour
IG(HC) = E(before HC) – E(after HC)
= 0.954 – (4/8·E(2,2) + 3/8·E(0,3) + 1/8·E(1,0))
= 0.954 – (4/8·1 + 3/8·0 + 1/8·0)
• So the average disorder created when splitting on ‘hair colour’ is
0.5+0+0=0.5
IG(HC) = 0.954 – 0.5 = 0.454
is_sunburned (decision tree learned from the example data):

Hair colour?
  blonde → Lotion used?
             no  → Sunburned (Sarah, Annie)
             yes → None (Dana, Katie)
  brown  → None (Alex, Pete, John)
  red    → Sunburned (Emily)
Sunburn sufferers are ...
• You can view the decision tree as an IF-THEN-ELSE
statement which tells us whether someone will suffer
from sunburn:

if (hair-colour = “red”) then
    return (sunburned = yes)
else if (hair-colour = “blonde” and lotion-used = “no”) then
    return (sunburned = yes)
else
    return (sunburned = no)
Takeaway problem
• Find out the calculation error (if any) in solving
the sunburn decision tree problem and share
your corrected version
Why decision tree induction in DM?
• Relatively faster learning speed (than other classification
methods)
• Convertible to simple and easy to understand classification if-
then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution,
works well on noisy data.
Pros:
+ Reasonable training time
+ Easy to understand & interpret
+ Easy to generate rules & implement
+ Can handle a large number of features

Cons:
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
Clustering
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster
• User inspection
– Study centroids of the cluster, and spreads of data items in
each cluster
– For text documents, one can read some documents in clusters
to evaluate the quality of clustering algorithms employed.
Cluster Evaluation: Ground Truth
• We use some labeled data (for classification)
– Assumption: Each class is a cluster.
[Figure: labeled data grouped into three clusters (Cluster I, Cluster II, Cluster III)]
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of distance
“farness” or “nearness” measurement between data points.
– Distances are normally used to measure the similarity or dissimilarity
between two data objects
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
• A popular similarity measure is the Minkowski distance:

  dis(X, Y) = ( Σ i=1..n  |xi – yi|^q )^(1/q)

  where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-dimensional
  data objects; n is the size of the attribute vector of the data object;
  q = 1, 2, 3, …
Similarity & Dissimilarity Between Objects
• If q = 1, dis is the Manhattan distance:

  dis(X, Y) = Σ i=1..n  |xi – yi|

• If q = 2, dis is the Euclidean distance:

  dis(X, Y) = ( Σ i=1..n  (xi – yi)^2 )^(1/2)
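A small sketch of these distance measures in Python (plain lists as vectors;
the function names are illustrative):

def minkowski(x, y, q):
    # Minkowski distance of order q between two equal-length vectors
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def manhattan(x, y):      # q = 1
    return minkowski(x, y, 1)

def euclidean(x, y):      # q = 2
    return minkowski(x, y, 2)

print(manhattan([2, 10], [8, 4]))   # 12, as used in the k-means example later
print(euclidean([2, 10], [8, 4]))   # ~8.49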
Example: Similarity measure
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5^2 + 3^2 + 2^2 + 2^2)^0.5 = (42)^0.5 ≈ 6.48
||d2|| = (3^2 + 2^2 + 1^2 + 1^2 + 1^2 + 1^2)^0.5 = (17)^0.5 ≈ 4.12
cos(d1, d2) = 25 / (6.48 × 4.12) ≈ 0.94
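A quick check of this result in Python (assuming NumPy is available):

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# cosine similarity = dot product / (product of vector lengths)
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94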
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters, such that the sum of squared
distances is minimized
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the
cluster
• K is the number of clusters to partition the dataset
• Means refers to the average location of members of a
particular cluster
– k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Algorithm
Given k (number of clusters), the k-means algorithm is
implemented as follows:
– Select K cluster points randomly as initial centroids
– Repeat until the centroids do not change:
  • Compute the similarity (distance) between each instance and each cluster centroid
  • Assign each instance to the cluster with the nearest seed point (centroid)
  • Recompute the centroid of each of the K clusters of the current partition
    (the centroid is the center, i.e., mean point, of the cluster)
  (A short sketch of this loop in Python is given below.)
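A minimal k-means sketch in Python (Euclidean distance, plain (x, y) tuples;
a bare-bones illustration rather than a production implementation):

import random

def kmeans(points, k, max_iters=100):
    # points is a list of (x, y) tuples
    centroids = random.sample(points, k)            # pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: mean of each cluster (keep old centroid if a cluster is empty)
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return centroids, clusters

For example, kmeans([(2,10),(2,5),(8,4),(5,8),(7,5),(6,4),(1,2),(4,9)], 3) groups
the eight points used in the worked example that follows (the result depends on
the randomly chosen initial centroids).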
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5)
A6(6, 4) A7(1, 2) A8(4, 9).
– Assume that the initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers - centroids, are (2, 10), (8,4) and (1, 2) - chosen
randomly.
Data Point | Cluster 1, centroid (2, 10) | Cluster 2, centroid (8, 4) | Cluster 3, centroid (1, 2) | Assigned Cluster
A1 (2, 10) |  0 | 12 |  9 | 1
A2 (2, 5)  |  5 |  7 |  4 | 3
A3 (8, 4)  | 12 |  0 |  9 | 2
A4 (5, 8)  |  5 |  7 | 10 | 1
A5 (7, 5)  | 10 |  2 |  9 | 2
A6 (6, 4)  | 10 |  2 |  7 | 2
A7 (1, 2)  |  9 |  9 |  0 | 3
A8 (4, 9)  |  3 |  9 | 10 | 1
Next, we will calculate the distance from each points to each of the
three centroids, by using the distance function:
dis(point i,mean j)=|x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the
three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide which cluster the point (2, 10) should be
placed in: the one where the point has the shortest distance to the mean, i.e.
mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to cluster 3
since mean 3 is the shortest distance from A2.
• Analogically, we fill in the rest of the table, and place each
point in one of the clusters
Iteration 1
• Next, we need to re-compute the new cluster centers. We
do so, by taking the mean of all points in each cluster.
• For Cluster 1, we have three points and we take their
average as the new centroid, i.e.
((2+5+4)/3, (10+8+9)/3) = (3.67, 9)
• For Cluster 2, we have three points. The new centroid is:
((8+7+6)/3, (4+5+4)/3 ) = (7, 4.33)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• Since the centroids changed in Iteration 1 (epoch 1), we go to the
next iteration (epoch 2) using the new means we computed.
– The iteration continues until the centroids do not change anymore..
Second epoch
• Using the new centroid compute cluster members again.
Data Point | Cluster 1, centroid (3.67, 9) | Cluster 2, centroid (7, 4.33) | Cluster 3, centroid (1.5, 3.5) | Assigned Cluster
A1 (2, 10) | 2.67 | 10.67 | 7 | 1
A2 (2, 5)  | 5.67 |  5.67 | 2 | 3
A3 (8, 4)  | 9.33 |  1.33 | 7 | 2
A4 (5, 8)  | 2.33 |  5.67 | 8 | 1
A5 (7, 5)  | 7.33 |  0.67 | 7 | 2
A6 (6, 4)  | 7.33 |  1.33 | 5 | 2
A7 (1, 2)  | 9.67 |  8.33 | 2 | 3
A8 (4, 9)  | 0.33 |  7.67 | 8 | 1
• After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 2nd epoch there is no change of cluster members or centroids,
so the algorithm stops.
• The final clusters and centroids are the ones shown above.
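As a cross-check, a short Python sketch that runs the same Manhattan-distance
k-means with the same fixed initial centroids (a verification aid, not part of
the original worked example):

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = [(2, 10), (8, 4), (1, 2)]            # initial centroids A1, A3, A7

def manhattan(p, c):
    return abs(p[0] - c[0]) + abs(p[1] - c[1])

while True:
    # assignment step: nearest centroid by Manhattan distance
    clusters = {i: [] for i in range(3)}
    for name, p in points.items():
        dists = [manhattan(p, c) for c in centroids]
        clusters[dists.index(min(dists))].append(name)
    # update step: mean of the points in each cluster
    new_centroids = []
    for i in range(3):
        xs = [points[n][0] for n in clusters[i]]
        ys = [points[n][1] for n in clusters[i]]
        new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    if new_centroids == centroids:               # converged, stop
        break
    centroids = new_centroids

print(clusters)    # {0: ['A1', 'A4', 'A8'], 1: ['A3', 'A5', 'A6'], 2: ['A2', 'A7']}
print(centroids)   # approximately [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]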
THANK YOU