Clustering Categorical Data: The Case of Quran Verses
Presented By
Muhammad Al-Watban
IS 598
Outline
Introduction
Preprocessing of Quran Verses
Similarity Measures
Assessing Cluster Similarities
Shortcomings of traditional clustering methods with categorical data
ROCK - Major definitions
ROCK Clustering Algorithm
ROCK Example
Conclusion and future work
Introduction
The Holy Quran covers a wide range of topics.
The Quran does not cover each topic with a set of consecutive verses or suras.
A single verse usually deals with many subjects.
Project goal: to cluster the verses of the Holy Quran based on their subjects.
Preprocessing of Quran Verses
It is necessary to perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
Verses of the Holy Quran can be viewed as records, and the related subjects as attributes of each record, giving a verse-by-subject table.
Similarity Measures
Two types of attributes:
1. Continuous attributes:
The range of attribute values is continuous and ordered.
Includes attributes with numeric values (e.g., salary).
Also includes attributes whose allowed values form an ordered, meaningful sequence (e.g., professional ranks, disease severity levels).
The similarity (or dissimilarity) between objects is computed from the distance between them.
The most commonly used distance measures are the Euclidean and Manhattan distances.
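A minimal Java sketch of these two measures for numeric attribute vectors (class and method names are ours):

    public class Distances {
        // Euclidean distance: square root of the sum of squared differences.
        static double euclidean(double[] x, double[] y) {
            double sum = 0;
            for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
            return Math.sqrt(sum);
        }

        // Manhattan distance: sum of absolute differences.
        static double manhattan(double[] x, double[] y) {
            double sum = 0;
            for (int i = 0; i < x.length; i++) sum += Math.abs(x[i] - y[i]);
            return sum;
        }
    }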
Similarity Measures
2. Categorical attributes:
consists of attributes whose underlying domain is not ordered
Examples: colors, blood type.
If the attribute has only two states (namely 0 and 1), then it is
called binary; if it has more than two states, it is called nominal.
Similarity Measures
Where does the Quran treasures data fit?
Each verse can be represented by a record with Boolean attributes, where each attribute corresponds to a single subject.
The attribute corresponding to a subject is T if the verse contains that subject; otherwise, it is F.
As noted, Boolean attributes are a special case of categorical attributes.
Assessing Cluster Similarities
Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters rather than between individual elements.
There are several standard methods:
1. Single linkage:
D(r,s), the distance between clusters r and s, is defined as the distance between the closest pair of objects, one from each cluster.
Assessing Cluster Similarities
2. Complete linkage:
The distance is defined as the distance between the farthest pair of objects, one from each cluster.
3. Average linkage:
The distance is defined as the average of the distances over all pairs of objects drawn from the two different clusters r and s.
Assessing Cluster Similarities
4. Centroid linkage:
The distance between clusters is defined as the distance between the pair of cluster centroids.
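A minimal Java sketch of the four linkage definitions (all names are ours; dist can be any point-to-point distance, such as the Euclidean distance above):

    import java.util.List;
    import java.util.function.BiFunction;

    public class Linkage {
        // Single linkage: distance between the closest cross-cluster pair.
        static double single(List<double[]> r, List<double[]> s,
                             BiFunction<double[], double[], Double> dist) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] x : r) for (double[] y : s) best = Math.min(best, dist.apply(x, y));
            return best;
        }

        // Complete linkage: distance between the farthest cross-cluster pair.
        static double complete(List<double[]> r, List<double[]> s,
                               BiFunction<double[], double[], Double> dist) {
            double best = 0;
            for (double[] x : r) for (double[] y : s) best = Math.max(best, dist.apply(x, y));
            return best;
        }

        // Average linkage: mean distance over all cross-cluster pairs.
        static double average(List<double[]> r, List<double[]> s,
                              BiFunction<double[], double[], Double> dist) {
            double sum = 0;
            for (double[] x : r) for (double[] y : s) sum += dist.apply(x, y);
            return sum / (r.size() * s.size());
        }

        // Centroid linkage: distance between the two cluster centroids.
        static double centroid(List<double[]> r, List<double[]> s,
                               BiFunction<double[], double[], Double> dist) {
            return dist.apply(mean(r), mean(s));
        }

        static double[] mean(List<double[]> c) {
            double[] m = new double[c.get(0).length];
            for (double[] x : c) for (int i = 0; i < m.length; i++) m[i] += x[i];
            for (int i = 0; i < m.length; i++) m[i] /= c.size();
            return m;
        }
    }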
Shortcomings of traditional clustering methods with categorical data
Example
Consider the following four market-basket transactions:
T1= {1, 2, 3, 4}
T2= {1, 2, 4}
T3= {3}
T4= {4}
Converting these transactions to Boolean points, we get:
P1= (1, 1, 1, 1)
P2= (1, 1, 0, 1)
P3= (0, 0, 1, 0)
P4= (0, 0, 0, 1)
Using Euclidean distance to measure the closeness between all pairs of points, we find that d(P1, P2) is the smallest distance:

d(P1, P2) = √(|1−1|² + |1−1|² + |1−0|² + |1−1|²) = 1
Shortcomings of traditional clustering methods with categorical data
If we use the centroid-based hierarchical algorithm, then we merge P1 and P2 and get a new cluster (P12) with centroid (1, 1, 0.5, 1).
Then, using Euclidean distance again, we find:
d(P12, P3) = √3.25 ≈ 1.80
d(P12, P4) = √2.25 = 1.50
d(P3, P4) = √2 ≈ 1.41
So we should merge P3 and P4, since the distance between them is the shortest.
However, T3 and T4 do not share even a single item.
So using distance metrics as a similarity measure for categorical data is not appropriate.
The solution is ROCK.
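A quick Java check of the distances above (a sketch; names are ours):

    public class CentroidCheck {
        static double euclidean(double[] x, double[] y) {
            double sum = 0;
            for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            double[] p12 = {1, 1, 0.5, 1}, p3 = {0, 0, 1, 0}, p4 = {0, 0, 0, 1};
            System.out.println(euclidean(p12, p3));  // ≈ 1.80
            System.out.println(euclidean(p12, p4));  // 1.5
            System.out.println(euclidean(p3, p4));   // ≈ 1.41
        }
    }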
ROCK - Major definitions
Similarity function
Neighbors
Links
Criterion function
Goodness measure
Similarity function
Let Sim(Pi, Pj) be a similarity function used to measure the closeness between points Pi and Pj.
ROCK assumes that the Sim function is normalized to return a value between 0 and 1.
For the Quran treasures data, a possible definition of the Sim function is based on the Jaccard coefficient:

sim(Pi, Pj) = |Pi ∩ Pj| / |Pi ∪ Pj|
Example: similarity function
Suppose two verses (P1 and P2) contain the following subjects:
P1={ judgment, faith, prayer, fair}
P2={ fasting, faith, prayer}
Sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2/5 = 0.40
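A minimal Java sketch of this similarity function over subject sets (all names are ours):

    import java.util.HashSet;
    import java.util.Set;

    public class JaccardSim {
        // Jaccard coefficient: |Pi ∩ Pj| / |Pi ∪ Pj|.
        static double sim(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);                      // intersection
            Set<String> union = new HashSet<>(a);
            union.addAll(b);                         // union
            return (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            Set<String> p1 = Set.of("judgment", "faith", "prayer", "fair");
            Set<String> p2 = Set.of("fasting", "faith", "prayer");
            System.out.println(sim(p1, p2));         // 0.4
        }
    }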
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Neighbors and Links
Two points Pi and Pj are neighbors if sim(Pi, Pj) >= θ, where θ is a user-specified similarity threshold; by this definition, every point is a neighbor of itself.
The link between two points, link(Pi, Pj), is the number of their common neighbors.
Example: neighboring and linking
Assume that we have three distinct points P1, P2, and P3, where:
neighbor(P1) = {P1, P2}
neighbor(P2) = {P1, P2, P3}
neighbor(P3) = {P2, P3}
(Neighbor graph figure omitted.)
To find the number of links between two points, say P1 and P3, we count their common neighbors; hence we can define the linkage function between P1 and P3 as:
Link(P1, P3) = |neighbor(P1) ∩ neighbor(P3)| = |{P2}| = 1
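A minimal Java sketch of these two definitions, given a precomputed similarity matrix (all names are ours):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class Links {
        // Neighbor sets: i and j are neighbors if sim[i][j] >= theta.
        // Each point is a neighbor of itself, since sim[i][i] = 1.
        static List<Set<Integer>> neighbors(double[][] sim, double theta) {
            List<Set<Integer>> nbr = new ArrayList<>();
            for (int i = 0; i < sim.length; i++) {
                Set<Integer> s = new HashSet<>();
                for (int j = 0; j < sim.length; j++)
                    if (sim[i][j] >= theta) s.add(j);
                nbr.add(s);
            }
            return nbr;
        }

        // link(p, q) = number of common neighbors of p and q.
        static int link(List<Set<Integer>> nbr, int p, int q) {
            Set<Integer> common = new HashSet<>(nbr.get(p));
            common.retainAll(nbr.get(q));
            return common.size();
        }
    }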
Example: minimum linkages
If we have four points P1, P2, P3, P4, suppose that the similarity threshold (θ) is equal to 1.
Then two points are neighbors if sim(Pi, Pj) >= 1; hence each point is a neighbor only of identical points (i.e., only of itself).
To find Link(P1, P2):
neighbor(P1) = {P1}
neighbor(P2) = {P2}
link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 0
The following table shows the number of links (common neighbors) between the four points:

        P1  P2  P3  P4
P1      -   0   0   0
P2      0   -   0   0
P3      0   0   -   0
P4      0   0   0   -
Example: maximum linkages
If we have four points P1, P2, P3, P4, suppose that the similarity threshold (θ) is equal to 0.
Then two points are neighbors if sim(Pi, Pj) >= 0; hence any pair of points are neighbors.
To find Link(P1, P2):
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P2) = {P1, P2, P3, P4}
link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 4
The following table shows the number of links (common neighbors) between the four points:

        P1  P2  P3  P4
P1      -   4   4   4
P2      4   -   4   4
P3      4   4   -   4
P4      4   4   4   -
Example: illustrating links
From the previous example, we have:
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P3) = {P1, P2, P3, P4}
so link(P1, P3) = |neighbor(P1) ∩ neighbor(P3)| = 4.
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Criterion function
To get the best clusters, we have to maximize the following criterion function:
E_l = Σ (i = 1..k) n_i · Σ (p_q, p_r ∈ C_i) link(p_q, p_r) / n_i^(1 + 2f(θ))

where C_i denotes cluster i,
n_i is the number of points in C_i,
k is the number of clusters, and
θ is the similarity threshold.
Criterion function
By maximizing this criterion function, we maximize the sum of links between intra-cluster point pairs while at the same time minimizing the sum of links between pairs of points belonging to different clusters (i.e., inter-cluster point pairs).
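A minimal Java sketch that evaluates E_l for a given clustering (names are ours; it assumes the f(θ) = (1 − θ)/(1 + θ) adopted later in the example):

    public class Criterion {
        // E_l for a clustering, given for each cluster i the sum of links over
        // its intra-cluster point pairs (intraLinks[i]) and its size (sizes[i]).
        static double criterion(long[] intraLinks, int[] sizes, double theta) {
            double e = 1 + 2 * (1 - theta) / (1 + theta);   // 1 + 2f(θ)
            double total = 0;
            for (int i = 0; i < sizes.length; i++)
                total += sizes[i] * intraLinks[i] / Math.pow(sizes[i], e);
            return total;
        }
    }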
Major definitions
Similarity for data objects
Neighbors
Links
Criterion function
Goodness measure
Goodness measure
Goodness Function
g(C_i, C_j) = link[C_i, C_j] / ((n_i + n_j)^(1 + 2f(θ)) − n_i^(1 + 2f(θ)) − n_j^(1 + 2f(θ)))

where link[C_i, C_j] is the number of cross links between clusters C_i and C_j, and n_i, n_j are their sizes. At each step, the pair of clusters with the highest goodness is merged.
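A minimal Java sketch of the goodness measure (names are ours; f(θ) = (1 − θ)/(1 + θ) as in the example later):

    public class Goodness {
        static double f(double theta) {
            return (1 - theta) / (1 + theta);
        }

        // Goodness of merging clusters of sizes ni and nj that are
        // joined by crossLinks cross links.
        static double goodness(int crossLinks, int ni, int nj, double theta) {
            double e = 1 + 2 * f(theta);
            return crossLinks / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
        }
    }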
ROCK Clustering Algorithm
1. Compute links:
Determine each point's neighbors using the similarity threshold θ, then count the common neighbors of every pair of points to obtain their links.
ROCK Clustering Algorithm
2. Perform hierarchical agglomerative clustering:
ROCK performs the steps common to all hierarchical agglomerative clustering algorithms, but with a different definition of the similarity measure: starting from singleton clusters, it repeatedly merges the pair of clusters with the highest goodness measure until the required number of clusters is reached, as sketched below.
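A minimal Java sketch of this merge loop (all names are ours; the published algorithm uses heaps to find the best pair efficiently, which this sketch omits):

    import java.util.ArrayList;
    import java.util.List;

    public class RockLoop {
        // Start from singleton clusters and repeatedly merge the pair with the
        // highest goodness until k clusters remain. link[p][q] holds the link
        // counts between points; cross links of merged clusters simply add up.
        static List<List<Integer>> cluster(int[][] link, int k, double theta) {
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < link.length; i++) {
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double g = goodness(crossLinks(link, clusters.get(i), clusters.get(j)),
                                            clusters.get(i).size(), clusters.get(j).size(), theta);
                        if (g > best) { best = g; bi = i; bj = j; }
                    }
                clusters.get(bi).addAll(clusters.remove(bj));   // merge the best pair
            }
            return clusters;
        }

        static int crossLinks(int[][] link, List<Integer> a, List<Integer> b) {
            int sum = 0;
            for (int p : a) for (int q : b) sum += link[p][q];
            return sum;
        }

        static double goodness(int crossLinks, int ni, int nj, double theta) {
            double e = 1 + 2 * (1 - theta) / (1 + theta);   // f(θ) = (1−θ)/(1+θ)
            return crossLinks / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
        }
    }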
ROCK Example
Suppose we have four verses containing some subjects, as follows:
P1={ judgment, faith, prayer, fair}
P2={ fasting, faith, prayer}
P3={ fair, fasting, faith}
P4={ fasting, prayer, pilgrimage}
The similarity threshold θ = 0.3, and the number of required clusters is 2.
Using the Jaccard coefficient as the similarity measure, we obtain the following similarity table:

        P1     P2     P3     P4
P1      -      0.40   0.40   0.17
P2      0.40   -      0.50   0.50
P3      0.40   0.50   -      0.20
P4      0.17   0.50   0.20   -
ROCK Example
With θ = 0.3, two verses are neighbors if their similarity is at least 0.3, so:
neighbor(P1) = {P1, P2, P3}
neighbor(P2) = {P1, P2, P3, P4}
neighbor(P3) = {P1, P2, P3}
neighbor(P4) = {P2, P4}
Counting common neighbors gives the following link table:

        P1  P2  P3  P4
P1      -   3   3   1
P2      3   -   3   2
P3      3   3   -   1
P4      1   2   1   -
ROCK Example
We compute the goodness measure for all adjacent points, assuming f(θ) = (1 − θ)/(1 + θ):

g(P_i, P_j) = link[P_i, P_j] / ((n + m)^(1 + 2f(θ)) − n^(1 + 2f(θ)) − m^(1 + 2f(θ)))

We obtain the following table:

        P1    P2    P3    P4
P1      -     1.35  1.35  0.45
P2      1.35  -     1.35  0.90
P3      1.35  1.35  -     0.45
P4      0.45  0.90  0.45  -

We have an equal (maximal) goodness measure for merging (P1, P2), (P1, P3), and (P2, P3).
ROCK Example
Now we start the hierarchical algorithm by merging, say, P1 and P2.
A new cluster (let's call it C(P1,P2)) is formed.
It should be noted that other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since Sim(P1, P2) = 0.40 is not the highest similarity. But ROCK uses the number of links, rather than distance, as the similarity measure.
ROCK Example
Now, after merging P1 and P2, we have only three clusters.
The following table shows the number of links (common neighbors) between these clusters; links of merged points simply add up:

           C(P1,P2)  P3  P4
C(P1,P2)   -         6   3
P3         6         -   1
P4         3         1   -

The goodness measure is now highest for merging C(P1,P2) with P3 (g ≈ 1.31, versus ≈ 0.66 for C(P1,P2) with P4 and ≈ 0.45 for P3 with P4).
ROCK Example
Since the number of required clusters is 2,
then we finish the clustering algorithm by
merging C(P1,P2) and P3, obtaining a new
cluster C(P1,P2,P3) which contains
{P1,P2,P3} leaving P4 alone in a separate
cluster.
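To tie the example together, here is a self-contained Java sketch (all names are ours) that reproduces this result end to end:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RockExample {
        public static void main(String[] args) {
            List<Set<String>> verses = List.of(
                    Set.of("judgment", "faith", "prayer", "fair"),    // P1
                    Set.of("fasting", "faith", "prayer"),             // P2
                    Set.of("fair", "fasting", "faith"),               // P3
                    Set.of("fasting", "prayer", "pilgrimage"));       // P4
            double theta = 0.3;
            int n = verses.size(), k = 2;

            // Neighbor sets: Pi and Pj are neighbors if Jaccard(Pi, Pj) >= theta.
            List<Set<Integer>> nbr = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                Set<Integer> s = new HashSet<>();
                for (int j = 0; j < n; j++)
                    if (jaccard(verses.get(i), verses.get(j)) >= theta) s.add(j);
                nbr.add(s);
            }

            // Link counts: number of common neighbors.
            int[][] link = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    Set<Integer> common = new HashSet<>(nbr.get(i));
                    common.retainAll(nbr.get(j));
                    link[i][j] = common.size();
                }

            // Agglomerative loop: merge the pair with the highest goodness.
            double e = 1 + 2 * (1 - theta) / (1 + theta);   // f(θ) = (1−θ)/(1+θ)
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        int cross = 0;
                        for (int p : clusters.get(i)) for (int q : clusters.get(j)) cross += link[p][q];
                        int ni = clusters.get(i).size(), nj = clusters.get(j).size();
                        double g = cross / (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
                        if (g > best) { best = g; bi = i; bj = j; }
                    }
                clusters.get(bi).addAll(clusters.remove(bj));
            }
            System.out.println(clusters);   // prints [[0, 1, 2], [3]]: {P1, P2, P3} and {P4}
        }

        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) inter.size() / union.size();
        }
    }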
Conclusion and future work (1/3)
We aim to apply a clustering technique to the verses of the Holy Quran.
We must first perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
Then we can apply a clustering algorithm that groups similar verses into the same cluster.
Conclusion and future work (2/3)
Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical dataset.
We will apply the general framework of the ROCK algorithm.
The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for categorical data. It introduces the notion of links to measure the similarity between data objects.
Conclusion and future work (3/3)
We will use the Java language to implement the ROCK clustering algorithm.
During testing, we will try to form clusters of verses belonging to a single sura, as well as verses belonging to many different suras.
Insha Allah, we will achieve success in performing this mission.
Thank You for your attention