Document Clustering Method Based On Visual Features
Yucong Liu, Bofeng Zhang, Kun Xing, Bo Zhou
School of Computer Engineering & Science
Shanghai University
Shanghai, China
e-mail: liuyucong@163.com, bfzhang@shu.edu.cn
Abstract-There are two important research problems in the
field of personalized information services based on a user
model. One is how to obtain and describe the user's personal
information, i.e. building the user model; the other is how to
organize the information resources, i.e. document clustering.
It is difficult to find the desired information without a
proper clustering algorithm. Several new ideas have been
proposed in recent years, but most of them take only the text
information into account. Other useful information, such as
text size, font and other appearance characteristics, the
so-called visual features, may contribute more to document
clustering. This paper proposes a method for clustering
scientific documents based on visual features, called the
VF-Clustering algorithm. Five kinds of visual features of
documents are defined: body, abstract, subtitle, keyword and
title. The ideas of crossover and mutation from genetic
algorithms are used to adjust the value of k and the cluster
centers in the k-means algorithm dynamically. Experimental
results support the approach. Among the five visual features,
the clustering accuracy and stability of subtitle are second
only to those of body, while clustering on subtitle is much
more efficient because the subtitle text is much shorter than
the body. The accuracy of clustering by combining subtitle and
keyword is better than that of either alone, but slightly
lower than that of combining subtitle, keyword and body. If
efficiency is an essential factor, clustering by combining
subtitle and keyword can be an optimal choice.
Keywords-document clustering; k-means; visual features;
genetic algorithm
I. INTRODUCTION
In recent years, personalized information services have come
to play an important role in people's lives. There are two
important research problems in this field. One is how to
obtain and describe the user's personal information, i.e.
building the user model; the other is how to organize the
information resources, i.e. document clustering. Personal
information can be described exactly only if user behavior and
the resources users look for or search have been accurately
analyzed. The effectiveness of a personalized service depends
on the completeness and accuracy of the user model, and the
basic operation underlying it is organizing the information
resources. In this paper we focus on document clustering.
At present, millions of scientific documents are available on
the Web. With the rapid growth in their number, indexing or
searching millions of documents and retrieving the desired
information has become an increasing challenge and
opportunity. Clustering plays an important role in analyzing
user interests in the user model, so high-quality scientific
document clustering plays an increasingly important role in
real-world applications such as personalized services and
recommendation systems.
Clustering is a classical method in data mining research.
Scientific document clustering [6][8][9] is a technique which
puts related papers into the same group. The documents within
each group should exhibit a large degree of similarity, while
the similarity among different clusters should be minimized.
In general, there are many clustering algorithms
[1][5][10][13], including partitioning methods [5] (k-means,
k-medoids, etc.), hierarchical methods [16] (BIRCH, CURE,
etc.), density-based methods (DBSCAN, OPTICS, etc.),
grid-based methods (STING, CLIQUE, etc.) and model-based
methods.
In 1967, MacQueen first put forward the k-means [2][3][4][7]
clustering algorithm. The k-means method has been shown to be
effective in producing good clustering results for many
practical applications. However, it suffers from some major
drawbacks that make it inappropriate for some applications.
One major disadvantage is that the number of clusters k must
be specified in advance. Another is its sensitivity to
initialization. These two drawbacks affect not only the
efficiency of the algorithm but also its clustering accuracy.
There are many existing document representation approaches
[11], including the Boolean model, the Vector Space Model
(VSM), the probabilistic retrieval model and the language
model. At present the most popular document representation is
VSM, which was proposed by G. Salton and others in the 1960s.
VSM is an algebraic model that represents text documents as
vectors of identifiers. A document is represented as a vector
such as
$d_i = ((t_1, w_{i1}), (t_2, w_{i2}), \ldots, (t_j, w_{ij}), \ldots, (t_n, w_{in}))$  (1)
where $d_i$ represents the i-th document, $t_j$ the j-th
keyword, and $w_{ij}$ the weight of the j-th keyword in the
i-th document.
This paper adopts classical TF-IDF as the method for
calculating clustering keyword weights because it considers
the occurrence frequency of a word not only in a single
document but also in the whole data set. Furthermore, the size
of each document is also taken into account, and the weight is
defined by (2).
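[Equation (2): keyword weight defined from the keyword
frequency, size(i) and the average document size; the formula
is not recoverable from the source.]
where size(i) is the number of effective characters of the
i-th document, $\overline{size}$ is the average size of all
documents in the data set, and $tf_{i,k_j}$ is the occurrence
frequency of keyword $k_j$ in document $d_i$.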
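As a rough illustration of the classical TF-IDF weighting referred to above, the following minimal sketch computes one weight per keyword and document. It assumes documents are already segmented into word lists, uses term frequency divided by document length and the logarithmic inverse document frequency, and does not reproduce the size-based adjustment of (2). The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Classical TF-IDF only; the paper's size-based adjustment in (2)
    is NOT reproduced here (assumption: plain tf * log(N / df)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus of already-segmented documents.
docs = [
    ["cluster", "kmeans", "cluster"],
    ["cluster", "genetic"],
    ["cloud", "genetic"],
]
w = tf_idf(docs)
```

In this toy corpus "kmeans" occurs in only one document, so it receives a higher weight in the first document than "cluster", which occurs twice there but also appears in a second document.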
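$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$  (3)
For example, let
$d_1 = (1, 2, 2, 1, 0)$
$d_2 = (0, 1, 2, 1, 1)$
Then
$d_1 \cdot d_2 = 1 \times 0 + 2 \times 1 + 2 \times 2 + 1 \times 1 + 0 \times 1 = 7$
$\|d_1\| = \sqrt{1 \times 1 + 2 \times 2 + 2 \times 2 + 1 \times 1 + 0 \times 0} = \sqrt{10}$
$\|d_2\| = \sqrt{0 \times 0 + 1 \times 1 + 2 \times 2 + 1 \times 1 + 1 \times 1} = \sqrt{7}$
$sim(d_1, d_2) = \frac{7}{\sqrt{10} \times \sqrt{7}} \approx 0.84$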
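The example can be checked with a short script. This is a minimal sketch of cosine similarity between two equal-length weight vectors; the function name `cosine_similarity` is an assumption, and returning 0 for a zero vector is a convention chosen here, not something stated in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_a * norm_b)


d1 = (1, 2, 2, 1, 0)
d2 = (0, 1, 2, 1, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))  # 7 / (sqrt(10) * sqrt(7)) -> 0.8367
```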
III. DOCUMENT CLUSTERING ALGORITHM BASED ON
VISUAL FEATURES
The main characteristics of the document clustering algorithm
based on visual features, called VF-Clustering, are as
follows:
1) Five kinds of visual features are defined according to an
analysis of the content and structure of scientific documents:
body (B), abstract (A), subtitle (S), keyword (K) and title
(T). The importance of these features to scientific document
clustering is compared through experiments.
2) In view of the two drawbacks of the k-means algorithm, the
ideas of crossover and mutation from genetic algorithms are
used to improve it: the values of k and the cluster centers
are adjusted dynamically by merging and adding cluster centers
during clustering.
The implementation of the clustering algorithm is introduced
below.
A. Document Presentation Based on Visual Features
As the most widely used document representation method, the
VSM mentioned above is usually applied in one of two ways. In
the first, words are segmented and clustering keywords are
selected according to word frequency, mainly by analyzing the
body of the document or by selecting from the whole document,
and the weight of a clustering keyword is then adjusted
according to its position, i.e. if it occurs in the title or
abstract. In the second, only the title and abstract are
analyzed to retrieve clustering keywords for clustering;
although efficient, the result obtained in this way is not
accurate enough.
In this paper, a document representation based on visual
features is defined with full consideration of the importance
of each visual feature in the whole document. Words are
therefore segmented for each visual feature independently, and
clustering keywords are retrieved from each part with the
feature extraction method introduced above. The clustering
keyword weights of the comprehensive document representation
are then adjusted according to the importance of each visual
feature. The weight is obtained by (4) and the comprehensive
document representation is shown in (5).
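$d_i = ((t_1, w_{i1}), (t_2, w_{i2}), \ldots, (t_j, w_{ij}), \ldots, (t_n, w_{in}))$  (5)
[Text not recoverable from the source: equation (4), the
definitions following (5), and Steps 1 to 3 of the clustering
procedure. The surviving fragment of a worked example gives
the cluster center ((A, 3), (B, 3), (C, 0.33)).]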
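To make the idea of a comprehensive representation concrete, the sketch below combines per-feature keyword weights into one document vector. It assumes a simple linear combination in which each visual feature has an importance coefficient; this is an illustrative assumption and not the paper's formula (4). The names `combine_features`, `feature_weights` and `importance`, and all numeric values, are hypothetical.

```python
def combine_features(feature_weights, importance):
    """Combine per-feature keyword weights into one document vector.

    ASSUMPTION: a linear combination, weight(term) = sum over visual
    features of importance[feature] * weight_in_feature(term).
    This stands in for the paper's own adjustment rule.
    """
    combined = {}
    for feature, weights in feature_weights.items():
        coeff = importance.get(feature, 0.0)  # unknown features are ignored
        for term, weight in weights.items():
            combined[term] = combined.get(term, 0.0) + coeff * weight
    return combined


# Hypothetical keyword weights extracted from subtitle (S) and keyword (K).
feature_weights = {
    "S": {"clustering": 2.0, "kmeans": 1.0},
    "K": {"clustering": 1.0, "genetic": 3.0},
}
importance = {"S": 0.6, "K": 0.4}  # hypothetical importance coefficients
doc_vector = combine_features(feature_weights, importance)
```

A keyword that appears in several visual features accumulates weight from each of them, which is one plausible way to reflect the importance of each feature in the whole document.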
Step 4 Calculate the similarity for every pair of new cluster
centers obtained in Step 3. The crossover idea from genetic
algorithms is used here: two clusters are merged if the
similarity between them is greater than a given threshold. For
example, given the two cluster centers center1 = ((A, 3),
(B, 4), (C, 2)) and center2 = ((A, 2), (B, 4), (C, 3)), the
two are merged into one cluster center by averaging the
weights of each keyword, that is,
$center = ((A, \frac{3+2}{2}), (B, \frac{4+4}{2}), (C, \frac{2+3}{2})) = ((A, 2.5), (B, 4), (C, 2.5))$
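The merge in the example amounts to averaging the two centers keyword by keyword. A minimal sketch, assuming centers are stored as keyword-to-weight dictionaries and that a keyword missing from one center counts as weight 0 (an assumption; the example only covers centers with identical keywords):

```python
def merge_centers(c1, c2):
    """Merge two cluster centers by averaging the weight of each keyword.

    ASSUMPTION: a keyword absent from one center is treated as weight 0.
    """
    terms = set(c1) | set(c2)
    return {t: (c1.get(t, 0.0) + c2.get(t, 0.0)) / 2 for t in terms}


center1 = {"A": 3, "B": 4, "C": 2}
center2 = {"A": 2, "B": 4, "C": 3}
merged = merge_centers(center1, center2)
print(sorted(merged.items()))  # [('A', 2.5), ('B', 4.0), ('C', 2.5)]
```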
Step 5 Repeat Steps 2, 3 and 4. Stop when the cluster centers
become stable or the maximum number of iterations is reached;
otherwise return to Step 2 and continue.
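The overall iteration can be sketched as follows. This is not the authors' implementation: the assignment and update steps follow standard k-means with cosine similarity, the "mutation" is assumed to mean that a document dissimilar to every center seeds a new center, and both thresholds (`merge_threshold`, `new_threshold`) are hypothetical parameters. Only the merging of similar centers by averaging is taken directly from Step 4.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {keyword: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Keyword-wise mean of sparse vectors (missing keyword = 0)."""
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}


def vf_cluster_sketch(docs, centers, merge_threshold=0.9,
                      new_threshold=0.1, max_iter=20):
    """Sketch of k-means with a dynamically adjusted k.

    Requires at least one initial center. Thresholds are assumptions.
    """
    centers = [dict(c) for c in centers]  # do not modify the caller's list
    for _ in range(max_iter):
        # Assign each document to its most similar center; a document
        # unlike every center seeds a new one (assumed "mutation").
        groups = [[] for _ in centers]
        for doc in docs:
            sims = [cosine(doc, c) for c in centers]
            best = max(range(len(centers)), key=sims.__getitem__)
            if sims[best] < new_threshold:
                centers.append(dict(doc))
                groups.append([doc])
            else:
                groups[best].append(doc)
        # Recompute the center of every non-empty cluster.
        new_centers = [mean_vector(g) for g in groups if g]
        # Merge centers that are too similar ("crossover", Step 4).
        merged = []
        for c in new_centers:
            for i, m in enumerate(merged):
                if cosine(c, m) > merge_threshold:
                    merged[i] = mean_vector([c, m])
                    break
            else:
                merged.append(c)
        if merged == centers:  # centers are stable
            break
        centers = merged
    return centers


docs = [
    {"A": 3, "B": 4, "C": 2}, {"A": 2, "B": 4, "C": 3},
    {"X": 5, "Y": 1}, {"X": 4, "Y": 2},
]
# Deliberately poor start: two centers in one topic, none in the other.
result = vf_cluster_sketch(docs, centers=docs[:2])
```

With this deliberately poor initialization, the number of centers grows from 2 to 3 when the first unrelated document seeds a new center, then drops back to 2 once the two near-identical centers are merged, which illustrates how k is adjusted during clustering rather than fixed in advance.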
IV. IMPLEMENTATION OF VF-CLUSTERING IN CHINESE
SCIENTIFIC DOCUMENT CLUSTERING
A. Evaluation of Clustering Results
There is still no uniform standard for evaluating document
clustering results. However, the precision rate and the recall
rate, which reflect two different aspects of clustering
quality, must be considered together. Since the F1 value
combines the two, we use the most common measures, precision
rate, recall rate and F1 value, to evaluate the document
clustering.
Each manually labeled theme $T_i$ in the data set corresponds
to a clustering result set $C_i$. The three measures are
$P_i = \frac{|T_i \cap C_i|}{|C_i|}$  (6)
$R_i = \frac{|T_i \cap C_i|}{|T_i|}$  (7)
$F1_i = \frac{2 P_i R_i}{P_i + R_i}$  (8)
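The three measures can be computed directly from the set of documents labeled with a theme and the set of documents placed in the corresponding cluster. In the sketch below, the function name and the document ids are hypothetical; the counts (47 labeled documents, a cluster of 43 documents of which 36 are correct) are an assumption chosen because they reproduce the CA values reported for k-means in TABLE I.

```python
def precision_recall_f1(theme_docs, cluster_docs):
    """Precision, recall and F1 of one cluster against one labeled theme."""
    theme_docs, cluster_docs = set(theme_docs), set(cluster_docs)
    correct = len(theme_docs & cluster_docs)
    precision = correct / len(cluster_docs) if cluster_docs else 0.0
    recall = correct / len(theme_docs) if theme_docs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical document ids: 47 documents labeled with the theme,
# cluster of 43 documents, 36 of which carry the theme label.
theme = set(range(47))
cluster = set(range(36)) | set(range(100, 107))
p, r, f1 = precision_recall_f1(theme, cluster)
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")  # P=83.72% R=76.60% F1=80.00%
```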
B. Experiment and Result Analysis
The text data set consists of 195 Chinese scientific and
technical articles from CNKI: 47 articles on Clustering
Algorithm (CA), 58 on Data Mining (DM), 43 on Cloud Computing
(CC) and 47 on Genetic Algorithm (GA). In preprocessing, the
five visual features of each document are extracted separately
and saved to a database table. Part of the original
experimental data is shown in Fig. 1.
Figure 1. Part of the original experimental data
where label is the manually assigned class and nl is the
clustering result.
In the first step of the experiment, word segmentation is
performed on each of the five visual features independently,
stop words are removed and clustering keywords are extracted;
then the documents are clustered using each visual feature
alone as the document representation. The results are shown in
TABLE I, where the k-means column uses the basic clustering
algorithm with the body representing the documents, and all
other columns use the improved algorithm.
TABLE I. RESULTS OF CLUSTERING BY FIVE VISUAL FEATURES
Theme  Measure  k-means (%)  B (%)   A (%)   S (%)   K (%)    T (%)
CA     R        76.60        76.60   74.47   76.60   55.32    53.19
CA     P        83.72        83.72   53.03   85.71   92.86    48.93
CA     F1       80.00        80.00   61.95   80.90   69.33    50.97
DM     R        93.10        93.10   91.38   93.10   86.25    87.93
DM     P        90.00        90.00   82.81   88.52   84.85    82.26
DM     F1       91.53        91.53   86.89   90.76   85.54    85.00
CC     R        93.02        93.02   90.69   90.70   95.35    93.02
CC     P        90.69        90.69   88.38   86.04   97.62    78.43
CC     F1       91.84        91.84   89.50   88.30   96.47    85.11
GA     R        93.62        93.62   40.43   89.36   100.00   91.79
GA     P        84.62        84.62   95.00   86.27   79.66    58.11
GA     F1       88.89        88.89   56.72   87.79   88.68    71.07
From the results of the first step of the experiment, we can
conclude the following:
1) Because the value of k is equal to 4, the basic algorithm
and the improved algorithm give the same clustering results
when the body alone represents the documents. However, the
running time is reduced with the improved algorithm.
2) Body and subtitle give the best clustering performance when
a single visual feature represents the documents, and both are
stable across the four themes in the data set. Moreover, body
gives slightly better clustering results than subtitle.
3) Keyword clusters better than abstract and title; moreover,
abstract and title give unstable clustering results when used
alone to represent the documents. For these three visual
features, clustering works well for a new subject or a subject
with few applications, but relatively poorly for a subject
with extensive applications.
4) Title gives poor clustering results for subjects with
extensive applications.
Based on the results of the first step, which analyzed
clustering with each visual feature representing the documents
independently, different combinations of visual features are
used to represent the documents for clustering. The results
are shown in TABLE II.
The analysis of the results of the second step of the
experiment is summarized as follows:
1) Taking TABLE I and TABLE II together, it is clear that
clustering with combined visual features gives better results
than clustering with any single visual feature.
TABLE II. RESULTS OF CLUSTERING BY DIFFERENT COMBINATION
Theme  Measure  S, K (%)  B, S (%)  B, S, K (%)  B, S, K, A (%)  B, S, A, K, T (%)
CA     R        80.85     85.11     95.74        95.74           95.74
CA     P        90.48     93.02     91.84        90.00           93.75
CA     F1       85.39     88.89     93.75        92.78           94.74
DM     R        94.83     98.28     93.10        93.10           94.83
DM     P        96.49     90.48     100.00       96.43           98.21
DM     F1       95.65     94.21     96.43        94.74           96.49
CC     R        97.67     100.00    100.00       97.67           97.67
CC     P        97.67     100.00    100.00       100.00          100.00
CC     F1       97.67     100.00    100.00       98.82           98.82
GA     R        95.74     93.62     100.00       95.74           100.00
GA     P        84.91     95.65     95.92        95.74           95.92
GA     F1       90.00     94.62     97.92        95.74           97.92
2) Clustering on the combination of subtitle and keyword is
slightly better than clustering on body alone. In addition,
the number of effective characters in subtitle and keyword is
far smaller than in the body, which greatly improves the
efficiency of feature word selection during word segmentation.
This combination can be used when both clustering quality and
efficiency are required.
3) The combination of body, subtitle and keyword brings
together the single features that individually give the best
clustering results, and its clustering results are almost the
same as those obtained by combining all five visual features.
These two combinations give the best clustering results,
although their efficiency is lower. When higher clustering
quality is required, body, subtitle and keyword can be
combined to represent a document.
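One way to compare the representations at a glance is to average the F1 values over the four themes. The sketch below does this with the F1 rows of TABLE I and TABLE II; the unweighted (macro) average is an assumption introduced here for illustration and is not a measure used in the experiments.

```python
# F1 values (%) per theme in the order CA, DM, CC, GA, copied from
# TABLE I (single features) and TABLE II (combinations).
f1_single = {
    "B": [80.00, 91.53, 91.84, 88.89],
    "A": [61.95, 86.89, 89.50, 56.72],
    "S": [80.90, 90.76, 88.30, 87.79],
    "K": [69.33, 85.54, 96.47, 88.68],
    "T": [50.97, 85.00, 85.11, 71.07],
}
f1_combined = {
    "S,K": [85.39, 95.65, 97.67, 90.00],
    "B,S": [88.89, 94.21, 100.00, 94.62],
    "B,S,K": [93.75, 96.43, 100.00, 97.92],
    "B,S,K,A": [92.78, 94.74, 98.82, 95.74],
    "B,S,A,K,T": [94.74, 96.49, 98.82, 97.92],
}


def macro_average(scores):
    """Unweighted mean over themes (an illustrative summary only)."""
    return sum(scores) / len(scores)


averages = {name: macro_average(scores)
            for name, scores in {**f1_single, **f1_combined}.items()}
for name, value in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {value:6.2f}")
```

Under this summary, body alone averages roughly 88, subtitle plus keyword roughly 92, and both body, subtitle and keyword together and all five features roughly 97, which is consistent with the three observations above.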
V. CONCLUSION
This paper implements a method for clustering scientific
documents based on visual features (VF-Clustering). A detailed
analysis of the clustering results yields the following
findings:
1) Among the five visual features, body gives the best
accuracy and stability when used alone to represent the
documents, and subtitle is next. However, the clustering
results of abstract, keyword and title are not very good,
especially for widely applied fields of knowledge.
2) The accuracy of clustering by combining subtitle and
keyword is better than that of either alone. Moreover, running
time is greatly reduced because these two parts contain far
fewer effective characters. If efficiency is an essential
factor, clustering by combining subtitle and keyword can be an
optimal choice.
3) If higher accuracy is required, clustering by combining
body, subtitle and keyword is a better choice.
This paper also uses the ideas of crossover and mutation from
genetic algorithms to improve the k-means algorithm, greatly
improving efficiency by adjusting the values of k and the
cluster centers dynamically during clustering.
ACKNOWLEDGMENT
This work is supported by Shanghai Leading Academic
Discipline Project (J50103) and Innovation Program of
Shanghai Municipal Education Commission (11ZZ85).
REFERENCES
[1] S. Guha, R. Rastogi, and K. Shim, An efficient clustering algorithm
for large databases, ACM SIGMOD international conference on
Management of data, Volume 27 Issue 2, June 1998.
[2] A. Likas, N. Vlassis, and J. J. Verbeek, The global k-means
clustering algorithm, Pattern Recognition, 2003, pp. 451-461.
[3] J. A. Hartigan, and M. A. Wong, A K-Means Clustering Algorithm,
Journal of the Royal Statistical Society, Series C (Applied Statistics),
Vol. 28, No. 1, 1979, pp. 100-108.
[4] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, Constrained K-
means Clustering with Background Knowledge, Proceedings of the
Eighteenth International Conference on Machine Learning, 2001, pp.
577-584.
[5] R. Dutta, I. Ghosh, A. Kundu, and D. Mukhopadhyay, An Advanced
Partitioning Approach of Web Page Clustering utilizing Content &
Link Structure, Journal of Convergence Information Technology,
Volume 4, Number 3, 2009.
[6] J. L. Neto, A. D. Santos, C. A. A. Kaestner, and A. A. Freitas,
Document Clustering and Text Summarization, Information
Processing and Management, 2000.
[7] J. M. Pena, J. A. Lozano, and P. Larranaga, An empirical comparison
of four initialization methods for the K-Means algorithm, Pattern
Recognition Letters, 1999, pp. 1027-1040.
[8] L. Yanjun, M. Chung, and D. Holt, Text document clustering based
on frequent word meaning sequences, Data & Knowledge
Engineering, 2008, pp. 381-404.
[9] E. Rasmussen, Clustering algorithms, in Information Retrieval:
Data Structures and Algorithms, Prentice Hall, Englewood Cliffs,
1992, pp. 419-442.
[10] A. K. Jain, and M. N. Murty, Data Clustering: A Review, ACM
Computing Surveys (CSUR), 1999, pp. 264-323.
[11] W. B. Cavnar, Using An N-Gram-Based Document Representation
With A Vector Processing Retrieval Model, Proc. of TREC-3 (Third
Text REtrieval Conference), Gaithersburg, 1994.
[12] G. Salton, and C. Buckley, Term-weighting approaches in automatic
text retrieval, Information Processing and Management 24, 513-523,
1988. Reprinted in: Sparck Jones, K. and Willet, P. Eds. Readings in
Information Retrieval, 1997, pp. 323-328.
[13] N. Grira, M. Crucianu, and N. Boujemaa, Unsupervised and semi-
supervised clustering: a brief survey, 7th ACM SIGMM
international workshop on Multimedia information retrieval, 2005,
pp. 9-16.
[14] U. Maulik, and S. Bandyopadhyay, Genetic algorithm-based
clustering technique, Pattern Recognition, 2000, pp. 1455-1465.
[15] K. Krishna, and M. Narasimha Murty, Genetic K-Means Algorithm,
Item Identifier S, 1999, pp. 1083-4419.
[16] J. F. Navarro, C. S. Frenk, and S. D. M. White, A universal density
profile from hierarchical clustering, The Astrophysical Journal, 1997,
pp. 490-493.