International Journal of Computer Trends and Technology (IJCTT) volume 10 number 2 Apr 2014

ISSN: 2231-2803 http://www.ijcttjournal.org Page61



Survey on Clustering Algorithms for Sentence Level Text
Saranya.J M.Phil (1), Arunpriya.C M.Sc, M.Phil (2)
(1) Research Scholar, Department of Computer Applications, PSGR Krishnammal College For Women, Coimbatore
(2) Assistant Professor, Department of Computer Science, PSGR Krishnammal College For Women, Coimbatore
Abstract: Clustering is an extensively studied data
mining problem in the text domain. It finds
numerous applications in customer segmentation,
classification, collaborative filtering,
visualization, document organization, and
indexing. In text mining, sentence clustering is
used within more general text mining tasks, and
several methods and algorithms have been proposed
for clustering documents at the sentence level.
This article surveys sentence-level clustering
algorithms. Its main goal is to present an
overview of these techniques so that an efficient
scheme for clustering sentence-level text can be
identified, or a new technique proposed that
overcomes the problems of the existing approaches.
The survey is intended to make the main ideas
easily accessible to non-experts.
Keywords: sentence-level clustering, fuzzy
relational clustering, sentence similarity, ranking
and clustering of sentences, median fuzzy c-means
clustering
I. INTRODUCTION
Sentence clustering plays an important role in many
text processing activities. For instance,
various authors have argued that incorporating
sentence clustering into extractive multidocument
summarization helps avoid problems of content
overlap, leading to better coverage [1], [2], [3], [4].
Sentence clustering can also be used within more
general text mining tasks. For instance, consider
web mining [5], where the specific objective might
be to discover some novel information from a set of
documents initially retrieved in response to some
query. By clustering the sentences of those documents
we would intuitively expect at least one of the
clusters to be closely related to the concepts
described by the query terms; however,
other clusters may contain information pertaining to
the query in some way hitherto unknown to us, and in
such a case we would have successfully mined new
information. Irrespective of the specific task (e.g.,
summarization, text mining, etc.), most documents
will contain interrelated topics or themes, and many
sentences will be related to some degree to a number
of these. Nevertheless, clustering text at the sentence
level poses specific challenges not present when
clustering larger segments of text, such as documents.
We now highlight some important differences
between clustering at these two levels, and examine
some existing approaches to fuzzy clustering.
Clustering text at the document level is well
established in the Information Retrieval (IR)
literature, where documents are typically represented
as data points in a high-dimensional vector space in
which each dimension corresponds to a unique
keyword [6], leading to a rectangular representation
in which rows represent documents and columns
represent attributes of those documents (e.g., tf-idf
values of the keywords). This type of data, which we
refer to as attribute data, is amenable to clustering
by a large range of algorithms. Since data points lie
in a metric space, we can readily apply prototype-
based algorithms such as k-Means [7], Isodata [8],
Fuzzy c-Means (FCM) [9], [10] and the closely
related mixture model approach [11], all of which
represent clusters in terms of parameters such as
means and covariances, and therefore assume a
common metric input space. Since pairwise
similarities or dissimilarities between data points can
readily be calculated from the attribute data using
similarity measures such as cosine similarity, we can
also apply relational clustering algorithms such as
Spectral Clustering [12] and Affinity Propagation
[13], which take input data in the form of a square
matrix W = {w_ij} (often referred to as the affinity
matrix), where w_ij is the (pairwise) relationship
between the ith and jth data objects.
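As a rough illustration of the pipeline just described, the following sketch builds tf-idf attribute vectors for a few toy texts and derives the square affinity matrix W from pairwise cosine similarities. The whitespace tokenizer, the smoothed idf formula, the example texts, and the function names are all assumptions made for this illustration; they are not taken from the surveyed papers.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumed variant) keeps every weight strictly positive.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def affinity_matrix(docs):
    """Square matrix W where W[i][j] is the cosine similarity of docs i and j."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]

docs = [
    "fuzzy clustering of sentences",
    "clustering of documents",
    "web mining survey",
]
W = affinity_matrix(docs)
```

The rows of the tf-idf table are the attribute data; W is what a relational algorithm such as Spectral Clustering or Affinity Propagation would take as input.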
To distinguish it from attribute data, we refer to
such data as relational data. A broad range of
hierarchical clustering algorithms [14] can also be
applied. The vector space model has been successful
in IR because it is able to effectively capture much of
the semantic content of document-level text. This is
because documents that are semantically related are
likely to include lots of words in common, and
consequently are found to be similar according to
popular vector space measures such as cosine
similarity, which are based on word co-occurrence
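[15]. Conversely, while the assumption that
(semantic) similarity can be measured in terms of
word co-occurrence may be valid at the document
level, it is less reliable for short text fragments such
as sentences, which may be semantically related
while having few words in common.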
II. CLUSTERING TECHNIQUES
A. Fuzzy relational clustering algorithm based on the
fuzzy C-means algorithm
Corsini et al. [16] show how one can take advantage
of the stability and effectiveness of object data
clustering algorithms when the data to be clustered
are available only as mutual numerical relationships
between pairs of objects. More specifically, they
propose a new fuzzy relational algorithm, based on
the popular fuzzy C-means (FCM) algorithm, which
does not require any particular restriction on the
relation matrix. They apply the algorithm to four real
and four synthetic data sets and report that it
performs better than well-known fuzzy relational
clustering algorithms on all of these sets.
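For orientation, the standard FCM formulation on which such relational variants build can be written as follows. This is the textbook form for vector data (see [10]), not the relational algorithm of [16]; the symbols N (objects), C (clusters), m > 1 (fuzzifier), x_k (data points), v_i (prototypes) and u_ik (memberships) are notation assumed here.

```latex
% Objective minimised by fuzzy c-means, subject to memberships summing to one
J_m = \sum_{i=1}^{C} \sum_{k=1}^{N} u_{ik}^{m} \, \lVert x_k - v_i \rVert^{2},
\qquad \sum_{i=1}^{C} u_{ik} = 1 \quad \text{for all } k .

% Alternating updates for memberships and prototypes
u_{ik} = \left[ \sum_{j=1}^{C}
  \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{\frac{2}{m-1}}
  \right]^{-1},
\qquad
v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, x_k}{\sum_{k=1}^{N} u_{ik}^{m}} .
```

The prototype update requires averaging data points, which is why plain FCM presupposes a vector space and cannot be applied directly to a matrix of pairwise relations.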
B. Novel Fuzzy Relational Clustering Algorithm
In contrast to hard clustering methods, in which
a pattern belongs to a single cluster, fuzzy clustering
algorithms allow patterns to belong to all clusters
with differing degrees of membership. This is
important in domains such as sentence clustering,
since a sentence is likely to be related to more than
one theme or topic present within a document or set
of documents. However, because most sentence
similarity measures do not represent sentences in a
common metric space, conventional fuzzy clustering
approaches based on prototypes or mixtures of
Gaussians are generally not applicable to sentence
clustering. Skabar and Abdalgader [17] present a
novel fuzzy clustering algorithm called FRECCA that
operates on relational input data, i.e., data in the
form of a square matrix of pairwise similarities
between data objects. The algorithm uses a graph
representation of the data and operates in an
Expectation-Maximization framework in which the
graph centrality of an object is interpreted as a
likelihood [17].
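The graph-centrality ingredient can be illustrated with a small power-iteration PageRank over a similarity matrix. PageRank is used here as one common centrality measure; the damping factor of 0.85, the toy 3x3 matrix, and the function name are assumptions for illustration. This is not the FRECCA algorithm itself, which additionally embeds such scores in an Expectation-Maximization loop over fuzzy cluster memberships.

```python
def pagerank(W, damping=0.85, iterations=100):
    """Weighted PageRank scores for a square similarity matrix W (self-loops ignored)."""
    n = len(W)
    # Total outgoing similarity of each node, excluding the diagonal.
    out_weight = [sum(W[j][i] for i in range(n) if i != j) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j == i:
                    continue
                if out_weight[j] > 0:
                    # Node j passes on its score in proportion to edge weight.
                    rank += W[j][i] / out_weight[j] * scores[j]
                else:
                    # A node with no outgoing weight spreads its score evenly.
                    rank += scores[j] / (n - 1)
            new_scores.append((1.0 - damping) / n + damping * rank)
        scores = new_scores
    return scores

# Toy similarity matrix: sentences 0 and 1 are close, sentence 2 is peripheral.
W = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
centrality = pagerank(W)
```

The scores sum to one, so they can be read as a distribution over sentences; the most strongly connected sentence receives the highest score.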

C. Clustering Using Parts-of-Speech
Clustering algorithms are used in many Natural
Language Processing (NLP) tasks. They have proven
to be popular and effective tools for discovering
groups of similar linguistic items. In an exploratory
paper, Khoury [18] proposes a new clustering
algorithm that automatically clusters together similar
sentences based on the sentences' part-of-speech
syntax. The algorithm generates and merges clusters
using a syntactic similarity metric based on a
hierarchical organization of the parts of speech. The
author demonstrates the features of the algorithm by
implementing it in a question type classification
system, in order to determine the positive or negative
impact of different changes to the algorithm.
D. Embedded graph based sentence clustering
Zhang et al. [19] propose a document summarization
framework for storytelling that extracts essential
sentences from a document by exploiting the mutual
effects between terms, sentences and clusters.
There are three phases in the framework: document
modeling, sentence clustering and sentence ranking.
The story document is modeled by a weighted graph
whose vertices represent the sentences of the
document. The sentences are clustered into different
groups to find the latent topics in the story. To
alleviate the influence of unrelated sentences in
clustering, an embedding process is employed to
optimize the document model. The sentences are then
ranked according to the mutual effect between terms,
sentences and clusters, and high-ranked sentences
are selected to form the summary of the document.
The experimental results on the Document
Understanding Conference (DUC) data sets
demonstrate the effectiveness of the proposed
method in document summarization. The results also
show that the embedding process for sentence
clustering renders the system more robust with
respect to different cluster numbers.
E. Fuzzy-based sentence-level document clustering
Contradiction analysis is a popular text-mining
operation in which a document whose content
contradicts the theme of a set of documents is
identified [20]. It is a means of identifying outlier
documents that do not conform to the overall sense
conveyed by the other documents. Most of the
existing techniques perform document-level
comparisons and ignore sentence-level semantics,
often leading to loss of vital information.
Applications in domains like Defense and Healthcare
require high levels of accuracy, and identifying
micro-level contradictions is vital. Mehta et al.
propose an algorithm for identifying contradictory
documents using a sentence-level clustering technique
along with an optimization feature. A visualization
scheme is also suggested to present the results to an
end-user.
F. Sentence-level event classification
Correctly classifying sentences that describe events
is an important task for many natural language
applications such as Question Answering (QA) and
Text Summarization. Naughton et al. [21] treat
event detection as a sentence-level text
classification problem. They compare the performance
of discriminative versus generative approaches to this
task: namely, a Support Vector Machine (SVM)
classifier versus a Language Modeling (LM)
approach. They also investigate a rule-based method
that uses handcrafted lists of trigger terms derived
from WordNet. Two datasets are used in the
experiments to test each approach on six different
event types, i.e., Die, Attack, Injure, Meet, Transport
and Charge-Indict.
G. Sentence Similarity Based on Semantic Nets
Li et al. [22] observe that sentence similarity
measures play an increasingly important role in
text-related research and applications in areas such
as text mining, Web page retrieval, and dialogue
systems. Existing methods for computing sentence
similarity have been adopted from approaches used
for long text documents. These methods process
sentences in a very high-dimensional space and are
consequently inefficient, require human input, and
are not adaptable to some application domains. The
paper focuses directly on computing the similarity
between very short texts of sentence length. It
presents an algorithm that takes account of semantic
information and word order information implied in
the sentences. The semantic similarity of two
sentences is calculated using information from a
structured lexical database and from corpus
statistics. The use of a lexical database enables the
method to model human common sense knowledge,
and the incorporation of corpus statistics allows the
method to be adapted to different domains. The
proposed method can be used in a variety of
applications that involve text knowledge
representation and discovery.
The proposed method derives text similarity from
semantic and syntactic information contained in the
compared texts. A text is considered to be a sequence
of words each of which carries useful information.
The words, along with their combination structure,
make a text convey a specific meaning. Texts
considered in this paper are assumed to be of
sentence length. Unlike existing methods that use a
fixed vocabulary, the proposed method dynamically
forms a joint word set using only the distinct
words in the pair of sentences. For each
sentence, a raw semantic vector is derived with the
assistance of a lexical database. A word order vector
is formed for each sentence, again using information
from the lexical database. Since each word in a
sentence contributes differently to the meaning of the
whole sentence, the significance of a word is
weighted by using information content derived from
a corpus. By combining the raw semantic vector with
information content from the corpus, a semantic
vector is obtained for each of the two sentences.
Semantic similarity is computed based on the two
semantic vectors. An order similarity is calculated
using the two order vectors. Finally, the sentence
similarity is derived by combining semantic
similarity and order similarity.
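The steps above can be sketched as follows. The structure (joint word set, semantic vectors, word order vectors, weighted combination) follows the description; everything else is an assumption for illustration: `word_sim` is a hypothetical stand-in that only recognises exact matches instead of querying a lexical database, the information-content weighting from corpus statistics is omitted, and the thresholds and the weight `delta` are illustrative values.

```python
import math

def word_sim(a, b):
    # Hypothetical stand-in for a lexical-database similarity: exact match only.
    return 1.0 if a == b else 0.0

def semantic_vector(tokens, joint, threshold=0.2):
    """For each joint word, the best similarity to any word of the sentence."""
    vec = []
    for w in joint:
        best = max(word_sim(w, t) for t in tokens)
        vec.append(best if best >= threshold else 0.0)
    return vec

def order_vector(tokens, joint, threshold=0.4):
    """For each joint word, the 1-based position of its best match (0 if none)."""
    vec = []
    for w in joint:
        sims = [word_sim(w, t) for t in tokens]
        best = max(sims)
        vec.append(sims.index(best) + 1 if best >= threshold else 0)
    return vec

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sentence_similarity(s1, s2, delta=0.85):
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    # Joint word set: distinct words of both sentences, in first-seen order.
    joint = list(dict.fromkeys(t1 + t2))
    v1, v2 = semantic_vector(t1, joint), semantic_vector(t2, joint)
    r1, r2 = order_vector(t1, joint), order_vector(t2, joint)
    # Semantic similarity: cosine of the two semantic vectors.
    denom = norm(v1) * norm(v2)
    semantic = sum(a * b for a, b in zip(v1, v2)) / denom if denom else 0.0
    # Order similarity: normalised difference of the two order vectors.
    total = norm([a + b for a, b in zip(r1, r2)])
    order = 1.0 - norm([a - b for a, b in zip(r1, r2)]) / total if total else 0.0
    return delta * semantic + (1.0 - delta) * order

same = sentence_similarity("a dog bites a man", "a dog bites a man")
swapped = sentence_similarity("a dog bites a man", "a man bites a dog")
unrelated = sentence_similarity("cats sleep", "dogs run")
```

With the exact-match stand-in, the swapped sentence pair has identical semantic vectors, so only the word order term separates it from the identical pair. Replacing `word_sim` with a lexical-database measure would let related but different words contribute as well.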
H. Sentence Clustering using Similarity Word-
Sequence Kernels
Andrés-Ferrer et al. [23] present a novel clustering
approach based on the use of kernels as similarity
functions and the C-means algorithm. Several word-
sequence kernels are defined and extended to verify
the properties of similarity functions. Afterwards,
these monolingual word-sequence kernels are
extended to bilingual word-sequence kernels, and
applied to the task of monolingual and bilingual
sentence clustering. The motivation of this proposal
is to group similar sentences into clusters so that
specialized models can be trained for each cluster,
thereby reducing both the size and the complexity of
the initial task. The authors also provide empirical
evidence that the use of bilingual kernels can lead to
better clusters, in terms of intra-cluster perplexities.
I. Simultaneous ranking and clustering of sentences
Multi-document summarization aims to produce a
concise summary that contains salient information
from a set of source documents. In this field, sentence
ranking has hitherto been the issue of most concern
[24]. Since documents often cover a number of topic
themes with each theme represented by a cluster of
highly related sentences, sentence clustering was
recently explored in the literature in order to provide
more informative summaries. Existing cluster-based
ranking approaches apply clustering and ranking in
isolation. As a result, the ranking performance is
inevitably influenced by the clustering result. The
authors propose a reinforcement approach that
tightly integrates ranking and clustering by mutually
and simultaneously updating each other so that the
performance of both can be improved. Experimental
results on the DUC datasets demonstrate its
effectiveness and robustness.
J. Median Fuzzy C-Means Clustering
Median clustering is a powerful methodology for
prototype-based clustering of similarity/dissimilarity
data [25]. Geweniger et al. combine the median c-
means algorithm with the fuzzy c-means approach,
which in its original variant is applicable only to
vectorial (metric) data. For the resulting median
fuzzy c-means approach, they prove convergence and
investigate the behavior of the algorithm in several
experiments, including real world data from
psychotherapy research.
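A minimal sketch of the idea, assuming the common formulation in which prototypes are restricted to data objects and the cost is the membership-weighted sum of dissimilarities to the prototypes. The function names, the fuzzifier m = 2, the initial prototypes, and the toy data are assumptions for illustration; the details of the algorithm in [25] may differ.

```python
def memberships(D, prototypes, m=2.0):
    """Fuzzy memberships of every object, given prototype indices into D."""
    U = []
    for k in range(len(D)):
        dists = [D[k][p] for p in prototypes]
        zeros = [i for i, d in enumerate(dists) if d == 0.0]
        if zeros:
            # Object coincides with a prototype: full membership goes there.
            row = [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(dists))]
        else:
            # The dissimilarity is used unsquared, hence the exponent 1/(m-1).
            exponent = 1.0 / (m - 1.0)
            row = [1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists]
        U.append(row)
    return U

def update_prototypes(D, U, m=2.0):
    """Median step: each prototype becomes the data object with the lowest cost."""
    n, c = len(D), len(U[0])

    def cost(i, candidate):
        return sum((U[k][i] ** m) * D[k][candidate] for k in range(n))

    return [min(range(n), key=lambda cand, i=i: cost(i, cand)) for i in range(c)]

def median_fuzzy_c_means(D, prototypes, m=2.0, max_iter=50):
    prototypes = list(prototypes)
    for _ in range(max_iter):
        U = memberships(D, prototypes, m)
        updated = update_prototypes(D, U, m)
        if updated == prototypes:
            break
        prototypes = updated
    return prototypes, memberships(D, prototypes, m)

# Toy dissimilarity matrix from two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(a - b) for b in points] for a in points]
prototypes, U = median_fuzzy_c_means(D, prototypes=[0, 3])
```

Only the matrix D is consulted, never the coordinates, which is what makes the approach usable when the data are available solely as pairwise dissimilarities.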
K. Correlation Similarity for Document Clustering
Document clustering aims to automatically group
related documents into clusters. It is one of the most
important tasks in machine learning and artificial
intelligence and has received much attention in recent
years. Based on various distance measures, a number
of methods have been proposed to handle document
clustering. A typical and widely used distance
measure is the Euclidean distance. The k-means
method is one of the methods that use the Euclidean
distance: it minimizes the sum of the squared
Euclidean distances between the data points and their
corresponding cluster centers. Since the document
space is always of high dimensionality, it is
preferable to find a low-dimensional representation
of the documents to reduce computation complexity.
Zhang et al. [26] propose a new document clustering
method based on correlation preserving indexing
(CPI), which explicitly considers the manifold
structure embedded in the similarities between the
documents. It aims to find an optimal semantic
subspace by simultaneously maximizing the
correlations between the documents in the local
patches and minimizing the correlations between the
documents outside these patches. Unlike LSI and LPI,
which are based on a dissimilarity measure (Euclidean
distance) and focus on detecting the intrinsic
structure between widely separated documents, the
similarity-measure-based CPI method focuses on
detecting the intrinsic structure between nearby
documents. Since the intrinsic semantic structure of
the document space is often embedded in the
similarities between the documents, CPI can
effectively detect the intrinsic semantic structure of
the high-dimensional document space. In this respect,
it is similar to Latent Dirichlet Allocation (LDA),
which attempts to capture significant intra-document
statistical structure (intrinsic semantic structure
embedded in the similarities between the documents)
via the mixture distribution model.
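The k-means criterion mentioned in this subsection can be stated compactly. The notation is assumed here: K clusters C_1, ..., C_K, document vectors x_i, and cluster centers mu_j.

```latex
% Sum of squared Euclidean distances to the assigned cluster centers
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

Every term is a Euclidean distance in the full document space, which is the dependence that motivates looking for a low-dimensional representation first.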
L. Fuzzy Approach for Multitype Relational Data
Clustering
Generally, a pairwise relation can be described by
similarities or dissimilarities between each pair of
objects in the given dataset. Mei and Chen [27]
consider only similarity-type relations, in which a
larger value of the relationship between two objects
means that the two objects are more similar.
Coclustering is treated as bipartite graph partitioning
by calculating the singular value decomposition of
the data matrix. A bipartite graph consists of two
types of nodes, and edges only exist between nodes
of different types. Since homogeneous relational data
correspond to a graph that consists of nodes of the
same type, the data matrix, such as the document
term matrix that corresponds to a bipartite graph, can
be treated as bitype heterogeneous relational data,
i.e., relation between two different object types.
Multitype relational data may form various
structures, depending on the availability of relations.
A star-structure is a special case where relations only
exist between the central type and several attribute
types. It is possible to transform multitype
relational data into one of the basic data
representation forms and then use an existing
approach to cluster the objects of the type of
interest. However, useful information may be lost
during data transformation. Moreover, clustering
each type of object individually loses the chance
of mutual improvement among clusters of different
object types and cannot capture the interrelated
patterns among different types, which may be of
interest in some data-mining applications.
III. CONCLUSION
Clustering, one of the conventional data mining
strategies, is an unsupervised learning technique.
Clustering methods try to recognize intrinsic
groupings of the text documents, so that a set of
clusters is formed in which clusters display high
intra-cluster similarity and low inter-cluster
similarity. Normally, text document clustering tries
to separate the documents into groups where every
group represents some topic that is different from
the topics represented by the other groups. In this
article, a survey of sentence-level clustering
algorithms for text data is presented. A good
clustering of text requires effective feature
selection and a proper choice of algorithm for the
task at hand. Several algorithms that address these
problems are discussed in detail. Neuro-fuzzy
clustering approaches may be used to improve the
overall performance of the clustering approaches.
REFERENCES
[1] V. Hatzivassiloglou, J.L. Klavans, M.L. Holcombe, R.
Barzilay, M. Kan, and K.R. McKeown, "SIMFINDER: A Flexible
Clustering Tool for Summarization," Proc. NAACL Workshop
Automatic Summarization, pp. 41-49, 2001.
[2] H. Zha, "Generic Summarization and Keyphrase Extraction
Using Mutual Reinforcement Principle and Sentence Clustering,"
Proc. 25th Ann. Intl ACM SIGIR Conf. Research and
Development in Information Retrieval, pp. 113-120, 2002.
[3] D.R. Radev, H. Jing, M. Stys, and D. Tam, "Centroid-Based
Summarization of Multiple Documents," Information Processing
and Management: An Intl J., vol. 40, pp. 919-938, 2004.
[4] R.M. Aliguyev, "A New Sentence Similarity Measure and
Sentence Based Extractive Technique for Automatic Text
Summarization," Expert Systems with Applications, vol. 36, pp.
7764-7772, 2009.
[5] R. Kosala and H. Blockeel, "Web Mining Research: A
Survey," ACM SIGKDD Explorations Newsletter, vol. 2, no. 1,
pp. 1-15, 2000.
[6] G. Salton, Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer. Addison-
Wesley, 1989.
[7] J.B. MacQueen, "Some Methods for Classification and Analysis
of Multivariate Observations," Proc. Fifth Berkeley Symp. Math.
Statistics and Probability, pp. 281-297, 1967.
[8] G. Ball and D. Hall, "A Clustering Technique for Summarizing
Multivariate Data," Behavioural Science, vol. 12, pp. 153-155,
1967.
[9] J.C. Dunn, "A Fuzzy Relative of the ISODATA Process and its
Use in Detecting Compact Well-Separated Clusters," J.
Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[10] J.C. Bezdek, Pattern Recognition with Fuzzy Objective
Function Algorithms. Plenum Press, 1981.
[11] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification,
second ed. John Wiley & Sons, 2001.
[12] U.V. Luxburg, "A Tutorial on Spectral Clustering," Statistics
and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[13] B.J. Frey and D. Dueck, "Clustering by Passing Messages
between Data Points," Science, vol. 315, pp. 972-976, 2007.
[14] S. Theodoridis and K. Koutroumbas, Pattern Recognition,
fourth ed. Academic Press, 2008.
[15] C.D. Manning, P. Raghavan, and H. Schutze, Introduction to
Information Retrieval. Cambridge Univ. Press, 2008.
[16] P. Corsini, F. Lazzerini, and F. Marcelloni, "A New Fuzzy
Relational Clustering Algorithm Based on the Fuzzy C-Means
Algorithm," Soft Computing, vol. 9, pp. 439-447, 2005.
[17] A. Skabar and K. Abdalgader, "Clustering
Sentence-Level Text Using a Novel Fuzzy Relational Clustering
Algorithm," IEEE Trans. Knowledge and Data Eng.,
vol. 25, no. 1, 2013.
[18] R. Khoury, "Sentence Clustering Using Parts-of-Speech,"
I.J. Information Engineering and Electronic Business, 2012, 1, 1-9.
[19] Z. Zhang, S.S. Ge, and H. He, "Mutual-Reinforcement
Document Summarization Using Embedded Graph Based
Sentence Clustering for Storytelling," Information
Processing and Management: An Intl J., vol. 48, no. 4,
July 2012.
[20] R.V.K. Mehta, B. Sankarasubramaniam, and S.
Rajalakshmi, "An Algorithm for Fuzzy-Based Sentence-Level
Document Clustering for Micro-Level Contradiction Analysis,"
Proc. Intl Conf. Advances in Computing, Communications and
Informatics (ICACCI '12), 2012.
[21] M. Naughton, N. Stokes, and J. Carthy, "Sentence-
Level Event Classification in Unstructured Texts,"
Information Retrieval, vol. 13, no. 2, pp. 132-156,
Apr. 2010.
[22] Y. Li, D. McLean, Z.A. Bandar, J.D. O'Shea, and K.
Crockett, "Sentence Similarity Based on Semantic Nets and
Corpus Statistics," IEEE Trans. Knowledge and Data Eng., vol. 8,
no. 8, pp. 1138-1150, Aug. 2006.
[23] J. Andrés-Ferrer, G. Sanchis-Trilles, and F.
Casacuberta, "Similarity Word-Sequence Kernels for Sentence
Clustering," Proc. 2010 Joint IAPR Intl Conf. Structural,
Syntactic, and Statistical Pattern Recognition
(SSPR&SPR '10), pp. 610-619, 2010.
[24] X. Cai, W. Li, Y. Ouyang, and H. Yan,
"Simultaneous Ranking and Clustering of Sentences: A
Reinforcement Approach to Multi-Document Summarization,"
Proc. 23rd Intl Conf. Computational Linguistics
(COLING '10), pp. 134-142, 2010.
[25] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann,
"Median Fuzzy C-Means for Clustering Dissimilarity Data,"
Neurocomputing, vol. 73, nos. 7-9, pp. 1109-1116, 2010.
[26] T. Zhang, Y.Y. Tang, B. Fang, and Y. Xiang,
"Document Clustering in Correlation Similarity Measure Space,"
IEEE Trans. Knowledge and Data Eng., vol. 24, no. 6,
June 2012.
[27] J.-P. Mei and L. Chen, "A Fuzzy Approach for
Multitype Relational Data Clustering," IEEE Trans.
Fuzzy Systems, vol. 20, no. 2, Apr. 2012.