Survey on Clustering Algorithms for Sentence Level Text

Saranya.J, M.Phil 1; Arunpriya.C, M.Sc, M.Phil 2
1 Research Scholar, Department of Computer Applications, PSGR Krishnammal College For Women, Coimbatore
2 Assistant Professor, Department of Computer Science, PSGR Krishnammal College For Women, Coimbatore

Abstract: Clustering is an extensively studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. Sentence clustering is one of the processes used within general text mining tasks, and several clustering methods and algorithms are used for clustering documents at the sentence level. This article surveys sentence-level clustering algorithms. The main goal of the survey is to present an overview of sentence-level clustering techniques. Reviewing these techniques helps to identify an efficient scheme for clustering sentence-level text, either by selecting a more efficient existing method or by proposing a new technique that overcomes the problems of the existing approaches. The survey is intended to make the main ideas easily accessible to non-experts.

Keywords: sentence level clustering, fuzzy relational clustering, sentence similarity, ranking and clustering of sentences, Median Fuzzy C-Means clustering

I. INTRODUCTION

Sentence clustering plays an important role in many text processing activities. For instance, various authors have argued that incorporating sentence clustering into extractive multidocument summarization helps avoid problems of content overlap, leading to better coverage [1], [2], [3], [4]. Sentence clustering can also be used within more general text mining tasks. Consider, for instance, web mining [5], where the specific objective might be to discover some novel information from a set of documents initially retrieved in response to some query.
By clustering the sentences of those documents we would intuitively expect at least one of the clusters to be closely related to the concepts described by the query terms; however, other clusters may contain information pertaining to the query in some way hitherto unknown to us, and in such a case we would have successfully mined new information.

Irrespective of the specific task (e.g., summarization, text mining, etc.), most documents will contain interrelated topics or themes, and many sentences will be related to some degree to a number of these. Nevertheless, clustering text at the sentence level poses specific challenges not present when clustering larger segments of text, such as documents. We now highlight some important differences between clustering at these two levels, and examine some existing approaches to fuzzy clustering.

Clustering text at the document level is well established in the Information Retrieval (IR) literature, where documents are typically represented as data points in a high-dimensional vector space in which each dimension corresponds to a unique keyword [6], leading to a rectangular representation in which rows represent documents and columns represent attributes of those documents (e.g., tf-idf values of the keywords). This type of data, which we refer to as attribute data, is amenable to clustering by a large range of algorithms. Since data points lie in a metric space, we can readily apply prototype-based algorithms such as k-Means [7], Isodata [8], Fuzzy c-Means (FCM) [9], [10] and the closely related mixture model approach [11], all of which represent clusters in terms of parameters such as means and covariances, and therefore assume a common metric input space.
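The rectangular attribute representation described above can be sketched in a few lines of Python. This is a minimal illustration only: the whitespace tokenizer, the function name `tfidf_matrix`, and the particular weighting (raw term count times the logarithm of inverse document frequency) are assumptions chosen for brevity, not a scheme prescribed by the surveyed work.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] is the tf-idf weight
    of vocabulary[j] in docs[i].

    Assumed weighting: raw term count * log(N / document frequency).
    """
    # Naive whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append([tf[w] * math.log(n_docs / df[w]) for w in vocabulary])
    return vocabulary, rows


if __name__ == "__main__":
    vocab, rows = tfidf_matrix([
        "fuzzy clustering of sentences",
        "clustering of documents",
        "web mining",
    ])
    for row in rows:
        print([round(x, 3) for x in row])
```

Each row is one document and each column one keyword, so any algorithm that expects points in a common metric space can consume the result directly.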
Since pairwise similarities or dissimilarities between data points can readily be calculated from the attribute data using similarity measures such as cosine similarity, we can also apply relational clustering algorithms such as Spectral Clustering [12] and Affinity Propagation [13], which take input data in the form of a square matrix W = {w_ij} (often referred to as the affinity matrix), where w_ij is the (pairwise) relationship between the ith and jth data objects. To distinguish it from attribute data, we refer to such data as relational data. A broad range of hierarchical clustering algorithms [14] can also be applied.

International Journal of Computer Trends and Technology (IJCTT) volume 10 number 2 Apr 2014 ISSN: 2231-2803 http://www.ijcttjournal.org

The vector space model has been successful in IR because it is able to effectively capture much of
the semantic content of document-level text. This is because documents that are semantically related are likely to contain many words in common, and are consequently found to be similar according to popular vector space measures such as cosine similarity, which are based on word co-occurrence [15]. Conversely, while the assumption that (semantic) similarity can be measured in terms of word co-occurrence may be valid at the document level, it is less reliable for short text fragments such as sentences, which may be semantically related while sharing few words.

II. CLUSTERING TECHNIQUES

A. Fuzzy relational clustering algorithm based on the fuzzy C-means algorithm

The authors of [16] showed how one can take advantage of the stability and effectiveness of object data clustering algorithms when the data to be clustered are available in the form of mutual numerical relationships between pairs of objects. More specifically, they propose a new fuzzy relational algorithm, based on the popular fuzzy C-means (FCM) algorithm, which does not require any particular restriction on the relation matrix. They describe the application of the algorithm to four real and four synthetic data sets, and show that it performs better than well-known fuzzy relational clustering algorithms on all of these sets.

B. Novel Fuzzy Relational Clustering Algorithm

In contrast with hard clustering methods, in which a pattern belongs to a single cluster, fuzzy clustering algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering.
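The difference between hard and fuzzy assignment can be made concrete with a short sketch. It uses the standard FCM membership formula, which presupposes distances from a pattern to cluster prototypes in a metric space; that is exactly the requirement that sentence similarity data usually fail to meet. The function names and the fuzzifier default `m = 2.0` are illustrative assumptions.

```python
def fuzzy_memberships(distances, m=2.0):
    """Membership of one pattern in each cluster, given its distance to
    each cluster prototype (standard FCM update, fuzzifier m > 1).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1))
    """
    # A pattern lying exactly on a prototype belongs fully to it
    # (shared equally if it coincides with several prototypes).
    if any(d == 0 for d in distances):
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


def hard_assignment(distances):
    """Index of the single nearest prototype (hard clustering)."""
    return min(range(len(distances)), key=lambda k: distances[k])


if __name__ == "__main__":
    d = [1.0, 3.0]  # hypothetical distances to two prototypes
    print(hard_assignment(d))    # one cluster index only
    print(fuzzy_memberships(d))  # graded memberships that sum to 1
```

A hard method reports only the nearest cluster, whereas the fuzzy memberships retain the information that the pattern is also partly related to the second cluster.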
Andrew Skabar and Khaled Abdalgader [17] presented a novel fuzzy clustering algorithm called FRECCA that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data, and operates in an Expectation-Maximization framework in which the graph centrality of an object is interpreted as a likelihood [17].
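The idea of scoring objects by their centrality in a similarity graph can be sketched as follows. This is not the FRECCA algorithm: it shows only a PageRank-style centrality computed by power iteration over a square similarity matrix, and the summary above does not state which centrality measure FRECCA uses. The damping factor of 0.85, the iteration count, and the example matrix are assumptions.

```python
def pagerank_centrality(similarity, damping=0.85, iterations=100):
    """PageRank-style centrality for a weighted graph given as a square
    matrix of non-negative pairwise similarities.

    Illustrative only: damping and iteration count are assumed values.
    """
    n = len(similarity)
    # Total outgoing weight of each node, used to normalise edge weights.
    out_weight = [sum(row) for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                similarity[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


if __name__ == "__main__":
    # Hypothetical similarities: objects 0-2 are mutually similar,
    # object 3 is only weakly related to the others.
    W = [
        [0.0, 0.9, 0.8, 0.1],
        [0.9, 0.0, 0.7, 0.1],
        [0.8, 0.7, 0.0, 0.1],
        [0.1, 0.1, 0.1, 0.0],
    ]
    print([round(s, 3) for s in pagerank_centrality(W)])
```

In an Expectation-Maximization setting such scores would be recomputed per cluster with membership-weighted similarities; the sketch stops at the single-graph case.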
C. Clustering Using Parts-of-Speech

Clustering algorithms are used in many Natural Language Processing (NLP) tasks. They have proven to be popular and effective tools for discovering groups of similar linguistic items [18]. This exploratory paper proposes a new clustering algorithm that automatically clusters together similar sentences based on the sentences' part-of-speech syntax. The algorithm generates and merges clusters using a syntactic similarity metric based on a hierarchical organization of the parts-of-speech. The authors demonstrate the features of the algorithm by implementing it in a question type classification system, in order to determine the positive or negative impact of different changes to the algorithm.

D. Embedded graph based sentence clustering

In [19], a document summarization framework for storytelling is proposed to extract essential sentences from a document by exploiting the mutual effects between terms, sentences and clusters. There are three phases in the framework: document modeling, sentence clustering and sentence ranking. The story document is modeled by a weighted graph whose vertices represent the sentences of the document. The sentences are clustered into different groups to find the latent topics in the story. To alleviate the influence of unrelated sentences in clustering, an embedding process is employed to optimize the document model. The sentences are then ranked according to the mutual effect between terms, sentences and clusters, and high-ranked sentences are selected to form the summary of the document. The experimental results on the Document Understanding Conference (DUC) data sets demonstrate the effectiveness of the proposed method in document summarization. The results also show that the embedding process for sentence clustering renders the system more robust with respect to different cluster numbers.

E. Fuzzy-based sentence-level document clustering

Contradiction analysis is one of the popular text mining operations, in which a document whose content is contradictory to the theme of a set of documents is identified [20]. It is a means of identifying outlier documents that do not conform to the overall sense conveyed by other documents. Most of the existing techniques perform document-level comparisons, ignoring the sentence-level semantics, often leading to loss of vital information. Applications in domains like Defense and Healthcare
require high levels of accuracy, and the identification of micro-level contradictions is vital. The authors propose an algorithm for identifying contradictory documents using a sentence-level clustering technique along with an optimization feature. A visualization scheme is also suggested to present the results to an end-user.

F. Sentence-level event classification

The ability to correctly classify sentences that describe events is important for many natural language applications such as Question Answering (QA) and Text Summarization. The work in [21] treats event detection as a sentence-level text classification problem. Overall, it compares the performance of discriminative versus generative approaches to this task: namely, a Support Vector Machine (SVM) classifier versus a Language Modeling (LM) approach. It also investigates a rule-based method that uses handcrafted lists of trigger terms derived from WordNet. Two datasets are used in the experiments to test each approach on six different event types, i.e., Die, Attack, Injure, Meet, Transport and Charge-Indict.

G. Sentence Similarity Based on Semantic Nets

According to Yuhua Li et al. [22], sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. The paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics.
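A heavily simplified sketch may help to show how semantic and word order information can be combined into a single sentence similarity score. It is not the method of [22]: exact word matching stands in for the lexical database, corpus-based weighting is omitted, and the weighting parameter `delta`, its default value, and the function names are assumptions made for illustration.

```python
import math


def _norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def sentence_similarity(s1, s2, delta=0.85):
    """Combine a word-overlap score with a word-order score.

    Simplifying assumptions: whitespace tokens, exact word matching
    instead of a lexical database, no corpus statistics, and an
    assumed weighting parameter `delta` in [0, 1].
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    # Distinct words of both sentences, in order of first appearance.
    joint = list(dict.fromkeys(t1 + t2))
    if not joint:
        return 0.0
    # Semantic vectors: 1 if the joint word occurs in the sentence.
    sem1 = [1.0 if w in t1 else 0.0 for w in joint]
    sem2 = [1.0 if w in t2 else 0.0 for w in joint]
    # Order vectors: 1-based position of the word, 0 if absent.
    ord1 = [t1.index(w) + 1 if w in t1 else 0 for w in joint]
    ord2 = [t2.index(w) + 1 if w in t2 else 0 for w in joint]
    denom = _norm(sem1) * _norm(sem2)
    semantic = sum(a * b for a, b in zip(sem1, sem2)) / denom if denom else 0.0
    total = _norm([a + b for a, b in zip(ord1, ord2)])
    diff = _norm([a - b for a, b in zip(ord1, ord2)])
    order = 1.0 - diff / total if total else 0.0
    return delta * semantic + (1.0 - delta) * order


if __name__ == "__main__":
    print(sentence_similarity("dog bites man", "man bites dog"))
    print(sentence_similarity("dog bites man", "dog bites man"))
```

Two sentences with the same words in a different order score below an exact repetition, which is the effect the word order component is meant to capture.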
The use of a lexical database enables the method to model human common sense knowledge, and the incorporation of corpus statistics allows the method to be adapted to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery.

The proposed method derives text similarity from semantic and syntactic information contained in the compared texts. A text is considered to be a sequence of words, each of which carries useful information. The words, along with their combination structure, make a text convey a specific meaning. Texts considered in the paper are assumed to be of sentence length. Unlike existing methods that use a fixed vocabulary, the proposed method dynamically forms a joint word set using only the distinct words in the pair of sentences. For each sentence, a raw semantic vector is derived with the assistance of a lexical database. A word order vector is formed for each sentence, again using information from the lexical database. Since each word in a sentence contributes differently to the meaning of the whole sentence, the significance of a word is weighted by using information content derived from a corpus. By combining the raw semantic vector with information content from the corpus, a semantic vector is obtained for each of the two sentences. Semantic similarity is computed from the two semantic vectors, and an order similarity is calculated from the two order vectors. Finally, the sentence similarity is derived by combining semantic similarity and order similarity.

H. Sentence Clustering using Similarity Word-Sequence Kernels

The authors of [23] present a novel clustering approach based on the use of kernels as similarity functions and the C-means algorithm. Several word-sequence kernels are defined and extended to verify the properties of similarity functions.
Afterwards, these monolingual word-sequence kernels are extended to bilingual word-sequence kernels, and applied to the task of monolingual and bilingual sentence clustering. The motivation of the proposal is to group similar sentences into clusters so that specialized models can be trained for each cluster, thereby reducing both the size and the complexity of the initial task. The authors also provide empirical evidence that the use of bilingual kernels can lead to better clusters, in terms of intra-cluster perplexities.

I. Simultaneous ranking and clustering of sentences

Multi-document summarization aims to produce a concise summary that contains salient information from a set of source documents. In this field, sentence ranking has hitherto been the issue of most concern [24]. Since documents often cover a number of topic themes, with each theme represented by a cluster of highly related sentences, sentence clustering was recently explored in the literature in order to provide more informative summaries. Existing cluster-based
ranking approaches applied clustering and ranking in isolation. As a result, the ranking performance is inevitably influenced by the clustering result. The authors propose a reinforcement approach that tightly integrates ranking and clustering by mutually and simultaneously updating each other, so that the performance of both can be improved. Experimental results on the DUC datasets demonstrate its effectiveness and robustness.

J. Median Fuzzy C-Means Clustering

Median clustering is a powerful methodology for prototype-based clustering of similarity/dissimilarity data [25]. This contribution combines the median c-means algorithm with the fuzzy c-means approach, which in its original variant is only applicable to vectorial (metric) data. For the resulting median fuzzy c-means approach, the authors prove convergence and investigate the behavior of the algorithm in several experiments, including real-world data from psychotherapy research.

K. Correlation Similarity for Document Clustering

Document clustering aims to automatically group related documents into clusters. It is one of the most important tasks in machine learning and artificial intelligence and has received much attention in recent years. Based on various distance measures, a number of methods have been proposed to handle document clustering. A typical and widely used distance measure is the Euclidean distance. The k-means method is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centers. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce computational complexity. The authors of [26] propose a new document clustering method based on correlation preserving indexing (CPI), which explicitly considers the manifold structure embedded in the similarities between the documents.
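The k-means objective mentioned above, the sum of squared Euclidean distances between points and their cluster centers, can be illustrated with a small sketch. It is a plain Lloyd-style iteration with caller-supplied initial centers and a fixed number of iterations; both choices, as well as the function names and toy data, are assumptions made to keep the example deterministic, and the sketch is unrelated to the CPI method itself.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def kmeans(points, initial_centers, iterations=10):
    """Lloyd-style k-means with fixed initial centers (an assumption
    made for reproducibility). Returns (labels, centers, objective)."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centers)
    # Objective: sum of squared distances to the assigned centers.
    objective = sum(
        squared_distance(p, centers[label]) for p, label in zip(points, labels)
    )
    return labels, centers, objective


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

Each iteration can only keep or lower the objective, which is why the procedure converges to a local minimum of the squared-distance criterion.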
CPI aims to find an optimal semantic subspace by simultaneously maximizing the correlations between the documents in the local patches and minimizing the correlations between the documents outside these patches. This is different from LSI and LPI, which are based on a dissimilarity measure (Euclidean distance) and are focused on detecting the intrinsic structure between widely separated documents rather than between nearby documents. The similarity-measure-based CPI method instead focuses on detecting the intrinsic structure between nearby documents. Since the intrinsic semantic structure of the document space is often embedded in the similarities between the documents, CPI can effectively detect the intrinsic semantic structure of the high-dimensional document space. In this respect, it is similar to Latent Dirichlet Allocation (LDA), which attempts to capture significant intra-document statistical structure (intrinsic semantic structure embedded in the similarities between the documents) via the mixture distribution model.

L. Fuzzy Approach for Multitype Relational Data Clustering

Generally, a pairwise relation can be described by similarities or dissimilarities between each pair of objects in the given dataset. The work in [27] considers only a similarity-type relation, which means that the larger the value of the relationship between two objects, the more similar the two objects are. Co-clustering is treated as bipartite graph partitioning by calculating the singular value decomposition of the data matrix. A bipartite graph consists of two types of nodes, and edges only exist between nodes of different types.
Since homogeneous relational data correspond to a graph that consists of nodes of the same type, a data matrix such as the document-term matrix, which corresponds to a bipartite graph, can be treated as bitype heterogeneous relational data, i.e., a relation between two different object types. Multitype relational data may form various structures, depending on the availability of relations. A star structure is a special case where relations only exist between the central type and several attribute types. It is possible to transform multitype relational data into one of the basic data representation forms and then use an existing approach to obtain the clusters of objects of the type of interest. However, useful information may be lost during data transformation. Moreover, clustering each type of object individually loses the chance of mutual improvement among clusters of different object types, and is unable to capture the interrelated patterns among different types, which may be of interest in some data mining applications.

III. CONCLUSION

Clustering, one of the conventional data mining strategies, is an unsupervised learning paradigm. Here, clustering methods endeavor to recognize
intrinsic groupings of the text documents, so that a set of clusters is formed in which clusters display high intra-cluster similarity and low inter-cluster similarity. Normally, text document clustering endeavors to separate the documents into groups where every group characterizes some subject that is different from the topics characterized by the other groups. In this article, a survey of sentence-level clustering algorithms for text data is presented. A good clustering of text requires effective feature selection and a proper choice of the algorithm for the task at hand. Many algorithms used to address these problems are discussed in detail. Neuro-fuzzy clustering approaches can be used to improve the overall performance of the clustering approaches.

REFERENCES

[1] V. Hatzivassiloglou, J.L. Klavans, M.L. Holcombe, R. Barzilay, M. Kan, and K.R. McKeown, SIMFINDER: A Flexible Clustering Tool for Summarization, Proc. NAACL Workshop Automatic Summarization, pp. 41-49, 2001.
[2] H. Zha, Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering, Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 113-120, 2002.
[3] D.R. Radev, H. Jing, M. Stys, and D. Tam, Centroid-Based Summarization of Multiple Documents, Information Processing and Management: An Int'l J., vol. 40, pp. 919-938, 2004.
[4] R.M. Aliguyev, A New Sentence Similarity Measure and Sentence Based Extractive Technique for Automatic Text Summarization, Expert Systems with Applications, vol. 36, pp. 7764-7772, 2009.
[5] R. Kosala and H. Blockeel, Web Mining Research: A Survey, ACM SIGKDD Explorations Newsletter, vol. 2, no. 1, pp. 1-15, 2000.
[6] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[7] J.B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proc. Fifth Berkeley Symp. Math. Statistics and Probability, pp. 281-297, 1967.
[8] G. Ball and D. Hall, A Clustering Technique for Summarizing Multivariate Data, Behavioural Science, vol. 12, pp. 153-155, 1967.
[9] J.C. Dunn, A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well-Separated Clusters, J. Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[10] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[11] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.
[12] U.V. Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[13] B.J. Frey and D. Dueck, Clustering by Passing Messages between Data Points, Science, vol. 315, pp. 972-976, 2007.
[14] S. Theodoridis and K. Koutroumbas, Pattern Recognition, fourth ed. Academic Press, 2008.
[15] C.D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge Univ. Press, 2008.
[16] P. Corsini, F. Lazzerini, and F. Marcelloni, A New Fuzzy Relational Clustering Algorithm Based on the Fuzzy C-Means Algorithm, Soft Computing, vol. 9, pp. 439-447, 2005.
[17] Andrew Skabar and Khaled Abdalgader, Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, 2013.
[18] Richard Khoury, Sentence Clustering Using Parts-of-Speech, I.J. Information Engineering and Electronic Business, 2012, 1, 1-9.
[19] Zhengchen Zhang, Shuzhi Sam Ge, and Hongsheng He, Mutual-reinforcement document summarization using embedded graph based sentence clustering for storytelling, Information Processing and Management: an International Journal, vol. 48, no. 4, July 2012.
[20] R. Vasanth Kumar Mehta, B. Sankarasubramaniam, S.
Rajalakshmi, An algorithm for fuzzy-based sentence-level document clustering for micro-level contradiction analysis, ICACCI '12: Proceedings of the International Conference on Advances in Computing, Communications and Informatics, 2012.
[21] Martina Naughton, Nicola Stokes, and Joe Carthy, Sentence-Level Event Classification in Unstructured Texts, Information Retrieval, vol. 13, no. 2, pp. 132-156, April 2010.
[22] Y. Li, D. McLean, Z.A. Bandar, J.D. O'Shea, and K. Crockett, Sentence Similarity Based on Semantic Nets and Corpus Statistics, IEEE Trans. Knowledge and Data Eng., vol. 18, no. 8, pp. 1138-1150, Aug. 2006.
[23] Jesús Andrés-Ferrer, Germán Sanchis-Trilles, and Francisco Casacuberta, Similarity word-sequence kernels for sentence clustering, SSPR&SPR'10: Proceedings of the 2010 joint IAPR international conference on Structural, syntactic, and statistical pattern recognition, pp. 610-619.
[24] Xiaoyan Cai, Wenjie Li, You Ouyang, and Hong Yan, Simultaneous ranking and clustering of sentences: a reinforcement approach to multi-document summarization, COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 134-142, 2010.
[25] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann, Median Fuzzy C-Means for Clustering Dissimilarity Data, Neurocomputing, vol. 73, nos. 7-9, pp. 1109-1116, 2010.
[26] Taiping Zhang, Yuan Yan Tang, Bin Fang, and Yong Xiang, Document Clustering in Correlation Similarity Measure Space, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 6, June 2012.
[27] Jian-Ping Mei and Lihui Chen, A Fuzzy Approach for Multitype Relational Data Clustering, IEEE Transactions on Fuzzy Systems, vol. 20, no. 2, April 2012.