Text Classification
Synopsis
A topic model is designed to automatically extract topics from a corpus of text documents. Here, a topic is a collection of terms that co-occur frequently in the documents of the corpus. Due to the nature of language use, the terms that constitute a topic are often semantically related. Topic models were originally developed in the fields of natural language processing (NLP) and information retrieval (IR) as a means of automatically indexing, searching, clustering, and structuring large corpora of unstructured and unlabeled documents. Using topic models, documents can be represented by the topics within them, and thus the entire corpus can be indexed and organized in terms of this discovered semantic structure.

The topic model is a statistical language model that relates words and documents through topics. It is based on the idea that documents are made up of a mixture of topics, where topics are distributions over words. Specifically, the topic model used here is based on the Latent Dirichlet Allocation (LDA) model, which has become a popular model for discrete data such as collections of text documents.

Key Features of Topic Model
- It is an unsupervised learning technique, so the often human-intensive task of finding labelled examples is eliminated.
- It probabilistically identifies groups of words that tend to co-occur and treats these groups as semantic topics.
- It helps in automatically summarizing a document collection.
- It relates words and documents through topics.

Hardware/Software Specification
- Java (Core and Advanced)

Overview of Text Classification
Text classification is the task of automatically sorting a set of documents into categories from a predefined set.

Applications:
- identification of document genre
- automated indexing of scientific articles according to predefined thesauri of technical terms
- automated population of hierarchical catalogues of Web resources
- spam filtering
- automated essay grading

Objective
The main objectives of this project are:
- Data set collection
- Splitting the data set into training and test sets
- Pre-processing of the training data set
- Implementing the Latent Dirichlet Allocation algorithm
- Implementing frequency-based text classification using a similarity function
- Comparative analysis of the outputs of the two approaches

Advantages
- Frees organizations from the need to manually organize document bases
- Cuts costs for organizations
- Saves time
- Can achieve high classification accuracy
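
The description of documents as mixtures of topics, with topics as distributions over words, can be written compactly as the standard generative process of smoothed LDA. The notation below is the conventional one and is an assumption of this sketch, not something fixed by the synopsis: K topics, D documents, N_d words in document d, and Dirichlet hyperparameters alpha and beta.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here \phi_k is the word distribution of topic k, \theta_d is the topic mixture of document d, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is the observed word. Implementing LDA amounts to inferring \theta and \phi from the observed words alone.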
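
The frequency-based classification objective can be illustrated with a minimal Java sketch. It assumes that the similarity function is cosine similarity over raw term counts and that pre-processing is limited to lowercasing and splitting on non-letters; the synopsis fixes neither choice, and the class name, category labels, and sample sentences are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the class name, tokenizer, and similarity choice are assumptions.
final class FrequencyClassifier {
    // One term-frequency profile per category, summed over its training documents.
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Lowercase the text, split on runs of non-letters, and count each term.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two sparse count vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            normA += (double) x * x;
            Integer y = b.get(e.getKey());
            if (y != null) {
                dot += (double) x * y;
            }
        }
        for (int y : b.values()) {
            normB += (double) y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Add a labelled training document to its category profile.
    void train(String category, String text) {
        Map<String, Integer> profile = profiles.computeIfAbsent(category, k -> new HashMap<>());
        termFrequencies(text).forEach((term, n) -> profile.merge(term, n, Integer::sum));
    }

    // Return the category whose profile is most similar to the text, or null if untrained.
    String classify(String text) {
        Map<String, Integer> tf = termFrequencies(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(tf, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        FrequencyClassifier c = new FrequencyClassifier();
        c.train("spam", "win money now claim your free prize money");
        c.train("science", "the experiment measured protein folding in cells");
        System.out.println(c.classify("claim your free money"));       // prints "spam"
        System.out.println(c.classify("protein experiment in cells")); // prints "science"
    }
}
```

A fuller implementation would likely add stop-word removal, stemming, and a weighting such as TF-IDF during pre-processing. The predictions on the held-out test set could then be compared against the LDA-based approach for the comparative analysis.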