Text Classification

This document discusses topic modeling and text classification. Topic modeling is an unsupervised machine learning technique that uses probabilistic models to automatically identify topics within a collection of documents based on word co-occurrence. Topic models represent documents as a mixture of topics and topics as a distribution over words. Text classification involves sorting documents into predefined categories. The objectives of this project are to collect a dataset, implement topic modeling using latent Dirichlet allocation and frequency-based text classification, and compare the results.

Uploaded by

Akanksha Gupta
Copyright
© Attribution Non-Commercial (BY-NC)
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Data Mining

Minor Project Report Topic modelling


(Text Classification)

Synopsis
A topic model is a model designed to automatically extract topics from a corpus of text documents. Here, a topic is a collection of terms that co-occur frequently in the documents of the corpus. Due to the nature of language use, the terms that constitute a topic are often semantically related.

Topic models were originally developed in the fields of natural language processing (NLP) and information retrieval (IR) as a means of automatically indexing, searching, clustering, and structuring large corpora of unstructured and unlabelled documents. Using topic models, documents can be represented by the topics within them, and thus the entire corpus can be indexed and organized in terms of this discovered semantic structure.

The topic model is a statistical language model that relates words and documents through topics. It is based on the idea that documents are made up of a mixture of topics, where topics are distributions over words. Specifically, the topic model is based on the latent Dirichlet allocation (LDA) model, which has become a popular model for discrete data, such as collections of text documents.

Key Features of Topic Model
- It is an unsupervised learning technique, which means that the often human-intensive task of finding labelled examples is completely eliminated.
- It probabilistically figures out groups of words that tend to co-occur, and identifies these groups as semantic topics.
- It helps in automatically summarizing a document collection.
- It relates words and documents through topics.

Hardware/Software specification
- Java (Core and Advanced)

Overview of Text Classification
Text classification is the task of automatically sorting a set of documents into categories from a predefined set.

Applications:
- identification of document genre
- automated indexing of scientific articles according to predefined thesauri of technical terms
- automated population of hierarchical catalogues of Web resources
- spam filtering
- automated essay grading

Objective
The main objectives of this project are:
- Data set collection
- Division of the data set into training and test sets
- Pre-processing of the training data set
- Implementation of the latent Dirichlet allocation algorithm
- Implementation of frequency-based text classification using a similarity function
- Comparative analysis of the output

Advantages:
- Frees organizations from the need to organize document bases manually
- Cuts costs for organizations
- Saves time
- Offers high accuracy
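The frequency-based classification objective can be illustrated with a minimal Java sketch. The report does not say which similarity function it uses, so cosine similarity over raw term counts is an assumption here, and the class name FrequencyClassifier and its methods are hypothetical names chosen for illustration. Each category is represented by the summed term frequencies of its training documents, and a new document is assigned to the category whose profile it most resembles.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the class name, method names, and the choice of
// cosine similarity are assumptions, not taken from the report.
class FrequencyClassifier {
    // One summed term-frequency vector ("profile") per category.
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Pre-processing: lowercase, split on non-letters, count each term.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int aCount = e.getValue();
            int bCount = b.getOrDefault(e.getKey(), 0);
            dot += (double) aCount * bCount;
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    private static double sumOfSquares(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Training: add a labelled document's counts to its category profile.
    void train(String category, String text) {
        Map<String, Integer> profile =
                profiles.computeIfAbsent(category, k -> new HashMap<>());
        termFrequencies(text).forEach((term, n) -> profile.merge(term, n, Integer::sum));
    }

    // Classification: return the most similar category, or null if untrained.
    String classify(String text) {
        Map<String, Integer> tf = termFrequencies(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(tf, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In use, train would be called once per document in the training set and classify once per document in the test set. Comparing the predicted labels with the true ones gives the accuracy figure needed for the comparative analysis. A fuller pre-processing step (stop-word removal, stemming) would likely replace the simple tokenizer shown here.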
