Tokenization: Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
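As an illustration, here is a minimal sketch of token normalization in Python. The helper name normalize_token and the particular canonicalization steps (case-folding, accent stripping, punctuation removal) are just one possible choice, not a standard recipe:

```python
import unicodedata
import string

def normalize_token(token: str) -> str:
    """Canonicalize a token so that superficially different
    forms map to the same string and therefore match."""
    # Case-fold: "Windows" and "windows" should match.
    token = token.casefold()
    # Strip accents: "naïve" -> "naive".
    token = unicodedata.normalize("NFKD", token)
    token = "".join(c for c in token if not unicodedata.combining(c))
    # Drop punctuation: "U.S.A." -> "usa".
    return token.translate(str.maketrans("", "", string.punctuation))

print(normalize_token("U.S.A."))  # usa
print(normalize_token("Naïve"))   # naive
```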
Purpose of IR: The primary goal of an IR system is to retrieve all the information items that are
relevant to a user query while retrieving as few non-relevant items as possible [58]. Furthermore,
the retrieved information items should be ranked from the most relevant to the least relevant.
Steps of IR: Information retrieval is typically a two-step process: (i) first, potentially relevant
documents are identified; (ii) then, the retrieved documents are ranked.
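A minimal sketch of this two-step process, assuming an in-memory collection of documents and a crude term-count score (the function retrieve and its scoring rule are illustrative, not a standard API):

```python
def retrieve(query: str, docs: dict[str, str]) -> list[str]:
    """Step (i): identify potentially relevant documents.
    Step (ii): rank them from most to least relevant."""
    terms = query.lower().split()
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Step (i): keep only documents containing at least one query term.
    candidates = {d: toks for d, toks in tokenized.items()
                  if any(t in toks for t in terms)}
    # Step (ii): rank by a crude relevance score (query-term count).
    score = lambda toks: sum(toks.count(t) for t in terms)
    return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

docs = {"d1": "the cat sat", "d2": "cat cat dog", "d3": "dog only"}
print(retrieve("cat", docs))  # ['d2', 'd1']
```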
Stopwords are English words that do not add much meaning to a sentence. They
can safely be ignored without sacrificing the meaning of the sentence, for example
words like the, he, and have. Such words are already captured in predefined stopword
lists (e.g. the NLTK stopwords corpus).
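For example, NLTK ships such a stopword corpus; a small sketch of filtering stopwords from a sentence (this assumes NLTK is installed and downloads its stopword list on first run):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # fetch the stopword corpus once

stop_words = set(stopwords.words("english"))

sentence = "He said that the results have improved"
filtered = [w for w in sentence.lower().split() if w not in stop_words]
print(filtered)  # ['said', 'results', 'improved']
```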
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to return
the base or dictionary form of a word, which is known as the lemma. If confronted with the token
saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw
depending on whether the use of the token was as a verb or a noun.
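A small comparison using NLTK's PorterStemmer and WordNetLemmatizer (assuming NLTK and its wordnet corpus are available; note that the Porter stemmer is milder than the aggressive stemmer implied by the "just s" example above):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # lexicon backing the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# A stemmer chops endings by rule, with no part-of-speech context.
print(stemmer.stem("running"))               # run
# A lemmatizer consults a vocabulary; the POS tag disambiguates.
print(lemmatizer.lemmatize("saw", pos="v"))  # see (verb -> base form)
print(lemmatizer.lemmatize("saw", pos="n"))  # saw (noun stays a noun)
```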
A vector space model is an algebraic model involving two steps: in the first step we represent
the text documents as vectors of words, and in the second step we transform these vectors into
a numerical format so that we can apply text mining techniques such as information retrieval,
information extraction, and information filtering.
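A minimal sketch of these two steps, using plain Python and simple term counts (the variable names and toy documents are illustrative):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: represent each document as a bag of word tokens.
token_lists = [doc.split() for doc in docs]

# Step 2: map tokens to numeric vectors over a shared vocabulary.
vocab = sorted(set(w for tokens in token_lists for w in tokens))
vectors = [[Counter(tokens)[w] for w in vocab] for tokens in token_lists]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```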
Term frequency is the number of times a given term or query appears within a document in the search index.
TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)
IDF(w) = log_e(Total number of documents / Number of documents with term w in it)
TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that
is intended to reflect how important a word is to a document in a collection or corpus.
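The two formulas above can be combined directly; a small sketch (the function names tf, idf, and tfidf are illustrative, and the idf helper assumes the term occurs in at least one document):

```python
import math

def tf(term: str, doc: list[str]) -> float:
    # TF(w) = count of w in the document / total terms in the document
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    # IDF(w) = log_e(total documents / documents containing w)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)

corpus = [d.split() for d in ["the cat sat", "the dog ran", "the cat ran"]]
print(round(tfidf("cat", corpus[0], corpus), 3))  # 0.135
```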
Cosine similarity measures the similarity of two documents as the cosine of the angle between
their vector representations: cos(θ) = (A · B) / (||A|| ||B||). Because it depends only on the
orientation of the vectors and not their lengths, it is well suited to comparing documents of
different sizes.
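A minimal sketch of the formula, reusing the count vectors from the vector space model example above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Count vectors for two documents over the same vocabulary.
doc1 = [1, 0, 1, 1, 1, 2]
doc2 = [0, 1, 0, 0, 1, 1]
print(round(cosine_similarity(doc1, doc2), 3))  # 0.612
```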
Term weighting is the assignment of numerical values to terms that represent their
importance in a document in order to improve retrieval effectiveness.
Clustering has been used in information retrieval for many different purposes, such as
query expansion, document grouping, document indexing, and visualization of search
results.
There are different types of clustering methods, including the following (a small k-means sketch follows the list):
Partitioning methods.
Hierarchical clustering.
Fuzzy clustering.
Density-based clustering.
Model-based clustering.
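As a sketch of the partitioning approach, documents can be vectorized with TF-IDF and grouped with k-means. This assumes scikit-learn is installed; the toy documents and the printed labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "stock markets fell sharply today",
    "stock prices and markets rallied",
]

# Vectorize the documents with TF-IDF, then partition into 2 clusters.
X = TfidfVectorizer().fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 1 1]: cat documents vs. stock documents
```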
Document Parsing
Documents come from different sources and in varying combinations of languages,
formats, and character sets; a single document may even contain more than one language,
e.g. a Spanish email that has some parts written in French.
Thus document parsing deals with the overall document structure: in this phase, the
document is broken down into discrete components.
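A minimal sketch of breaking a document into discrete components, using Python's built-in html.parser on a toy HTML document (the DocumentParser class and the chosen components, title and paragraphs, are illustrative):

```python
from html.parser import HTMLParser

class DocumentParser(HTMLParser):
    """Break an HTML document into discrete components."""
    def __init__(self):
        super().__init__()
        self.components = {"title": "", "paragraphs": []}
        self._tag = None

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.components["title"] += data
        elif self._tag == "p":
            self.components["paragraphs"].append(data.strip())

parser = DocumentParser()
parser.feed("<html><title>Hola</title><p>Primer parrafo.</p></html>")
print(parser.components)
# {'title': 'Hola', 'paragraphs': ['Primer parrafo.']}
```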
Search engine optimization is the process of growing the quality and quantity of
website traffic by increasing the visibility of a website or a web page to users of a web
search engine. SEO refers to the improvement of unpaid results and excludes direct traffic
and the purchase of paid placement.