Tokenization: Token Normalization Is The Process of Canonicalizing Tokens So That Matches Occur


IR : the techniques of storing and recovering and often disseminating recorded data, especially through the use of a computerized system.

Purpose of IR : The primary goal of an IR system is to retrieve all the information items that are
relevant to a user query while retrieving as few non-relevant items as possible [58]. Furthermore,
the retrieved information items should be ranked from the most relevant to the least relevant.

Steps of IR : Information retrieval is typically a two-step process: (i) first, potentially relevant documents are identified; (ii) then, the found documents are ranked.

Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words, phrases, or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for further processing such as parsing and text mining.
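
As a minimal sketch in plain Python, tokenization can be done by splitting on runs of non-word characters, which discards punctuation along the way:

    import re

    def tokenize(text):
        # Split on any run of non-alphanumeric characters; punctuation
        # marks are discarded, and the remaining pieces are the tokens.
        return [tok for tok in re.split(r"\W+", text) if tok]

    print(tokenize("Tokens can be words, phrases, or symbols!"))
    # ['Tokens', 'can', 'be', 'words', 'phrases', 'or', 'symbols']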

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
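
A small plain-Python sketch of two common normalizations, case-folding and diacritic removal (the normalize helper itself is illustrative, not a standard API):

    import unicodedata

    def normalize(token):
        # Case-fold, then decompose accented characters and drop the
        # combining marks, so "Résumé" and "resume" match each other.
        token = token.lower()
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(normalize("Résumé"))  # 'resume'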

Stopwords are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, for example words like the, he, and have. Such words are already captured in ready-made stopword corpora, as sketched below.
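
A short sketch of stopword removal using NLTK's bundled stopword corpus (this assumes the nltk package is installed and can download its data):

    import nltk
    nltk.download("stopwords", quiet=True)  # one-time fetch of the corpus
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    tokens = ["he", "said", "the", "documents", "have", "high", "relevance"]
    content = [t for t in tokens if t not in stop_words]
    print(content)  # ['said', 'documents', 'high', 'relevance']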

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” toward the root word “chocolate”.
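
A minimal sketch using NLTK's PorterStemmer (assuming nltk is installed; the exact stems produced are specific to the algorithm):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["chocolates", "chocolatey", "retrieval", "retrieving"]:
        # Variants of a word are stripped toward a shared stem,
        # e.g. both "retrieval" and "retrieving" become "retriev".
        print(word, "->", stemmer.stem(word))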

Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to return
the base or dictionary form of a word, which is known as the lemma. If confronted with the token
saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw
depending on whether the use of the token was as a verb or a noun.
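
A short sketch with NLTK's WordNetLemmatizer, which reproduces the saw example when told the part of speech (assumes nltk is installed and can download the WordNet data):

    import nltk
    nltk.download("wordnet", quiet=True)  # one-time fetch of WordNet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("saw", pos="v"))  # 'see' (used as a verb)
    print(lemmatizer.lemmatize("saw", pos="n"))  # 'saw' (used as a noun)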

Indexing is regarded as the process of describing and identifying documents in terms of their subject contents. It leads to faster search results.
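
For illustration, a minimal inverted index in plain Python (with made-up sample documents) maps each term to the set of documents containing it, which is what makes lookups fast:

    from collections import defaultdict

    docs = {
        1: "information retrieval ranks relevant documents",
        2: "an index makes retrieval fast",
    }

    # Map each term to the ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print(sorted(index["retrieval"]))  # [1, 2]: both documents match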

A vector space model is an algebraic model involving two steps: in the first step we represent the text documents as vectors of words, and in the second step we transform these into a numerical format so that we can apply text mining techniques such as information retrieval, information extraction, information filtering, etc.
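
A plain-Python sketch of the two steps, on made-up documents:

    from collections import Counter

    docs = ["information retrieval system",
            "retrieval of stored information"]

    # Step 1: represent each document as a vector (bag) of words.
    bags = [Counter(d.split()) for d in docs]

    # Step 2: transform to numerical vectors over a shared vocabulary.
    vocab = sorted(set().union(*bags))
    vectors = [[bag.get(term, 0) for term in vocab] for bag in bags]
    print(vocab)    # ['information', 'of', 'retrieval', 'stored', 'system']
    print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 1, 1, 0]]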

Term frequency is the number of times a given term or query appears within a search index.

TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)

IDF(w) = log_e(Total number of documents / Number of documents with term w in it)

TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that
is intended to reflect how important a word is to a document in a collection or corpus.
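
A small plain-Python sketch that implements the TF and IDF formulas above and multiplies them (the toy corpus is made up):

    import math

    docs = [["information", "retrieval", "retrieval"],
            ["information", "extraction"],
            ["text", "mining"]]

    def tf(term, doc):
        # (times the term appears in the document) / (total terms in it)
        return doc.count(term) / len(doc)

    def idf(term, docs):
        # log_e(total documents / documents containing the term)
        containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / containing)

    def tfidf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    # "retrieval" is frequent in doc 0 but rare in the corpus: high weight.
    print(round(tfidf("retrieval", docs[0], docs), 3))    # 0.732
    # "information" appears in 2 of 3 documents: lower weight.
    print(round(tfidf("information", docs[0], docs), 3))  # 0.135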

Cosine similarity checks document similarity by measuring the cosine of the angle between the vectors that represent the documents, i.e., their dot product divided by the product of their magnitudes.
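
A minimal plain-Python sketch of the formula, reusing the term-count vectors from the vector space example above:

    import math

    def cosine_similarity(u, v):
        # cos(theta) = (u . v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    d1 = [1, 0, 1, 0, 1]
    d2 = [1, 1, 1, 1, 0]
    print(round(cosine_similarity(d1, d2), 3))  # 0.577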

Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.

Precision = Total number of documents retrieved that are relevant / Total number of documents that are retrieved.

Recall = Total number of documents retrieved that are relevant / Total number of relevant documents in the database.
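
A small plain-Python sketch of both measures on a made-up result set:

    def precision_recall(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)  # retrieved documents that are relevant
        return hits / len(retrieved), hits / len(relevant)

    # 4 documents retrieved, 3 of them relevant; 6 relevant documents exist.
    p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
    print(p, r)  # 0.75 0.5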

Clustering has been used in information retrieval for many different purposes, such as
query expansion, document grouping, document indexing, and visualization of search
results.
There are different types of clustering methods, including (see the sketch after this list):
Partitioning methods.
Hierarchical clustering.
Fuzzy clustering.
Density-based clustering.
Model-based clustering.
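
As a brief sketch of one of these options, a partitioning method (k-means) applied to TF-IDF document vectors, assuming scikit-learn is installed; the sample documents and cluster count are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "ranking relevant documents for a query",
        "query expansion improves recall",
        "density based clustering of points",
        "hierarchical clustering builds a dendrogram",
    ]

    # Vectorize with TF-IDF, then partition the documents with k-means.
    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [0 0 1 1]: retrieval docs vs. clustering docs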

Document Parsing
Documents come from different sources and in different combinations of languages, formats, and character sets; a single document may even consist of more than one language, e.g., a Spanish mail that has some parts in French. Document parsing therefore deals with the overall document structure: in this phase, the document is broken down into discrete components.

What are the 3 types of search engines?

There are 3 commonly known types of search engines that have been identified during various research projects: navigational, informational, and transactional.

Search engine optimization is the process of growing the quality and quantity of
website traffic by increasing the visibility of a website or a web page to users of a web
search engine. SEO refers to the improvement of unpaid results and excludes direct traffic
and the purchase of paid placement.

What is the Google SEO algorithm?

Google's algorithms are a complex system used to retrieve data from its search index and instantly deliver the best possible results for a query. The search engine uses a combination of algorithms and numerous ranking signals to deliver webpages ranked by relevance on its search engine results pages.

Corpus means a collection of facts or things. In IR, a corpus is associated with the storage, indexing, search, and delivery of multimedia data.

