NLP - Module 5
▪ Involves identifying a good document descriptor (GD).
▪ A GD helps describe the content of a document and discriminate it
from the other documents in the collection.
TREC – Text Retrieval Conference method for phrase extraction
Non-Classical IR Model
It is the opposite of the classical IR model. Such IR models are
based on principles other than similarity, probability, and Boolean
operations. The information logic model, situation theory model, and
interaction model are examples of non-classical IR models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of
specific techniques from other fields. The cluster model, fuzzy
model, and latent semantic indexing (LSI) model are examples of
alternative IR models.
Classical IR Model
Boolean Model
▪ Based on Boolean Logic and Classical set theory.
▪ Documents are represented as sets of keywords and stored in an
inverted file.
▪ In the Boolean retrieval model, any query can be posed as a Boolean
expression in which the terms are logically combined using the
operators AND, OR, and NOT.
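Given a finite set of index terms
$T = \{t_1, t_2, \ldots, t_i, \ldots, t_m\}$
and a finite set of documents
$D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$
a query Q in normal form is:
$Q = \bigwedge (\bigvee \theta_i), \quad \theta_i \in \{t_i, \neg t_i\}$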
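The normal-form query can be illustrated with a minimal sketch, assuming documents are stored as plain keyword sets. The document names, the terms, and the `matches` helper are hypothetical and chosen only for illustration; a literal is modelled as a (term, negated) pair.

```python
# Toy collection: each document is a set of keywords (made-up data).
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "logic"},
    "d3": {"retrieval", "fuzzy"},
}

# A query in normal form: an AND of clauses, each clause an OR of literals.
# A literal is (term, negated); negated=True stands for NOT term.
query = [
    [("information", False), ("fuzzy", False)],  # information OR fuzzy
    [("boolean", True)],                         # NOT boolean
]

def matches(doc_terms, cnf_query):
    """Return True if the keyword set satisfies every clause of the query."""
    return all(
        any((term in doc_terms) != negated for term, negated in clause)
        for clause in cnf_query
    )

# Keep the documents for which the whole expression evaluates to True.
result = sorted(name for name, terms in docs.items() if matches(terms, query))
print(result)
```

Each document either satisfies the expression or it does not, which is the binary decision criterion of the Boolean model.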
➢ The Boolean AND of two logical statements x and y means that both
x and y must be satisfied; it yields a set of documents that is
smaller than or equal to the set retrieved by either statement alone.
➢ The Boolean OR of the same two statements means that at least one
of them must be satisfied; it yields a set of documents that is
greater than or equal to the set retrieved by either statement alone.
➢ Any number of logical statements can be combined using the three
Boolean operators.
➢ The queries are designed as Boolean expressions which have
precise semantics and the retrieval strategy is based on binary
decision criterion.
➢ Google, the most famous web search engine in recent times, also
ranks its web page result set using a two-stage system:
• The Boolean model builds the indices for the terms in the query,
treating each index term as either present or absent in a document.
Term-Document Incidence Matrix: This is one of the basic mathematical
models for representing text data. It records, for each term, which
documents contain it, and can be used to answer any query posed as a
Boolean expression under the Boolean retrieval model.
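A term-document incidence matrix can be sketched as follows. This is a minimal illustration on made-up documents, not a production implementation; the helper names `bool_and`, `bool_or`, and `bool_not` are hypothetical.

```python
# Toy documents (made-up text for illustration).
docs = {
    "d1": "boolean retrieval uses set theory",
    "d2": "vector models rank documents",
    "d3": "boolean queries use and or not",
}
doc_ids = sorted(docs)
vocab = sorted({word for text in docs.values() for word in text.split()})

# incidence[term] is a row of 0/1 flags, one per document in doc_ids order.
incidence = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocab
}

def bool_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bool_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bool_not(a):
    return [1 - x for x in a]

# Query: boolean AND NOT retrieval
row = bool_and(incidence["boolean"], bool_not(incidence["retrieval"]))
hits = [d for d, flag in zip(doc_ids, row) if flag]
print(hits)
```

Each Boolean operator becomes an element-wise operation on the 0/1 rows, so the answer to a query is simply the set of columns left with a 1.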
• This model has good precision, since documents are retrieved only
if the condition is matched, but it does not scale well with the size
of the corpus; an inverted index is a good alternative.
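An inverted index stores, for each term, only the documents that contain it, which avoids the mostly empty rows of an incidence matrix. The sketch below is a minimal illustration on made-up documents; the `postings` helper is a hypothetical name.

```python
from collections import defaultdict

# Toy documents keyed by numeric id (made-up text for illustration).
docs = {
    1: "boolean retrieval uses set theory",
    2: "vector models rank documents",
    3: "boolean queries use and or not",
}

# Build the index: term -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def postings(term):
    """Return the posting set of a term, or an empty set if it is unseen."""
    return index.get(term, set())

# Query: boolean AND (retrieval OR queries) AND NOT theory
candidates = postings("boolean") & (postings("retrieval") | postings("queries"))
hits = sorted(candidates - postings("theory"))
print(hits)
```

AND, OR, and NOT map to set intersection, union, and difference on the posting sets.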
Processing the data for Boolean retrieval model
Drawbacks
First, the model cannot retrieve documents that are only partly
relevant to the user query; every decision is “to be or not to be”.
Drawbacks
▪ The results returned by this model may only partly match the user
query.
▪ A threshold value must be determined for the initially retrieved set.
▪ The number of relevant documents for a query is too small for the
probability to be estimated accurately.
Term Weighting
▪ Term weighting is a procedure that takes place during the text
indexing process in order to assess the value of each term to the
document.
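As an illustration of term weighting, the sketch below uses tf-idf, one common weighting scheme. The text above does not name a specific scheme, so the choice of tf-idf, the toy documents, and the `tf_idf` helper are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy documents (made-up text for illustration).
docs = {
    "d1": "boolean retrieval boolean logic",
    "d2": "vector retrieval model",
    "d3": "fuzzy logic model",
}
tokenized = {d: text.split() for d, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized.values() for term in set(tokens))

def tf_idf(term, doc_id):
    """Weight = raw term frequency * log(N / document frequency)."""
    tf = tokenized[doc_id].count(term)
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / df[term])

# Weight of every term within every document.
weights = {
    d: {t: tf_idf(t, d) for t in set(tokens)}
    for d, tokens in tokenized.items()
}
print(weights["d1"])
```

Under this scheme a term that is frequent in one document but rare across the collection receives a high weight, while a term that appears in most documents receives a low one.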
$W = T S D^T$
For the k largest singular values,
$W_k = T_k S_k D_k^T$
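The rank-k approximation can be sketched with NumPy's `numpy.linalg.svd`. Using NumPy is an assumption, and the matrix values are made up; here the rows of W are taken to be terms and the columns documents, and the names `T`, `s`, and `Dt` stand in for T, the diagonal of S, and the transpose of D.

```python
import numpy as np

# Toy term-document matrix W (rows = terms, columns = documents; made-up values).
W = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
])

# Singular value decomposition: W = T @ diag(s) @ Dt, with s sorted descending.
T, s, Dt = np.linalg.svd(W, full_matrices=False)

# Keep only the k largest singular values and the matching columns/rows.
k = 2
W_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

print(np.round(W_k, 3))
```

`W_k` has the same shape as `W` but rank at most k.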
LEXICAL RESOURCES
A lexicon, or lexical resource, is a collection of words and/or phrases
along with associated information, such as part-of-speech and sense
definitions. Lexical resources are secondary to texts, and are usually
created and enriched with the help of texts.
WordNet
FrameNet
Tools such as stemmers, taggers, and parsers
Test corpus
WordNet
Word Sense
Words are ambiguous, which means the same word can be used
differently depending on the context.
For example, a 'bank' could be a river bank or a financial institution.
These context-dependent meanings are captured by the notion of a
sense (or word sense).
A sense (or word sense) is a discrete representation of one aspect of
the meaning of a word.
Semantic Relations
Synonymy: The senses of two separate words are synonyms if their
meanings are identical or nearly identical. Examples: center/middle,
run/jog.
Antonymy: Antonyms are words with opposite meanings.
Examples: dark/light, fast/slow.
Taxonomic Relations
Word senses can be related taxonomically so that they can be
classified in certain categories.
A word (or sense) is a hyponym of another word (or sense) if the
first denotes a subclass of the second; conversely, the second is
called a hypernym of the first. For example, man is a hyponym of
animal, and animal is a hypernym of man. Alternatively, the
hyponym/hypernym relation can be described as an IS-A relationship:
'Man IS-A animal'.
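The IS-A relationship can be sketched with a small hand-built taxonomy. The dictionary below is a toy example, not data from WordNet; the entries other than man/animal and the helper names `hypernym_chain` and `is_a` are hypothetical.

```python
# Toy taxonomy: each word maps to its direct hypernym (made-up entries).
hypernym_of = {
    "man": "animal",
    "dog": "animal",
    "animal": "organism",
}

def hypernym_chain(word):
    """Return all hypernyms of a word, from the nearest to the most general."""
    chain = []
    while word in hypernym_of:
        word = hypernym_of[word]
        chain.append(word)
    return chain

def is_a(word, candidate):
    """True if candidate is a (possibly indirect) hypernym of word."""
    return candidate in hypernym_chain(word)

print(is_a("man", "animal"))   # man IS-A animal
print(is_a("animal", "man"))   # the relation does not hold in reverse
```

Following the chain upward shows that the relation is transitive but not symmetric: a hyponym inherits every hypernym of its hypernym, while the reverse lookup fails.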
Meronymy: The 'part-whole' relationship is called meronymy. For
example, a wheel is part of a car.
What is WordNet?