UNIT 4 Information Retrieval Using NLP
5. Retrieval:
Once the similarity scores between the query vector and all document vectors are computed,
documents are ranked based on these scores.
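A minimal sketch of this ranking step, using scikit-learn and assuming TF-IDF term weighting with cosine similarity (the corpus and query below are only illustrative):

```python
# Minimal VSM retrieval sketch: TF-IDF vectors + cosine-similarity ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "climate change affects global weather patterns",
    "the stock market reacted to interest rate changes",
    "rising temperatures are a sign of climate change",
]
query = "climate change"

# Build the term-document matrix and project the query into the same space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Score every document against the query and rank by similarity.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {documents[idx]}")
```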
Pros:
1. Flexibility:
VSM can handle various types of text data and can be adapted to different
NLP tasks.
It's versatile and can accommodate different representations and similarity
measures.
2. Efficiency:
Once the Term-Document Matrix is constructed, retrieval is efficient.
Retrieval time is generally proportional to the number of terms in the query,
making it suitable for large-scale retrieval tasks.
3. Simple Implementation:
The concept of representing documents and queries as vectors is
straightforward and easy to implement.
It doesn't require complex algorithms or heavy computational resources for
implementation.
4. Interpretability:
The similarity scores produced by VSM have an intuitive interpretation.
Users can understand the relevance of retrieved documents based on their
similarity to the query.
5. Scalability:
VSM scales well to large collections of documents.
It's suitable for applications where the corpus contains millions of documents.
6. Customization:
VSM allows for customization of various parameters such as term weighting
schemes, similarity measures, and dimensionality reduction techniques.
This customization enables tailoring the model to specific requirements and
datasets.
Cons:
1. Sparse Representation:
Large vocabularies lead to sparse vectors, where most elements are zeros.
This sparsity can affect the efficiency of computation and storage.
2. Semantic Gap:
VSM treats terms as independent and doesn't capture semantic relationships
well.
It may fail to understand the meaning or context of words, leading to
mismatches between query terms and relevant documents.
3. Curse of Dimensionality:
With large vocabularies the vector space becomes very high-dimensional, and in such spaces distances between vectors become less discriminative (the curse of dimensionality).
High-dimensional vectors are also computationally expensive to handle and may require dimensionality reduction techniques (see the sketch after this list).
4. Lack of Context:
VSM ignores the order and context of words within documents.
This can lead to inaccurate results, especially in tasks where context is crucial,
such as sentiment analysis or language translation.
5. Need for Preprocessing:
VSM heavily relies on preprocessing steps such as tokenization, stop word
removal, and stemming.
Poor preprocessing can lead to suboptimal results and affect the quality of
retrieval.
6. Difficulty with Synonyms and Polysemous Words:
VSM may struggle with synonyms and polysemous words, as it treats each
term independently.
Variations in word usage can lead to mismatches between query terms and
relevant documents.
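As noted in points 1 and 3 above, sparse, high-dimensional term vectors are a practical problem. A minimal sketch of one common remedy, projecting TF-IDF vectors into a lower-dimensional space with truncated SVD (Latent Semantic Analysis) via scikit-learn; the corpus is illustrative:

```python
# Sketch: reduce sparse TF-IDF vectors to a dense low-dimensional space (LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "climate change affects global weather patterns",
    "the stock market reacted to interest rate changes",
    "rising temperatures are a sign of climate change",
]

tfidf = TfidfVectorizer().fit_transform(documents)   # sparse, vocabulary-sized
print("original dimensionality:", tfidf.shape[1])

svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions
reduced = svd.fit_transform(tfidf)                   # dense, 2-dimensional
print("reduced shape:", reduced.shape)
```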
Evaluation Metrics for NER -
Evaluation metrics for Named Entity Recognition (NER) measure the performance of NER
systems by comparing their predicted named entities to the ground truth (annotated) entities.
Here are some common evaluation metrics used in NER:
1. Precision, Recall, and F1-score:
Precision: Precision measures the accuracy of the positive predictions made by the
model. It is the ratio of correctly predicted positive entities to the total entities
predicted as positive.
Recall: Measures the completeness of the predicted entities. It calculates the ratio of
correctly predicted entities to the total number of actual entities.
F1-score: The harmonic mean of precision and recall, providing a balance between
precision and recall.
2. Accuracy:
Accuracy: Measures the overall correctness of the predicted entities. It calculates the
ratio of correctly predicted entities to the total number of entities.
3. Entity-Level Metrics:
Correct Entities (CE): The number of correctly predicted entities.
Partial Entities (PE): The number of partially overlapping entities (e.g., predicting
"New York City" instead of "New York").
Missed Entities (ME): The number of ground truth entities that were not predicted.
4. Token-Level Metrics:
Token-Level Precision: Measures the proportion of correctly predicted tokens among
all tokens predicted as entities.
Token-Level Recall: Measures the proportion of correctly predicted tokens among all
tokens that should have been predicted as entities.
Token-Level F1-score: The harmonic mean of token-level precision and recall.
5. CoNLL Evaluation:
The CoNLL evaluation measures precision, recall, and F1-score at the token level and
takes into account exact entity matching.
It's commonly used for evaluating NER systems, especially in shared tasks and
competitions.
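A minimal sketch of CoNLL-style evaluation using the third-party seqeval package (an assumption here; any tool that compares BIO tag sequences works similarly); the tag sequences below are illustrative:

```python
# CoNLL-style NER evaluation over BIO tag sequences (one list per sentence).
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PER", "I-PER", "O", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O", "B-ORG", "O", "B-LOC", "O"]]

# seqeval matches whole entities, so the truncated LOC span counts as an error.
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```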
Example –
Let's use a sentence where the prediction can lead to different values for precision, recall,
and F1-score.
Sentence: "Elon Musk is the CEO of Tesla and he lives in Palo Alto, California."
Ground Truth:
"Elon Musk" - PERSON
"Tesla" - ORGANIZATION
"Palo Alto, California" - LOCATION
Predicted:
"Elon Musk" - PERSON
"Tesla" - ORGANIZATION
"Palo Alto" - LOCATION
Calculating Precision, Recall, and F1-score:
True Positives (TP):
"Elon Musk" - Correctly predicted as PERSON
"Tesla" - Correctly predicted as ORGANIZATION
False Positives (FP):
None ("Palo Alto" is a partial match of the ground-truth span; strict exact matching would instead count it as a false positive)
False Negatives (FN):
"Palo Alto, California" - Missed from predictions
Entity Extraction:
Entity Extraction, also known as Named Entity Recognition (NER), involves identifying and
classifying named entities in text into predefined categories such as person names,
organization names, locations, dates, numerical expressions, and more.
Working of Entity Extraction:
1. Preprocessing:
Tokenize the input text into words or subwords.
Remove irrelevant information like punctuation.
2. Feature Extraction:
Extract relevant features from the text, which may include:
Word embeddings
Part-of-speech (POS) tags
Contextual information
3. Model Prediction:
Use a pre-trained NER model or train a new model on labeled data.
For each token in the text, predict its named entity category.
Many models use a token-level tagging approach, where each token is tagged
with its entity category.
4. Post-processing:
Refine the predictions to improve accuracy and consistency.
Resolve conflicts and handle complex cases like nested entities.
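A minimal sketch of steps 1 to 3 using spaCy (assuming the en_core_web_sm model has been downloaded); the sentence is the one used in the evaluation example above:

```python
# Named entity extraction with a pre-trained spaCy pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenization, tagging, and NER in one pipeline
doc = nlp("Elon Musk is the CEO of Tesla and he lives in Palo Alto, California.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# spaCy's labels (PERSON, ORG, GPE, DATE, ...) correspond to the categories
# described above; the exact predictions depend on the model used.
```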
Relation Extraction:
Relation Extraction involves identifying and extracting relationships between entities
mentioned in text. These relationships can represent various types of connections between
entities, such as ownership, affiliation, location, etc.
Working of Relation Extraction:
1. Entity Extraction:
Extract named entities from the text using techniques like NER.
2. Dependency Parsing:
Analyze the syntactic structure of the sentence to identify relationships
between entities.
Use techniques like dependency parsing to identify the grammatical
relationships between words in the sentence.
3. Pattern Matching:
Use predefined patterns or rules to identify specific relationships between
entities.
For example, a pattern like "X is the CEO of Y" can be used to identify the
CEO relationship between two entities.
4. Supervised Learning:
Train a supervised machine learning model to predict relationships between
entities based on labeled data.
Features for the model may include entity types, syntactic features, and
contextual information.
5. Post-processing:
Refine the predicted relationships to improve accuracy and coherence.
Resolve conflicts and handle cases where multiple relationships exist between
the same pair of entities.
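A minimal sketch of the pattern-matching step (3) above, using a simple regular-expression rule of the form "X, the CEO of Y" / "X is the CEO of Y"; real systems typically combine such rules with NER output and dependency parses:

```python
# Rule-based relation extraction with a simple "CEO of" pattern.
import re

text = "Elon Musk, the CEO of SpaceX, was born on June 28, 1971, in Pretoria, South Africa."

# Very rough name matcher for illustration: sequences of capitalized words.
NAME = r"[A-Z][\w]+(?:\s+[A-Z][\w]+)*"
pattern = re.compile(rf"({NAME}),? (?:is |was )?the CEO of ({NAME})")

for match in pattern.finditer(text):
    subject, obj = match.groups()
    print((subject, "CEO of", obj))
# -> ('Elon Musk', 'CEO of', 'SpaceX')
```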
Example:
Let's consider the sentence: "Elon Musk, the CEO of SpaceX, was born on June 28, 1971, in
Pretoria, South Africa."
Entity Extraction:
"Elon Musk" - PERSON
"SpaceX" - ORGANIZATION
"June 28, 1971" - DATE
"Pretoria" - LOCATION
"South Africa" - LOCATION
Relation Extraction:
CEO of:
Subject: "Elon Musk"
Object: "SpaceX"
Relation: "CEO of"
Place of Birth:
Subject: "Elon Musk"
Object: "Pretoria, South Africa"
Relation: "Place of Birth"
In summary, Entity Extraction involves identifying named entities in text, while Relation
Extraction involves identifying relationships between these entities.
Reference Resolution:
Identifies all expressions in a text that refer to the same entity, including pronouns, proper nouns, and even descriptions.
This is a broader category encompassing various ways words can refer to entities.
Example: In the sentence "Barack Obama, the 44th president of the United States,
delivered a speech. He spoke about the importance of education."
o Here, "Barack Obama" and "he" both refer to the same entity (Barack Obama).
o Reference resolution would identify both mentions.
Coreference Resolution:
Focuses specifically on resolving pronouns to the entities they refer to within a text.
It's a subcategory of reference resolution that deals with pronouns like "he," "she," "it," "they," etc.
Example: Consider the sentence "Alice went to the store. She bought some groceries."
o Coreference resolution would identify "Alice" and "she" as referring to the
same entity (Alice).
Key Differences:
Scope: Reference resolution covers a wider range of expressions, including pronouns,
proper nouns, and descriptions. Coreference resolution is limited to pronouns.
Focus: Reference resolution aims to find all mentions of the same entity, regardless of
the type of expression. Coreference resolution specifically targets pronouns and the
entities they refer to.
Here's an analogy: Imagine a party.
Reference resolution: Identifies everyone at the party, including people with name
tags (proper nouns), those described by their clothes (descriptions), and people you
only know by sight (pronouns).
Coreference Resolution: Focuses on figuring out who people are referring to when
they use pronouns like "he" or "she" at the party.
Where to use Coreference Resolution –
Text understanding
Document summarization
Information extraction
Sentiment analysis
Machine translation
Cross-Language Information Retrieval (CLIR) – How it works:
Query Translation: The first step in CLIR is to translate the user's query from the source
language into the target language. This can be done using machine translation techniques.
Document Retrieval: Once the query is translated, the system searches for relevant
documents in the target language using traditional information retrieval methods. This could
involve searching through indexed documents or web pages.
Result Translation: After retrieving relevant documents, the system may translate them
back into the source language for presentation to the user.
Example:
Let's say a user who speaks English wants to find information about "climate change" in
Spanish documents:
Query Translation: The user's query, "climate change," is translated into Spanish as
"cambio climático."
Document Retrieval: The system searches through a collection of Spanish documents (e.g.,
articles, websites) for those containing the term "cambio climático" or related terms.
Result Translation: Once relevant documents are retrieved, the system may translate them
back into English for the user to read and understand.
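A minimal sketch of the three CLIR steps above. The translate() helper is a hypothetical stand-in backed by a tiny dictionary; a real system would call a machine translation model or API:

```python
# Toy CLIR pipeline: translate the query, retrieve in the target language,
# then translate results back for the user.

# Hypothetical stand-in for a machine translation component.
TRANSLATIONS = {
    ("en", "es"): {"climate change": "cambio climático"},
    ("es", "en"): {"El cambio climático eleva el nivel del mar.":
                   "Climate change raises the sea level."},
}

def translate(text: str, src: str, tgt: str) -> str:
    return TRANSLATIONS.get((src, tgt), {}).get(text, text)

spanish_docs = [
    "El cambio climático eleva el nivel del mar.",
    "La bolsa subió después del anuncio.",
]

# 1. Query translation (English -> Spanish).
query_es = translate("climate change", "en", "es")

# 2. Document retrieval: simple keyword match in the target language.
hits = [doc for doc in spanish_docs if query_es.lower() in doc.lower()]

# 3. Result translation (Spanish -> English) for presentation.
for doc in hits:
    print(translate(doc, "es", "en"))
```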
Multilingual Search Engines: CLIR allows users to search for information on the web in
languages they may not understand, broadening access to information across linguistic
barriers.