Lab2 IR
Assignment 2
Submitted by:
Saqlain Nawaz 2020-CS-135
Supervised by:
Sir Khaldoon Syed Khurshid
NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.
python text_search.py
Explanation and Guide
Imports (Libraries)
import os
import string
import math
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
The "Imports" section includes various Python libraries and modules that are used in the
program. Each import statement serves a specific purpose and contributes to the
functionality of the code. Here's an explanation of each import statement and its role:
import os
● os is a Python module that provides a way to interact with the operating system. In
this program, it is used to manipulate file paths and directories, specifically to access
and process text documents stored in a directory.
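For example, the indexing step later combines os.listdir and os.path.join to locate the .txt files that sit next to the script; a minimal sketch of that pattern:
import os

dir_path = os.path.dirname(os.path.abspath(__file__))  # folder containing this script
for filename in os.listdir(dir_path):  # every entry in that folder
    if filename.endswith('.txt'):  # keep only the text documents
        print(os.path.join(dir_path, filename))  # full path to each document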
import string
● string provides the string.punctuation constant, which the preprocess function uses to strip punctuation characters from each document before tokenization.
import math
● math supplies math.log, used to compute the inverse document frequency (IDF) during scoring, and math.sqrt, used when normalizing the scores.
import nltk
● nltk stands for the Natural Language Toolkit, which is a powerful library for natural
language processing (NLP) and text analysis in Python. It is used extensively in this
program for text preprocessing, tokenization, part-of-speech tagging, and stemming.
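Note that word_tokenize, the stop-word list, and nltk.pos_tag each depend on NLTK data packages that are downloaded separately from the library itself. If the script raises a LookupError, a one-time setup along these lines should fix it:
import nltk

nltk.download('punkt')  # tokenizer models used by word_tokenize
nltk.download('stopwords')  # the English stop-word list
nltk.download('averaged_perceptron_tagger')  # the default tagger behind nltk.pos_tag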
Variables
stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
● Explanation: stop_words is NLTK's built-in set of common English words (e.g., "the", "is", "and"). These words carry little retrieval value, so preprocess filters them out before indexing and scoring.
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
● Explanation: unwanted_chars lists typographic quotes, dashes, and ellipses that string.punctuation does not cover; they are stripped alongside ordinary punctuation during preprocessing.
stemmer = PorterStemmer()
● Explanation: Here, an instance of the Porter Stemmer is initialized as the variable stemmer. The Porter Stemmer reduces words to their root or base form. In this code, it ensures that inflected forms of a word (e.g., "running" and "runs" both stem to "run") are treated as the same term during indexing and searching, which is particularly important for matching query words against the inverted index.
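A quick check makes the behaviour concrete, using the classic example from Porter's original paper:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection"]:
    print(word, "->", stemmer.stem(word))  # all four print the shared stem "connect"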
Functions
def preprocess(document)
def preprocess(document):
    sentence_without_punctuation = "".join([char for char in document if char not in string.punctuation and char not in unwanted_chars])
    words = word_tokenize(sentence_without_punctuation)
    tagged_words = nltk.pos_tag(words)
    processed_doc = []
    for word, pos in tagged_words:
        if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words:
            processed_doc.append(stemmer.stem(word))
    print(processed_doc)
    return processed_doc
Explanation
def preprocess(document):
1. The function takes a single document (a string) as input.
sentence_without_punctuation = "".join([char for char in document
if char not in string.punctuation and char not in unwanted_chars])
2. A list comprehension rebuilds the document with every punctuation character and every character in unwanted_chars removed.
words = word_tokenize(sentence_without_punctuation)
3. The cleaned text is split into individual word tokens with NLTK's word_tokenize.
tagged_words = nltk.pos_tag(words)
4. Each token is tagged with its part of speech using nltk.pos_tag, producing a list of (word, tag) tuples.
processed_doc = []
for word, pos in tagged_words:
5. This is a loop that iterates through each word in tagged_words along with its
associated POS tag.
6. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words: This condition checks whether the word's part-of-speech tag (pos) is one of the listed noun and verb tags, and whether the word is not in the set of stop_words, the common words that are typically filtered out in text analysis.
7. If the word meets the conditions, it is stemmed using the stemmer.stem(word)
function, and the stemmed word is appended to the processed_doc list. Stemming
reduces words to their root form, which can help improve text search and retrieval.
print(processed_doc)
return processed_doc
8. Finally, the list of stemmed words is printed (a debugging aid) and returned to the caller.
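As a rough illustration (the exact output depends on the tagger's decisions, so treat this as a hypothetical run rather than guaranteed output):
# Hypothetical example; actual tags, and therefore the surviving words, may vary:
preprocess("The dogs were running through the garden")
# -> ['dog', 'run', 'garden']  (nouns and verbs kept, stop words dropped, words stemmed)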
def create_index(dir_path)
def create_index(dir_path):
    inverted_index = defaultdict(list)
    doc_count = defaultdict(int)
    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            with open(os.path.join(dir_path, filename)) as file:
                document = file.read().lower()
            processed_doc = preprocess(document)
            doc_word_count = defaultdict(int)
            for word in processed_doc:
                doc_word_count[word] += 1
            for word, count in doc_word_count.items():
                inverted_index[word].append((filename, count))
                doc_count[word] += 1
    return inverted_index, doc_count
Explanation
def create_index(dir_path):
1. The function takes dir_path, the directory containing the text documents, as its only parameter.
inverted_index = defaultdict(list)
doc_count = defaultdict(int)
2. Two defaultdicts are initialized: inverted_index will map each word to a list of (filename, count) tuples, and doc_count will record how many documents contain each word.
3. The function begins iterating over each file in the directory specified by dir_path. It
checks if the file ends with '.txt' to ensure it's a text document and then proceeds to
open and read the file.
4. document = file.read().lower(): The content of the file is read and
converted to lowercase. This step ensures that the text is uniform in terms of case,
making it easier to match queries to documents.
5. processed_doc = preprocess(document): The preprocess function is
called to preprocess the document. This step includes removing punctuation,
tokenizing, part-of-speech tagging, filtering, and stemming, as explained earlier.
doc_word_count = defaultdict(int)
6. For each document, a fresh doc_word_count defaultdict is created to hold that document's term frequencies.
7. A loop iterates through each word in the processed_doc, and for each word, it
increments the count in the doc_word_count dictionary. This step counts how
many times each word appears in the document.
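For example, if processed_doc were ['run', 'cat', 'run'], this loop would leave doc_word_count as {'run': 2, 'cat': 1}.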
for word, count in doc_word_count.items():
inverted_index[word].append((filename, count))
doc_count[word] += 1
8. The code then iterates through the doc_word_count dictionary to get each word
and its count within the document.
9. For each word, it appends a tuple (filename, count) to the inverted_index.
This tuple records the document where the word appears and its term frequency
within that document. Simultaneously, the code updates the doc_count to reflect
that this word has appeared in one more document.
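To make the two structures concrete, suppose (hypothetically) the directory holds doc1.txt and doc2.txt, and the stem 'run' occurs three times in the first and once in the second. After indexing:
# Hypothetical contents after create_index runs:
inverted_index = {'run': [('doc1.txt', 3), ('doc2.txt', 1)]}  # word -> (document, term frequency) pairs
doc_count = {'run': 2}  # word -> number of documents containing it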
def tf_idf(query, document, inverted_index, doc_count, contributing_words)
def tf_idf(query, document, inverted_index, doc_count, contributing_words):
    score = 0
    total_docs = len(os.listdir(dir_path))
    query_words = preprocess(query)
    for word in query_words:
        if word in inverted_index:
            df = doc_count[word]
            idf = math.log(total_docs / df)
            for doc, tf in inverted_index[word]:
                if doc == document:
                    score += tf * idf
                    contributing_words[doc].append(word)
    return score
Explanation
score = 0
total_docs = len(os.listdir(dir_path))
query_words = preprocess(query)
1. score is initialized to zero; it accumulates the document's relevance to the query.
2. total_docs counts the files in the directory (using the module-level dir_path), which is needed for the IDF calculation.
3. The query is run through the same preprocess pipeline as the documents, so query terms and indexed terms are stemmed and filtered identically.
if word in inverted_index:
4. The function then iterates over each word in the preprocessed query. For each word,
it checks if the word exists in the inverted_index. If the word is not in the index, it
is skipped as it won't contribute to the score.
df = doc_count[word]
idf = math.log(total_docs / df)
5. df is the document frequency: the number of documents in which the word appears, taken from doc_count.
6. The inverse document frequency (IDF) is computed as the logarithm of total_docs divided by df, so rarer words receive higher weights.
for doc, tf in inverted_index[word]:
    if doc == document:
        score += tf * idf
        contributing_words[doc].append(word)
7. The code then iterates through the entries in the inverted_index for the word. For
each entry, which is a tuple (document, term frequency), it checks if the
document matches the one being scored (document). If it matches, the term
frequency (TF) for that word in the current document is multiplied by the IDF, and the
result is added to the score.
8. Additionally, the word is added to the list of contributing words for the document in
the contributing_words dictionary. This list keeps track of which words
contribute to the score for each document.
return score
9. Finally, the function returns the calculated score. This score represents the relevance
of the document to the query based on TF-IDF scoring.
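A small worked example with made-up numbers: suppose the directory holds 10 documents, the query word appears in 2 of them, and its term frequency in the document being scored is 3:
import math

total_docs = 10  # documents in the directory (hypothetical)
df = 2  # documents containing the word (hypothetical)
tf = 3  # occurrences of the word in this document (hypothetical)
idf = math.log(total_docs / df)  # ln(5) ≈ 1.609
print(tf * idf)  # ≈ 4.83 added to this document's score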
def search(query, dir_path)
def search(query, dir_path):
    inverted_index, doc_count = create_index(dir_path)
    scores = {}
    contributing_words = defaultdict(list)
    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            scores[filename] = tf_idf(query, filename, inverted_index,
                doc_count, contributing_words)
    # Euclidean norm (magnitude) of the score vector
    norm = math.sqrt(sum(score ** 2 for score in scores.values()))
    if norm != 0:
        # Normalize the scores
        for filename in scores:
            scores[filename] /= norm
    ranked_docs = sorted(scores.items(), key=lambda x: x[1],
        reverse=True)
    return ranked_docs, contributing_words
Explanation
1. The search function takes two parameters: query (the search query) and
dir_path (the directory containing the text documents). It begins by calling the
create_index function to create the inverted index and document count data
structures based on the documents in the specified directory.
scores = {}
contributing_words = defaultdict(list)
2. scores is initialized as an empty dictionary, which will store the relevance scores for
each document. contributing_words is also initialized as a dictionary where
each document's list of contributing words will be stored.
for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        scores[filename] = tf_idf(query, filename, inverted_index,
            doc_count, contributing_words)
3. The code then iterates over each file in the directory. For each file ending with '.txt', it
calculates the TF-IDF score for that document using the tf_idf function. The
relevance score is computed by comparing the query to the document's content. The
resulting score is stored in the scores dictionary, where the keys are the document
filenames, and the values are the computed scores.
norm = math.sqrt(sum(score ** 2 for score in scores.values()))
4. After calculating the relevance scores for all documents, the code computes the Euclidean norm of the score values: the square root of the sum of their squares. The Euclidean norm measures the magnitude of the score vector and is used for normalization.
if norm != 0:
5. The code checks if the calculated norm is not equal to zero to avoid division by zero
in the normalization step.
scores[filename] /= norm
6. If the norm is not zero, the code normalizes the scores by dividing each score by the norm. Normalization keeps the scores within a consistent range, making it easier to rank and compare documents (a worked example follows this list).
7. The normalized scores are sorted in descending order, resulting in a list of ranked
documents. The sorted function is used to sort the scores dictionary items by the
score values (x[1]) in reverse order. This list represents the ranked documents
based on their relevance to the query.
8. Finally, the function returns two values:
○ ranked_docs: A list of ranked documents in descending order of relevance
to the query.
○ contributing_words: A dictionary that stores the words that contributed to
the relevance of each document during the scoring process.
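As a worked example of the normalization: if two documents scored 3.0 and 4.0, the Euclidean norm would be sqrt(3.0² + 4.0²) = 5.0, and the normalized scores would be 0.6 and 0.8, preserving their ranking while bounding their magnitude.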
dir_path = os.path.dirname(os.path.abspath(__file__))
9. The code at the end sets dir_path to the directory containing the text documents,
determined based on the location of the script (__file__ represents the script's
location).
User input
while True:
    query = input("Enter a search query (or 'exit' to quit): ")  # prompt wording is illustrative
    if query.lower() == 'exit':
        break
    ranked_docs, contributing_words = search(query, dir_path)
Explanation
while True:
1. The program enters an infinite loop, allowing the user to enter search queries
continuously. The user is prompted to input a search query, and the query is stored in
the variable query.
if query.lower() == 'exit':
break
2. The program checks if the user has entered 'exit' (case-insensitive). If 'exit' is
entered, the program breaks out of the loop, effectively ending the search and exiting
the program.
3. If the user enters a search query, the program calls the search function, passing the
query and the directory path (dir_path). The search function returns two values:
ranked_docs, which is a list of ranked documents, and contributing_words,
which is a dictionary containing contributing words for each document.
4. The program then iterates over the ranked_docs list, which contains the ranked
documents. For each document in the list, it prints the document's filename and its
relevance score. It also displays the list of contributing words that helped determine
the document's score.
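A minimal sketch of that reporting loop (the exact print format here is an assumption, not dictated by the code above):
for doc, score in ranked_docs:
    print(f"{doc}: score {score:.4f}")  # filename and normalized relevance score
    print("  contributing words:", contributing_words[doc])  # query terms that matched this document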
Data Flow Diagram
Block Diagram