

Information Retrieval

Assignment 2

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore, Pakistan
Introduction
This program is designed to create an inverted index for a collection of text documents,
enabling users to perform text searches and retrieve documents based on relevance scores.
The program incorporates techniques such as text preprocessing, TF-IDF (Term
Frequency-Inverse Document Frequency) scoring, and document ranking. It allows users to
input search queries and receive a list of documents sorted by their relevance to the query.

Purpose of the Program:


The primary purpose of this program is to demonstrate how to create an inverted index and
use it for text retrieval and ranking. It also showcases the use of TF-IDF for calculating
document relevance. The program is a simple example that can be extended and
customized for larger document and information retrieval systems.

Installation and Setup:


Python: Ensure you have Python installed on your system. This tool is compatible with
Python 3.

NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
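
Note: on first run the program also needs a few NLTK data packages (for tokenization, stopwords, and part-of-speech tagging). If they are not already present on your machine, the following one-time downloads should cover them:

import nltk
nltk.download('punkt')                       # data used by word_tokenize
nltk.download('stopwords')                   # English stopword list
nltk.download('averaged_perceptron_tagger')  # data used by nltk.pos_tag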

Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.

python text_search.py
Explanation and Guide

Imports (Libraries)
import os
import string
import math
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

The "Imports" section includes various Python libraries and modules that are used in the
program. Each import statement serves a specific purpose and contributes to the
functionality of the code. Here's an explanation of each import statement and its role:

import os

● os is a Python module that provides a way to interact with the operating system. In
this program, it is used to manipulate file paths and directories, specifically to access
and process text documents stored in a directory.

import string

● string is a module in Python's standard library that provides a collection of string constants and functions for text manipulation. In this program, it is used to access a set of punctuation characters and unwanted characters to remove from text documents.

import math

● math is a standard Python module that provides mathematical functions and constants. In this program, it is used to perform mathematical operations, such as calculating the Euclidean norm for score normalization.

import nltk

● nltk stands for the Natural Language Toolkit, which is a powerful library for natural
language processing (NLP) and text analysis in Python. It is used extensively in this
program for text preprocessing, tokenization, part-of-speech tagging, and stemming.

from collections import defaultdict


● collections is a Python module that provides specialized container datatypes. In
this program, it imports the defaultdict class, which is used to create dictionaries
with default values for inverted indexing and word counting.

from nltk.corpus import stopwords

● nltk.corpus.stopwords provides a list of common English stopwords. Stopwords are words that are often removed from text data because they are considered non-informative (e.g., "the," "and," "in"). These stopwords are used for filtering out common words during text preprocessing.

from nltk.stem import PorterStemmer

● nltk.stem.PorterStemmer is a stemming algorithm included in the NLTK library. Stemming is the process of reducing words to their root form (e.g., "running" to "run"). The Porter stemmer is used to normalize words in the text for indexing and retrieval.

from nltk.tokenize import word_tokenize

● nltk.tokenize.word_tokenize is a function from NLTK for tokenizing text into words. It breaks text into individual words, making it easier to process and analyze the content.
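
As a quick illustration of what tokenization and part-of-speech tagging produce (a sketch with an example sentence; exact tags can vary slightly between NLTK versions):

words = word_tokenize("Dogs are running in the park")
print(nltk.pos_tag(words))
# A typical result:
# [('Dogs', 'NNS'), ('are', 'VBP'), ('running', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('park', 'NN')]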

Variables

stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
stemmer = PorterStemmer()

stop_words = set(stopwords.words('english'))

● Explanation: The variable stop_words is assigned a set of English stopwords using NLTK's stopwords.words('english'). These stopwords will be used to filter out common words from the text documents being processed. This filtering helps reduce the size of the inverted index and focuses on the content-carrying words.

unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}

● Explanation: This variable unwanted_chars is a set containing characters that are considered unwanted and should be removed from the text before processing. The characters include various forms of quotes, dashes, and ellipses. If additional unwanted characters are identified, they can be added to this set.

stemmer = PorterStemmer()
● Explanation: Here, an instance of the Porter Stemmer is initialized as the variable stemmer. The Porter Stemmer is used to reduce words to their root or base form. In this code, it's employed so that different inflected forms of a word (e.g., "running," "runs," "run") map to the same stem and are therefore treated as the same term during indexing and searching. This is particularly important for improving the accuracy of the inverted index and search results.
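
A small sanity check of these three variables (the stems shown are what the Porter stemmer typically produces):

print(stemmer.stem("running"), stemmer.stem("runs"), stemmer.stem("retrieval"))
# run run retriev
print("the" in stop_words, "index" in stop_words)
# True False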

Functions

def preprocess(document):

def preprocess(document):
    sentence_without_punctuation = "".join([char for char in document
        if char not in string.punctuation and char not in unwanted_chars])
    words = word_tokenize(sentence_without_punctuation)
    tagged_words = nltk.pos_tag(words)
    processed_doc = []
    for word, pos in tagged_words:
        if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG',
                   'VBN', 'VBP'] and word not in stop_words:
            processed_doc.append(stemmer.stem(word))
    print(processed_doc)
    return processed_doc

Explanation

def preprocess(document):
    sentence_without_punctuation = "".join([char for char in document
        if char not in string.punctuation and char not in unwanted_chars])

1. sentence_without_punctuation = "".join([char for char in document if char not in string.punctuation and char not in unwanted_chars]): This line creates a new string, sentence_without_punctuation, by iterating through each character in the document. It checks if the character is not in string.punctuation and not in the unwanted_chars set. If it's not a punctuation character and not an unwanted character, it is included in the new string. This step removes punctuation and specified unwanted characters from the text.

words = word_tokenize(sentence_without_punctuation)

2. words = word_tokenize(sentence_without_punctuation): Here, the cleaned text, stored in sentence_without_punctuation, is tokenized into individual words using the word_tokenize function. This step breaks the text into a list of words, making it easier to work with individual terms.

tagged_words = nltk.pos_tag(words)

3. tagged_words = nltk.pos_tag(words): The code uses NLTK's pos_tag function to tag each word in the words list with its part of speech (POS). This tagging is important for later filtering based on specific POS, such as nouns and verbs.

processed_doc = []

4. processed_doc = []: processed_doc is initialized as an empty list. This list will hold the processed and filtered words from the document.

for word, pos in tagged_words:
    if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG',
               'VBN', 'VBP'] and word not in stop_words:
        processed_doc.append(stemmer.stem(word))

5. This is a loop that iterates through each word in tagged_words along with its
associated POS tag.
6. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG',
'VBN', 'VBP'] and word not in stop_words: This condition checks if the
word's part of speech (pos) belongs to a list of specific POS tags (nouns and verbs)
and if the word is not in the set of stop_words, which are common words often
filtered out in text analysis.
7. If the word meets the conditions, it is stemmed using the stemmer.stem(word)
function, and the stemmed word is appended to the processed_doc list. Stemming
reduces words to their root form, which can help improve text search and retrieval.

print(processed_doc)

return processed_doc

8. print(processed_doc): This line prints the processed_doc, which is a list of processed and filtered words, to the console. This step is for debugging and allows you to see the result of the text preprocessing.
9. return processed_doc: Finally, the function returns the processed_doc list,
which contains the preprocessed and filtered words. This list is ready to be used in
further stages of the program, such as indexing or searching.
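
For illustration, calling preprocess on a short sentence might behave as follows (the exact result depends on the tagger and stemmer versions, so treat it as indicative):

preprocess("The runners were running quickly through the old city")
# The function prints (and returns) something like: ['runner', 'run', 'citi']
# "were" is dropped as a stopword, "quickly" and "old" are dropped by the POS filter,
# and the remaining nouns/verbs are stemmed.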

def create_index(dir_path):
def create_index(dir_path):
    inverted_index = defaultdict(list)
    doc_count = defaultdict(int)
    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            with open(os.path.join(dir_path, filename), 'r',
                      encoding='utf8') as file:
                document = file.read().lower()
                processed_doc = preprocess(document)
                doc_word_count = defaultdict(int)
                for word in processed_doc:
                    doc_word_count[word] += 1
                for word, count in doc_word_count.items():
                    inverted_index[word].append((filename, count))
                    doc_count[word] += 1
    return inverted_index, doc_count


Explanation

inverted_index = defaultdict(list)

1. inverted_index = defaultdict(list): Here, an inverted_index is created as a dictionary where the keys are words (or terms), and the values are lists. The defaultdict ensures that if a key (word) is not already in the dictionary, it will create an empty list as the default value. This structure will be used to build the inverted index where each word is associated with a list of (document, term frequency) pairs.

doc_count = defaultdict(int)

2. doc_count = defaultdict(int): A separate doc_count dictionary is created, also as a defaultdict, to keep track of how many documents each word appears in. The integer value associated with each word in this dictionary represents its document frequency (DF).

for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        with open(os.path.join(dir_path, filename), 'r',
                  encoding='utf8') as file:
            document = file.read().lower()
            processed_doc = preprocess(document)

3. The function begins iterating over each file in the directory specified by dir_path. It
checks if the file ends with '.txt' to ensure it's a text document and then proceeds to
open and read the file.
4. document = file.read().lower(): The content of the file is read and
converted to lowercase. This step ensures that the text is uniform in terms of case,
making it easier to match queries to documents.
5. processed_doc = preprocess(document): The preprocess function is
called to preprocess the document. This step includes removing punctuation,
tokenizing, part-of-speech tagging, filtering, and stemming, as explained earlier.

doc_word_count = defaultdict(int)

6. doc_word_count = defaultdict(int): A new doc_word_count dictionary is created for each document to store the count of each word in that specific document.

for word in processed_doc:
    doc_word_count[word] += 1

7. A loop iterates through each word in the processed_doc, and for each word, it
increments the count in the doc_word_count dictionary. This step counts how
many times each word appears in the document.

for word, count in doc_word_count.items():
    inverted_index[word].append((filename, count))
    doc_count[word] += 1

8. The code then iterates through the doc_word_count dictionary to get each word
and its count within the document.
9. For each word, it appends a tuple (filename, count) to the inverted_index.
This tuple records the document where the word appears and its term frequency
within that document. Simultaneously, the code updates the doc_count to reflect
that this word has appeared in one more document.

return inverted_index, doc_count

10. Finally, the function returns two dictionaries:
○ inverted_index: A dictionary where words are associated with lists of (document, term frequency) pairs, forming the inverted index.
○ doc_count: A dictionary indicating how many documents each word appears in, representing the document frequency (DF).
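
As a usage sketch (the file names and counts below are illustrative, assuming a few .txt files sit next to the script):

dir_path = os.path.dirname(os.path.abspath(__file__))
inverted_index, doc_count = create_index(dir_path)

# Hypothetical contents for the stemmed term 'retriev':
print(inverted_index['retriev'])   # e.g. [('doc1.txt', 3), ('doc4.txt', 1)]
print(doc_count['retriev'])        # e.g. 2  (the term appears in two documents)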

def tf_idf(query, document, inverted_index, doc_count, contributing_words):

def tf_idf(query, document, inverted_index, doc_count,
           contributing_words):
    score = 0
    total_docs = len(os.listdir(dir_path))
    query_words = preprocess(query)
    for word in query_words:
        if word in inverted_index:
            df = doc_count[word]
            idf = math.log(total_docs / df + 1)
            for doc, tf in inverted_index[word]:
                if doc == document:
                    score += tf * idf
                    # Add the word to the list of contributing words for this document
                    contributing_words[doc].append(word)
    return score
Explanation

def tf_idf(query, document, inverted_index, doc_count,
           contributing_words):
    score = 0

1. def tf_idf(query, document, inverted_index, doc_count, contributing_words): This function, named tf_idf, calculates the TF-IDF (Term Frequency-Inverse Document Frequency) score for a given query and document. It takes several parameters: query (the search query), document (the document being scored), inverted_index (the inverted index of the collection of documents), doc_count (a dictionary representing document frequencies), and contributing_words (a dictionary to store words contributing to the score of each document).

total_docs = len(os.listdir(dir_path))

2. total_docs = len(os.listdir(dir_path)): This line calculates the total number of documents in the directory specified by dir_path. It's used to compute the inverse document frequency (IDF) for words in the query.

query_words = preprocess(query)

3. query_words = preprocess(query): The search query is preprocessed using the preprocess function, which prepares the query for matching against the documents in a way similar to how documents were preprocessed.


for word in query_words:
    if word in inverted_index:

4. The function then iterates over each word in the preprocessed query. For each word,
it checks if the word exists in the inverted_index. If the word is not in the index, it
is skipped as it won't contribute to the score.

df = doc_count[word]

5. df = doc_count[word]: Here, the document frequency (DF) of the word is obtained from the doc_count dictionary. The document frequency represents how many documents in the collection contain the word.

idf = math.log(total_docs / df + 1)

6. idf = math.log(total_docs / df + 1): The inverse document frequency (IDF) is calculated using a logarithmic formula. The math.log function computes the natural logarithm of the ratio of the total number of documents to the document frequency, plus one. Note that, by operator precedence, the + 1 is added to the ratio (not to df), so the expression evaluates to log((total_docs / df) + 1); this keeps the logarithm's argument above 1, so the IDF stays positive even for a word that appears in every document. Division by zero is not a concern here, because df is at least 1 for any word found in the index.

for doc, tf in inverted_index[word]:
    if doc == document:
        score += tf * idf
        contributing_words[doc].append(word)

7. The code then iterates through the entries in the inverted_index for the word. For
each entry, which is a tuple (document, term frequency), it checks if the
document matches the one being scored (document). If it matches, the term
frequency (TF) for that word in the current document is multiplied by the IDF, and the
result is added to the score.
8. Additionally, the word is added to the list of contributing words for the document in
the contributing_words dictionary. This list keeps track of which words
contribute to the score for each document.

return score

9. Finally, the function returns the calculated score. This score represents the relevance
of the document to the query based on TF-IDF scoring.
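
To make the scoring concrete, here is a small worked example with made-up numbers:

# Suppose the collection holds 5 files, the query word appears in 2 of them (df = 2),
# and it occurs 3 times in the document being scored (tf = 3).
total_docs, df, tf = 5, 2, 3
idf = math.log(total_docs / df + 1)   # ln(2.5 + 1) = ln(3.5) ≈ 1.2528
print(tf * idf)                       # ≈ 3.7583 is added to this document's score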

def search(query, dir_path):


def search(query, dir_path):
    inverted_index, doc_count = create_index(dir_path)

    scores = {}
    contributing_words = defaultdict(list)
    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            scores[filename] = tf_idf(query, filename, inverted_index,
                                      doc_count, contributing_words)

    # Compute the Euclidean norm for normalization
    norm = math.sqrt(sum(score**2 for score in scores.values()))

    if norm != 0:
        # Normalize the scores
        for filename in scores:
            scores[filename] /= norm

    ranked_docs = sorted(scores.items(), key=lambda x: x[1],
                         reverse=True)

    return ranked_docs, contributing_words

Explanation

inverted_index, doc_count = create_index(dir_path)

1. The search function takes two parameters: query (the search query) and
dir_path (the directory containing the text documents). It begins by calling the
create_index function to create the inverted index and document count data
structures based on the documents in the specified directory.

scores = {}

contributing_words = defaultdict(list)

2. scores is initialized as an empty dictionary, which will store the relevance scores for
each document. contributing_words is also initialized as a dictionary where
each document's list of contributing words will be stored.

for filename in os.listdir(dir_path):
    if filename.endswith('.txt'):
        scores[filename] = tf_idf(query, filename, inverted_index,
                                  doc_count, contributing_words)

3. The code then iterates over each file in the directory. For each file ending with '.txt', it
calculates the TF-IDF score for that document using the tf_idf function. The
relevance score is computed by comparing the query to the document's content. The
resulting score is stored in the scores dictionary, where the keys are the document
filenames, and the values are the computed scores.

norm = math.sqrt(sum(score**2 for score in scores.values()))

4. After calculating the relevance scores for all documents, the code proceeds to
compute the Euclidean norm. The Euclidean norm is a mathematical operation to
calculate the magnitude of a vector. In this case, it is used to measure the magnitude
of the relevance scores. This magnitude is used for normalization.

if norm != 0:

5. The code checks if the calculated norm is not equal to zero to avoid division by zero in the normalization step.

# Normalize the scores
for filename in scores:
    scores[filename] /= norm

6. If the norm is not zero, the code proceeds to normalize the scores by dividing each
score by the calculated norm. Normalization ensures that the scores fall within a
consistent range, making it easier to rank and compare documents.
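
Before moving on to the ranking step, a tiny numeric example of this normalization (scores are made up):

scores = {'a.txt': 3.0, 'b.txt': 4.0, 'c.txt': 0.0}
norm = math.sqrt(sum(s**2 for s in scores.values()))   # sqrt(9 + 16 + 0) = 5.0
normalized = {name: s / norm for name, s in scores.items()}
print(normalized)   # {'a.txt': 0.6, 'b.txt': 0.8, 'c.txt': 0.0}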

ranked_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)

return ranked_docs, contributing_words

7. The normalized scores are sorted in descending order, resulting in a list of ranked
documents. The sorted function is used to sort the scores dictionary items by the
score values (x[1]) in reverse order. This list represents the ranked documents
based on their relevance to the query.
8. Finally, the function returns two values:
○ ranked_docs: A list of ranked documents in descending order of relevance
to the query.
○ contributing_words: A dictionary that stores the words that contributed to
the relevance of each document during the scoring process.

dir_path = os.path.dirname(os.path.abspath(__file__))

9. The code at the end sets dir_path to the directory containing the text documents,
determined based on the location of the script (__file__ represents the script's
location).

User input

while True:
    query = input("Enter a search query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break
    ranked_docs, contributing_words = search(query, dir_path)
    for doc, score in ranked_docs:
        print(f"The document '{doc}' has a relevance score of {score}.")
        print("Contributing words:", contributing_words[doc])

Explanation

while True:
    query = input("Enter a search query (or 'exit' to quit): ")

1. The program enters an infinite loop, allowing the user to enter search queries
continuously. The user is prompted to input a search query, and the query is stored in
the variable query.

if query.lower() == 'exit':
    break

2. The program checks if the user has entered 'exit' (case-insensitive). If 'exit' is
entered, the program breaks out of the loop, effectively ending the search and exiting
the program.

ranked_docs, contributing_words = search(query, dir_path)

3. If the user enters a search query, the program calls the search function, passing the
query and the directory path (dir_path). The search function returns two values:
ranked_docs, which is a list of ranked documents, and contributing_words,
which is a dictionary containing contributing words for each document.

for doc, score in ranked_docs:
    print(f"The document '{doc}' has a relevance score of {score}.")
    print("Contributing words:", contributing_words[doc])

4. The program then iterates over the ranked_docs list, which contains the ranked
documents. For each document in the list, it prints the document's filename and its
relevance score. It also displays the list of contributing words that helped determine
the document's score.
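
Putting it all together, an interactive session might look roughly like this (file names, scores, and stems are purely illustrative):

Enter a search query (or 'exit' to quit): information retrieval
The document 'doc2.txt' has a relevance score of 0.8.
Contributing words: ['inform', 'retriev']
The document 'doc1.txt' has a relevance score of 0.6.
Contributing words: ['retriev']
The document 'doc3.txt' has a relevance score of 0.0.
Contributing words: []
Enter a search query (or 'exit' to quit): exit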
Data Flow Diagram
Block Diagram
