Create Inverted Index for File using Python
Last Updated: 30 Jan, 2023
An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to the documents or web pages that contain it.
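As a minimal illustration of this mapping, the sketch below uses a plain Python dictionary. The words, the page names, and the `lookup` helper are made up for the example.

```python
# Hypothetical data: each word maps to the documents that contain it.
inverted_index = {
    "python": ["page1.html", "page3.html"],
    "index": ["page2.html"],
}


def lookup(word):
    """Return the documents containing word, or an empty list."""
    return inverted_index.get(word.lower(), [])


print(lookup("Python"))
print(lookup("missing"))
```

Looking up a word is a single dictionary access, which is why this structure suits search.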
Creating Inverted Index
We will create a word-level inverted index, that is, one that returns the list of lines in which a word is present. The index is a dictionary whose keys are the words present in the file and whose values are lists of the line numbers on which each word appears. To create a file in a Jupyter notebook, use the magic function:
%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.
This will create a file named file.txt with the content shown above.
To read the file:
Python3
# open the file and read its whole content into a string
file = open('file.txt', encoding='utf8')
read = file.read()
# move the cursor back to the start so the file can be read again
file.seek(0)
read

# count the lines: one more than the number of newline characters
line = 1
for char in read:
    if char == '\n':
        line += 1
print("Number of lines in file is: ", line)

# store each line as an element of a list
array = []
for i in range(line):
    array.append(file.readline())
file.close()
array
Output:
Number of lines in file is: 3
['This is the first word.\n',
'This is the second text, Hello! How are you?\n',
'This is the third, this is it now.']
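Functions used:
- open(): It opens the file and returns a file object.
- read(): It reads the whole content of the file as a single string.
- seek(0): It returns the cursor to the beginning of the file.
- readline(): It reads the next line of the file, including its trailing newline.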
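The same reading step can be sketched more compactly with a `with` block and `splitlines()`. This is an alternative, not the approach used in the article's code. The sample text is written to a temporary directory (an assumption made so the sketch runs on its own without touching your files).

```python
import os
import tempfile

# Recreate the sample file in a temporary directory (illustrative path).
sample = (
    "This is the first word.\n"
    "This is the second text, Hello! How are you?\n"
    "This is the third, this is it now."
)
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w", encoding="utf8") as f:
    f.write(sample)

# The with-statement closes the file automatically when the block ends.
with open(path, encoding="utf8") as f:
    lines = f.read().splitlines()

print("Number of lines in file is:", len(lines))
for number, text in enumerate(lines, start=1):
    print(number, text)
```

Here `splitlines()` drops the trailing newline characters, and `enumerate(..., start=1)` yields the 1-based line numbers an inverted index needs.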
Remove punctuation:
Python3
# raw string so the backslash is taken literally
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
for ele in read:
    if ele in punc:
        read = read.replace(ele, " ")
read

# lowercase everything to maintain uniformity
read = read.lower()
read
Output:
'this is the first word \n
this is the second text hello how are you \n
this is the third this is it now '
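Punctuation can also be replaced in a single pass with `str.translate`. The sketch below is an alternative to the loop above and assumes the standard library's `string.punctuation`, which covers a slightly different character set than the `punc` string used in the article.

```python
import string

text = "This is the second text, Hello! How are you?"

# Map every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = text.translate(table).lower()

# split() with no arguments also collapses the repeated spaces
# left behind where punctuation was followed by a space.
words = cleaned.split()
print(words)
```

Replacing punctuation with a space rather than deleting it keeps words such as "text,Hello" from being merged when the source has no space after the comma.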
Tokenize the data as individual words:
Apply linguistic preprocessing by converting each word in the sentences into a token. Tokenizing the sentences produces the terms used in the upcoming indexing operation.
Python3
def tokenize_words(file_contents):
    """
    Tokenizes the file contents.

    Parameters
    ----------
    file_contents : list
        A list of strings containing the contents of the file.

    Returns
    -------
    list
        A list of lists, one per line, each holding that line's tokens.
    """
    result = []
    for line_text in file_contents:
        # split the line on whitespace
        tokenized = line_text.split()
        result.append(tokenized)
    return result
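A short usage sketch for the function above. The function is repeated in compact form so the snippet runs on its own, and the input lines are sample data taken from the cleaned text.

```python
def tokenize_words(file_contents):
    """Split each line on whitespace; compact form of the function above."""
    return [line.split() for line in file_contents]


sample_lines = [
    "this is the first word",
    "this is the third this is it now",
]
tokens = tokenize_words(sample_lines)
print(tokens)
```

The result is one token list per input line, so the position of each inner list still tells you which line the tokens came from.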
Clean data by removing stopwords:
Stop words are very common words, such as "the", "is" and "are", that carry little meaning on their own and can safely be ignored without sacrificing the meaning of the sentence.
Python3
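import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
# word_tokenize needs the Punkt tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# convert the cleaned text into tokens
text_tokens = word_tokenize(read)

# build the stop word set once instead of on every comparison
stop_words = set(stopwords.words())
tokens_without_sw = [
    word for word in text_tokens if word not in stop_words]
print(tokens_without_sw)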
Output:
['first', 'word', 'second', 'text', 'hello', 'third']
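To see the filtering step without installing NLTK, the sketch below uses a small hand-picked stop word set. The set is an assumption for illustration only; NLTK's lists are much larger.

```python
# Hypothetical, hand-picked stop word set (not NLTK's list).
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}

text = (
    "this is the first word "
    "this is the second text hello how are you "
    "this is the third this is it now"
)

# Keep only the tokens that are not stop words.
tokens = [word for word in text.split() if word not in STOP_WORDS]
print(tokens)
```

Using a `set` makes each membership test a constant-time lookup, which matters once the stop word list grows to hundreds of entries.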
Create an inverted index:
Python3
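# avoid naming the variable dict, which would shadow the built-in type
inverted_index = {}
for i in range(line):
    check = array[i].lower()
    for item in tokens_without_sw:
        # note: this is a substring test, so 'text' also matches 'context'
        if item in check:
            line_numbers = inverted_index.setdefault(item, [])
            # record each line number only once per word
            if i + 1 not in line_numbers:
                line_numbers.append(i + 1)
inverted_index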
Output:
{'first': [1],
'word': [1],
'second': [2],
'text': [2],
'hello': [2],
'third': [3]}
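The steps above can also be combined into one function that matches whole words rather than substrings. This is a sketch under stated assumptions: `build_inverted_index` is a hypothetical helper name, the stop word set is a small hand-picked stand-in for NLTK's list, and tokens are extracted with the regular expression `\w+` instead of `word_tokenize`.

```python
import re
from collections import defaultdict

# Small illustrative stop word set (an assumption, not NLTK's list).
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}


def build_inverted_index(lines, stop_words=STOP_WORDS):
    """Map each non-stop word to the sorted line numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(lines, start=1):
        # \w+ extracts runs of letters, digits and underscores as tokens.
        for token in re.findall(r"\w+", text.lower()):
            if token not in stop_words:
                index[token].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


lines = [
    "This is the first word.",
    "This is the second text, Hello! How are you?",
    "This is the third, this is it now.",
]
index = build_inverted_index(lines)
print(index)
print(index.get("hello", []))
```

Collecting line numbers in a set avoids duplicate entries when a word occurs several times on the same line, and `index.get(word, [])` returns an empty list for words that were never indexed.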