Create Inverted Index for File using Python

  • Difficulty Level : Easy
  • Last Updated : 29 Dec, 2020

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple terms, it is a hashmap-like data structure that takes you from a word to the documents or web pages that contain it.
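To make the definition concrete, here is a minimal sketch of the idea on two tiny documents. The document ids and texts are made up for illustration and are not part of the example file used later in this article.

```python
# Two tiny "documents", identified by hypothetical ids 1 and 2.
documents = {
    1: "the cat sat",
    2: "the dog sat down",
}

# Map each word to the list of document ids that contain it.
inverted_index = {}
for doc_id, text in documents.items():
    # set() ensures a word repeated inside one document is recorded once
    for word in set(text.split()):
        inverted_index.setdefault(word, []).append(doc_id)

# Keep every posting list in ascending order of document id.
for postings in inverted_index.values():
    postings.sort()

print(inverted_index["sat"])  # [1, 2]
print(inverted_index["dog"])  # [2]
```

Looking up a word is then a single dictionary access, which is what makes this structure attractive for search.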

Creating Inverted Index

We will create a word-level inverted index: given a word, it returns the list of lines in which that word appears. The index is a dictionary whose keys are the words present in the file and whose values are lists of the line numbers on which each word occurs. To create the sample file in a Jupyter notebook, use the %%writefile magic command:
 

%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.

This creates a file named file.txt with the content shown above.
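The %%writefile magic only works inside Jupyter/IPython. As a rough equivalent for a plain Python script, the same file could be written as sketched below; the variable name content is an assumption introduced here.

```python
# Plain-Python stand-in for the %%writefile cell, for use outside Jupyter.
content = (
    "This is the first word.\n"
    "This is the second text, Hello! How are you?\n"
    "This is the third, this is it now."
)

# "w" mode creates the file, or overwrites it if it already exists.
with open("file.txt", "w", encoding="utf8") as f:
    f.write(content)
```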
 

Read the file:

Python3




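# open the file and read its whole content into a string
file = open('file.txt', encoding='utf8')
read = file.read()
# move the cursor back to the start so the file can be read again
file.seek(0)
read

# count the lines by counting newline characters
# (the last line has no trailing newline, so start at 1)
line = 1
for char in read:
    if char == '\n':
        line += 1
print("Number of lines in file is:", line)

# store each line as an element of a list
array = []
for i in range(line):
    array.append(file.readline())
# the file is no longer needed once all lines are read
file.close()

array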


Output:

Number of lines in file is: 3
['This is the first word.\n',
'This is the second text, Hello! How are you?\n',
'This is the third, this is it now.']

Functions used:

  • open(): Opens the file and returns a file object.
  • read(): Reads the content of the file and returns it as a string.
  • seek(0): Moves the cursor back to the beginning of the file.
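The interplay of these three functions can be seen in a short, self-contained sketch. It writes its own temporary sample file (the path and the sample text are assumptions for illustration) and uses a with-statement so the file is closed automatically.

```python
import os
import tempfile

# Create a small sample file in a temporary directory.
sample = "first line\nsecond line\nthird line"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf8") as f:
    f.write(sample)

with open(path, encoding="utf8") as f:
    text = f.read()            # reads everything; cursor is now at the end
    at_end = f.read()          # nothing left to read, so this is ''
    f.seek(0)                  # move the cursor back to the beginning
    first_line = f.readline()  # reading works again from the start

print(repr(at_end))      # ''
print(repr(first_line))  # 'first line\n'
```

Without the seek(0) call, readline() would return an empty string because the earlier read() already consumed the file.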

Remove punctuation: 

Python3




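# punctuation characters to strip; the backslash is doubled
# so that it is read as one literal backslash
punc = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
# replace every punctuation character with a space
for ele in punc:
    read = read.replace(ele, " ")

read

# lowercase everything to maintain uniformity
read = read.lower()
read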


Output:

'this is the first word \nthis is the second text  hello  how are you \nthis is the third  this is it now '
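As an alternative to looping over a hand-written character list, the standard library offers string.punctuation and str.translate. The sketch below is one possible variant, not the method used in this article; the sample sentence is taken from the second line of file.txt.

```python
import string

text = "This is the second text, Hello! How are you?"

# Translation table that maps every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Replace the punctuation in one pass, then lowercase the result.
cleaned = text.translate(table).lower()

# split() with no arguments also collapses runs of whitespace.
print(cleaned.split())
# ['this', 'is', 'the', 'second', 'text', 'hello', 'how', 'are', 'you']
```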

Clean data by removing stopwords: 

Stop words are very common words (such as "is", "the" and "this") that carry little meaning on their own and can safely be ignored without sacrificing the meaning of the sentence.
 

Python3




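import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# one-time downloads: the stop-word lists and the tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

# split the cleaned text into word tokens
text_tokens = word_tokenize(read)

# the sample text is English, so load only the English list, and build
# the set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
tokens_without_sw = [
    word for word in text_tokens if word not in stop_words]

print(tokens_without_sw)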


Output: 

['first', 'word', 'second', 'text', 'hello', 'third']
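The filtering step itself does not depend on NLTK. The sketch below shows the same idea with a small hand-picked stop-word set, which is an assumption made for illustration and far shorter than a real stop-word list.

```python
# Hypothetical, hand-picked stop words; a real list is much longer.
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}

text = ("this is the first word this is the second text hello "
        "how are you this is the third this is it now")

# Split on whitespace and keep only the tokens that are not stop words.
tokens = text.split()
filtered = [token for token in tokens if token not in STOP_WORDS]

print(filtered)
# ['first', 'word', 'second', 'text', 'hello', 'third']
```

Storing the stop words in a set keeps each membership test fast, which matters once the text grows beyond a few lines.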

Create an inverted index:
 

Python3




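# avoid the name `dict`, which would shadow the built-in type
inverted_index = {}

for i in range(line):
    # tokenize the line so that whole words are matched, not substrings
    check = word_tokenize(array[i].lower())
    for item in tokens_without_sw:
        if item in check:
            # create the list of line numbers the first time a word is seen
            if item not in inverted_index:
                inverted_index[item] = []
            # record each line number only once per word
            if i + 1 not in inverted_index[item]:
                inverted_index[item].append(i + 1)

inverted_index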


Output: 

{'first': [1],
'word': [1],
'second': [2], 
'text': [2], 
'hello': [2], 
'third': [3]}
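The steps above can also be collected into a single function. The following is a sketch under stated assumptions: the function name build_inverted_index, the regex-based tokenizer and the small STOP_WORDS set are all introduced here for illustration and replace the NLTK-based steps of the article.

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked stop words standing in for the NLTK list.
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}


def build_inverted_index(lines, stop_words):
    """Map each non-stop word to the 1-based numbers of the lines containing it."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # \w+ keeps runs of letters, digits and underscores, dropping punctuation;
        # the set removes repeats so a line number is recorded once per word
        words = set(re.findall(r"\w+", line.lower()))
        for word in sorted(words - stop_words):
            index[word].append(line_no)
    return dict(index)


lines = [
    "This is the first word.",
    "This is the second text, Hello! How are you?",
    "This is the third, this is it now.",
]
index = build_inverted_index(lines, STOP_WORDS)

# Look up the lines containing a word; unknown words give an empty list.
print(index.get("hello", []))    # [2]
print(index.get("missing", []))  # []
```

Because lines are visited in order, each list of line numbers comes out sorted without an extra sorting step.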

