Implementing an inverted index is a key step in building our search engine, providing quicker access to documents based on search terms. Think of the inverted index as a dictionary where each word is associated with a list of documents in which the word appears, much like the index at the back of a finance book might list page numbers for each relevant term.
The finance book analogy is worth extending. Imagine a finance book covering several topics, including 'stock markets', 'AI in finance', and 'finance management'. In its index, each term (word) is mapped to the pages (documents) on which it appears. This is exactly what we want to achieve in our IRS (Information Retrieval System), with search terms in place of index entries and documents in place of pages.
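To make the analogy concrete, here is a minimal sketch of a book index as a Python dictionary. The name book_index is hypothetical and the page numbers are invented purely for illustration.

```python
# Hypothetical back-of-book index: term -> pages where the term appears.
# The page numbers are invented purely for illustration.
book_index = {
    'stock markets': [12, 47, 103],
    'AI in finance': [47, 88],
    'finance management': [5, 12],
}

# Looking up a term returns its pages directly, without scanning the book.
print(book_index['AI in finance'])  # [47, 88]

# A term that is not in the index falls back to an empty list.
print(book_index.get('derivatives', []))  # []
```

An inverted index has the same shape, with docIDs in place of page numbers.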
Each unique word from the documents in our store serves as a key, and the value is a list of the docIDs where the word appears. Suppose the word 'Python' appears in documents 'doc1' and 'doc3'. In our inverted index, 'Python' would map to ['doc1', 'doc3']. The Python code snippet provided implements this logic:
- It first creates a dictionary, inverted_index, to hold our inverted index.
- It then iterates over each document in our document store. For each document, it further iterates over each word in the document.
- For each word, if the word doesn't exist in our inverted index, it adds an entry with the word as the key and a list containing the current docID as the value.
- If the word already exists in our inverted index, it appends the current docID to the existing list.
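Finally, it prints out all entries in our inverted index. The result is a dictionary where each word is mapped to a list of docIDs, indicating the documents in which the word appears. Note that doc.split() splits on whitespace only, so tokens keep their case and any attached punctuation: 'data.' and 'programming.' are indexed with their trailing periods.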
if __name__ == '__main__':
    # Toy document store: docID -> document text
    document_store = {
        'doc1': 'Python is widely used in AI and big data.',
        'doc2': 'Python supports object-oriented programming.',
        'doc3': 'Python also allows procedural-style programming.'
    }

    # Maps each word to the list of docIDs in which it appears
    inverted_index = {}
    for docID, doc in document_store.items():
        # split() breaks on whitespace only, so punctuation stays attached
        for word in doc.split():
            if word not in inverted_index:
                inverted_index[word] = [docID]
            else:
                inverted_index[word].append(docID)

    # Print every word together with its list of docIDs
    for word, docIDs in inverted_index.items():
        print(f'Word: {word}, appears in: {docIDs}')

    print('Inverted Index implemented')
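Because the snippet above splits on whitespace and stores tokens as they are, 'Python' and 'python' would be separate keys, and a word repeated within one document would append the same docID twice. One possible refinement is sketched below. It is only an illustration under stated assumptions: tokenize, build_inverted_index, and search are hypothetical helper names that are not part of the original snippet, and the tokenization rule (lowercase runs of letters and digits, with internal hyphens kept) is just one reasonable choice.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation.

    Assumed rule: runs of letters/digits, with internal hyphens kept,
    so 'object-oriented' stays a single token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_inverted_index(document_store):
    """Map each normalized token to a sorted list of docIDs containing it."""
    postings = defaultdict(set)  # a set avoids duplicate docIDs per token
    for doc_id, text in document_store.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in postings.items()}


def search(index, term):
    """Return the docIDs for a single search term (case-insensitive)."""
    return index.get(term.lower(), [])


document_store = {
    'doc1': 'Python is widely used in AI and big data.',
    'doc2': 'Python supports object-oriented programming.',
    'doc3': 'Python also allows procedural-style programming.',
}

index = build_inverted_index(document_store)
print(search(index, 'Programming'))  # ['doc2', 'doc3']
print(search(index, 'data'))         # ['doc1']
print(search(index, 'Java'))         # []
```

Using a set per token keeps each docID at most once, and sorting the final lists makes the output deterministic.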