Providing Full-Text Search With the Inverted Index
We've previously discussed the value of full-text search and how it improves the user's experience by returning more nuanced and context-specific results that adhere to the user's search intent. Now let's discuss how we can leverage our inverted index to provide this level of full-text search in our own datastore.
The inverted index lays the groundwork for full-text search as it enables quick access to the occurrence of words in our dataset (in our case, the books in our search engine). Full-text search ramps up the complexity by going beyond exact word matching and enabling search for phrases, handling typos or fuzzy matches, and even understanding synonyms or related terms.
Consider an example where a user is searching for books on 'financial diversification'. If we only consider exact word-for-word matches, we might exclude books that use words like 'wealth allocation', 'investment spread', or 'financial mix' - all of which are related to the concept of financial diversification. Full-text search can understand these nuances and return results that are more informative and pertinent to the user's search query.
Let's implement a basic version of full-text search leveraging our inverted index. We'll use the Porter stemming algorithm, which is a common process in natural language processing that reduces words to their base or root form. This allows our search engine to treat 'diversify', 'diversification', and 'diversified' as the same concept.
xxxxxxxxxx
if __name__ == "__main__":
# Helper functions to tokenize and stem words
def tokenize(text):
return text.split(' ')
def stem(word):
return PorterStemmer().stem(word)
# The search function: tokenizes the search query, applies stemming,
# and queries the inverted index
def search(query, index):
# prepare the query
query = [stem(word) for word in tokenize(query)]
# find the documents containing all query terms
docs = set(index[query[0]])
for term in query[1:]:
docs = docs.intersection(set(index.get(term, [])))
return docs
# let's start a search
print(search('financial diversification', inverted_index))