AlgoDaily - Understanding Underlying Mechanisms

Home > Build Datastores From Scratch > Build Datastores From Scratch > Understanding Underlying Mechanisms

As a senior engineer, you are aware that data structures are the fundamental building blocks of computer science. Especially, when we talk about building datastores from scratch, understanding how data structures work could not be more crucial. Here we'll be discussing simple built-in data structures in Python including lists (arrays), dictionaries (hashmaps), and sets.

We initialize a List arr, a Dictionary d corresponding to stock symbols to company names, and a Set s. The thing to take note of at this stage is that lists (a type of sequence) hold the order of insertion, dictionaries and sets (types of collection) do not.

Adding elements to these data structures, arr.append(6) adds the integer ‘6’ to the end of our list. When we add a key-value pair to our dictionary d['AMZN'] = 'Amazon', the new entry gets placed arbitrarily within the dictionary because, unlike lists, dictionaries are unordered which means they do not record element position. Sets function similarly to dictionaries and do not maintain element order, hence adding 'AMZN' to our set s places it arbitrarily.

This might seem simple, but this is the cornerstone of understanding how to implement more complex structures that underpin datastores such as trees, graphs, and different kinds of indexes.

xxxxxxxxxx
 
if __name__ == '__main__':
    # Basic Python data structures
    # The list / Array
    arr = [1, 2, 3, 4, 5]
    print(f'Original array: {arr}')
​
    # The dictionary / Hashmap
    d = {'AAPL': 'Apple', 'MSFT':'Microsoft', 'GOOGL':'Google'}
    print(f'Original dictionary: {d}')
​
    # Sets
    s = {'AAPL', 'MSFT', 'GOOGL'}
    print(f'Original set: {s}')
​
    # Adding elements to all the above data-structures
    arr.append(6)
    d['AMZN'] = 'Amazon'
    s.add('AMZN')
​
    print(f'Array after adding element: {arr}')
    print(f'Dictionary after adding element: {d}')
    print(f'Set after adding element: {s}')

Try this exercise. Is this statement true or false?

In Python, lists maintain the order of element insertion, but dictionaries and sets do not.

Press true if you believe the statement is correct, or false otherwise.

As highly competent engineers working in technology and finance, we often come across tools like Redis, a simple yet powerful in-memory key-value store, used for caching, configuration storage, data sharing between threads, and more. Let's try to build our own rudimentary version of a key-value store.

In Python, a dictionary is a built-in data structure that can be used as a key-value store. It uses hashmaps under the hood and allows us to store and retrieve data using unique keys. Think of it as a simple database. You would have seen this being used in programming paradigms like dynamic programming.

To build this key-value store, we're going to use a Python dictionary. We are creating a kvstore, which stands for key-value store, as an empty dictionary - kvstore = {}.

You can add key-value pairs to the kvstore with the syntax kvstore[key] = value. For example, kvstore["AMZN"] = "Amazon" adds the key-value pair "AMZN" : "Amazon" to our kvstore.

Give it a shot below by executing the following python code.

xxxxxxxxxx
 
if __name__ == "__main__":
​
  # Define Key-Value Store
  kvstore = {}
​
  # Add Key-Value Pairs
  kvstore["AMZN"] = "Amazon"
  kvstore["MSFT"] = "Microsoft"
  kvstore["GOOGL"] = "Google"
​
  print(kvstore)

Let's test your knowledge. Is this statement true or false?

In Python, we can only retrieve a key-value pair if we know the key.

Press true if you believe the statement is correct, or false otherwise.

In the preceding lesson, we built a simple key-value store. The keys were mapped to single values. Document stores like MongoDB, on the other hand, allow us to store more complex information by allowing the keys to be mapped to documents, a structured data model.

Documents can include a variety of data formats such as BSON (Binary JSON), XML, or JSON. For simplicity, we'll implement a document store that manages JSON-like documents.

Python's built-in data structures offer suitable forms for implementing a document store. For instance, dictionaries can represent JSON documents.

Let's create a simple document store. Here, a key maps to a Python dictionary, that represents a document. Each document includes several different fields capturing more detailed information.

Execute the below Python code that implements a simple document store.

xxxxxxxxxx
 
if __name__ == "__main__":
    # Python logic here
    document_store = {}
    document_store['GOOG'] = {'name': 'Google', 'industry': 'Technology', 'founded': 1998, 'CEO': 'Sundar Pichai'}
    document_store['AAPL'] = {'name': 'Apple', 'industry': 'Technology', 'founded': 1976, 'CEO': 'Tim Cook'}
    print(document_store)

Build your intuition. Fill in the missing part by typing it in.

In Python, dictionaries can represent JSON documents. So, a key in a Document Store can map to a ____, which represents a JSON document.

Write the missing line below.

Relational databases are an essential part of backend development across industries including finance and AI. SQL (Structured Query Language) is a type of database that allows storing, querying, and manipulating data.

Relational databases such as PostgreSQL support complex queries, transactions, and routine administration processes. They are efficient in dealing with structured data and offer advanced querying options with SQL.

Python provides the sqlite3 module that can be used to interact with a SQLite database. SQLite is a self-contained, high-reliability, embedded, full-featured, public-domain SQL database engine.

Let's build a simple relational database with Python's sqlite3 module. In this demo, we'll create an example.db SQLite database, create a stocks table in it, and insert data into this table. Here's the Python script to do this.

xxxxxxxxxx
 
if __name__ == "__main__":
  import sqlite3
  conn = sqlite3.connect('example.db')
  c = conn.cursor()
  c.execute('''CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)''')
  c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
  conn.commit()
  conn.close()

Build your intuition. Click the correct answer from the options.

Which of the following Python libraries is used to interact with a SQLite database?

Click the option that best answers the question.

MySQLdb
sqlite3
pymongo
psycopg2

As an experienced developer, you would already know that a Full-Text Search (FTS) system allows the user to efficiently search text content data for keywords. A notable solution for this purpose is Elasticsearch, a popular open-source FTS engine built on top of the Lucene library.

A full-text search engine works by using an index to quickly look up documents containing search terms. When building a basic full-text search engine from scratch, the essential part is designing a data structure to store the index for quick lookups.

You can use a data structure like a Trie or a Hash Map, depending on the requirements and constraints of your system. For instance, Tries can be a great choice when the number of search terms is large, and we need to perform prefix searches often.

Now, let's look at a Python example where we are using a dictionary (Python's built-in hash map) to create a simple inverted index.

xxxxxxxxxx
 
if __name__ == '__main__':
  # We Initialize an empty dictionary to hold our inverted index
  inverted_index = {}
​
  # Imagine these documents are in our data store
  docs = [
    'Data Science is an exciting field',
    'AI and Machine Learning are the buzzwords in tech',
    'Python is commonly used in Machine Learning'
  ]
​
  # We loop through the documents and index the words
  for index, doc in enumerate(docs):
    for word in doc.split():
      word = word.lower()
      if word in inverted_index:
        inverted_index[word].add(index)
      else:
        inverted_index[word] = {index}
​
  # Let's search for a keyword
  keyword = 'machine'.lower()
  if keyword in inverted_index:
    print('Keyword found in documents:', inverted_index[keyword])
  else:
    print('No document contains the keyword')

Are you sure you're getting this? Fill in the missing part by typing it in.

A full-text search engine works by using an ____ to quickly look up documents containing search terms.

Write the missing line below.

Achieving Feature Parity

Often, when designing and implementing data stores from scratch, we aim to achieve a 'feature parity' with existing, more complex systems like PostgreSQL or MongoDB. This means that we want our data store to have the same functionality and features as the professional versions.

Achieving feature parity doesn't mean we have to implement every single function of these elaborate data stores. Instead, we focus on the most important features that define the functionality of the data store.

For instance, MongoDB is a document-oriented database system, meaning that it stores data as 'documents' in a semi-structured manner. If we were to build a simplified version of MongoDB from scratch, we'd want to ensure our system can store data as documents as well.

In Python, a simple way to implement a document store is by using dictionaries. The code example here demonstrates how we can create a very basic document store using Python's built-in defaultdict function. Each document is represented by another dictionary, with key-value pairs representing the fields of the document.

By reaching a feature parity, we ensure our system can be utilized in similar ways as the more established versions. However, the goal isn't to just clone these data stores, but to understand the core principles, mechanisms, and algorithms that allow them to function as they do.

xxxxxxxxxx
 
if __name__ == '__main__':
    from collections import defaultdict
​
    document_store = defaultdict(dict)
​
    document_store['doc1'] = {'title': 'Data Science', 'content': 'Data Science is cool'}
    document_store['doc2'] = {'title': 'AI in finance', 'content': 'AI is transforming the financial industry'}
​
    print(document_store['doc1'])
    print(document_store['doc2'])

Build your intuition. Is this statement true or false?

When achieving feature parity in designing data stores, it's necessary to implement every single feature of a professional system like PostgreSQL or MongoDB.

Press true if you believe the statement is correct, or false otherwise.

Congratulations on completing the journey through the underlying mechanisms of various datastores! We started with a look at data structure basics, then used our knowledge to recreate simple versions of popular datastores like key-value stores, document stores, relational databases, and full-text search engines.

Whether it was running loops similar to a trading AI to maintain a key-value store, or structuring a 'database' to handle player statistics in a sports video game similar to a relational database, it was about understanding how these complex systems use fundamental structures behind the scenes.

A clear understanding of these principles not only enables us to use these datastores more effectively in our projects, but also makes us better programmers. We close understanding that achieving 'feature parity' is not just about cloning their features, but about grasping the core principles and algorithms that power them.

Remember, in this journey, the destination was not the goal. The goal was the journey itself, understanding the mechanisms behind the technology that drives today's data-driven world. From finance to AI, these principles hold the key.

We hope that this exploration was enlightening and bolstered your confidence in managing data. Remember, the wisdom in computer science doesn't just lie in mastering advanced algorithms, but in knowing how to use simple ones creatively.

xxxxxxxxxx
 
if __name__ == '__main__':
    print("Congratulations on completing the 'Building Datastores from Scratch' tutorial!")

Let's test your knowledge. Click the correct answer from the options.

What is the ultimate goal of building datastores from scratch?

Click the option that best answers the question.

To have your own custom datastore
To better understand the underlying mechanisms of the datastore
For commercial use
To impress your peers

Generating complete for this lesson!

Try this exercise. Is this statement true or false?

Let's test your knowledge. Is this statement true or false?

Build your intuition. Fill in the missing part by typing it in.

Build your intuition. Click the correct answer from the options.

Click the option that best answers the question.

Are you sure you're getting this? Fill in the missing part by typing it in.

Achieving Feature Parity

Build your intuition. Is this statement true or false?

Let's test your knowledge. Click the correct answer from the options.

Click the option that best answers the question.

Programming Categories

Popular Lessons