AlgoDaily - What is Database Sharding? Scaling DBs

In this lesson, we will learn about database sharding. We'll particularly concentrate on the following:

Scaling databases.
Horizontal partitioning, or sharding of databases.
Sharding strategies, with their merits and drawbacks.

In previous tutorials, we've introduced databases and learned about different features of databases. This tutorial will focus on an advanced database topic that helps tremendously with database scalability.

Scalability is generally defined as a system's ability to handle more throughput or work when its available resources are increased. Database scaling is exactly what it sounds like: increasing the capacity of the datastore, through various techniques, so that it can accommodate more data and more queries. One such technique is database sharding, also known as horizontal partitioning.

Scaling Databases

Suppose that you've recently developed an application, say a learning management system, and released it to students at different institutions. Initially, very few institutions used the application, but over time it became popular, and many more institutions registered and started using it.

Since you would be saving the records of all students from each institution, the performance of the application would degrade when a large number of users use it simultaneously. Some transactions may deadlock or even fail, and the application would take longer to respond to user queries. This results in customer dissatisfaction, something startups deeply dislike.

There are several ways to solve this problem, such as optimizing queries or upgrading system hardware. Another common way is to scale the database out by sharding it.

Sharding

Sharding, or horizontal partitioning, is the process of breaking up the large tables in a database into smaller chunks called shards. These shards are spread across multiple servers (or multiple database instances). To keep track of which server holds which data, a shard key (also called a sharding key) is chosen: a value taken from each row that determines which shard the row is stored on.

Each shard will have the same columns and schema as the original table, but the data stored will be different for each created shard.
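
As a minimal sketch of this idea, the snippet below uses two in-memory SQLite databases to stand in for two shard servers. The `students` table, its columns, and the odd/even routing rule in `shard_for` are assumptions made for illustration only; they are not part of the lesson.

```python
import sqlite3

# Hypothetical table for the learning management system example.
SCHEMA = "CREATE TABLE students (id INTEGER PRIMARY KEY, institution TEXT, name TEXT)"

# Two independent in-memory databases stand in for two shard servers.
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for conn in shards:
    conn.execute(SCHEMA)  # every shard gets the identical schema

rows = [
    (1, "North College", "Ada"),
    (2, "South Academy", "Linus"),
    (3, "North College", "Grace"),
    (4, "South Academy", "Alan"),
]

def shard_for(student_id):
    # Placeholder routing rule (an assumption): split rows by id modulo shard count.
    return student_id % len(shards)

for row in rows:
    shards[shard_for(row[0])].execute("INSERT INTO students VALUES (?, ?, ?)", row)

def columns(conn):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [col[1] for col in conn.execute("PRAGMA table_info(students)")]

def ids(conn):
    return [r[0] for r in conn.execute("SELECT id FROM students ORDER BY id")]

shard_columns = [columns(conn) for conn in shards]  # same columns on both shards
shard_ids = [ids(conn) for conn in shards]          # different rows on each shard
```

Both shards report the same column list, while each one holds a different subset of the rows.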

These shards are useful because they make a database that has grown too large faster and easier to manage. By spreading the data across multiple servers, a sharded database can store more information and handle a larger number of queries.

You may have already heard about some popular sharded databases without knowing that they employ the concept. These databases are sometimes also known as distributed databases. MongoDB and Cassandra are examples of databases with built-in sharding. Many popular databases do not shard data out of the box: SQLite is an embedded, single-file database, and Redis runs as a single node unless it is deployed as a Redis Cluster.

When should you shard your database?

Databases should be sharded only when other optimization strategies (such as caching or upgrading to larger servers) fail to improve performance sufficiently, because sharding increases the operational complexity of working with the database. One such situation is when the load from an immense number of users exceeds what a single database node can handle.

Before we move on, let's revise a little about databases.

Try this exercise. Click the correct answer from the options.

What are the four basic database operations?

create, read, update, delete
read, delete, merge, break
read, delete, create, merge

Sharding Strategies

Sharding can be performed in several ways, using several strategies. Let's discuss them below.

Key Based Sharding

To understand key-based sharding, we'll need to understand, or revisit, the concept of a hash function. For this lesson, we can treat the hash function as a black box that maps values: it takes a piece of data as its input and outputs a discrete value corresponding to that input. This output is known as the hash value.

In key-based sharding, the values from one column of a database table are plugged into the hash function, and the output hash value determines which shard the row should go to. More precisely, the hash value is used as the shard ID (in practice it is often first reduced to the number of available shards, for example with a modulo operation), which identifies the shard the data will be stored on.

The values fed into the hash function all come from the same column. They can be thought of as primary keys, establishing a unique identifier for each row in the table. The values in this selected column are known as shard keys. Note that a shard key should be a value that does not change over time. Otherwise, an update could change the hash value, forcing the row to be moved to a different shard, which adds work and room for errors.
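
The mapping from shard key to shard ID described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the shard count `NUM_SHARDS`, the choice of SHA-256 as the hash function, the modulo step, and the sample keys are all hypothetical. A cryptographic digest is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so it would not give a stable placement.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of shards

def shard_id(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard ID in the range [0, num_shards)."""
    # Hash the key to a stable digest, then interpret the first 8 bytes as an integer.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    # Reduce the hash value to one of the available shards.
    return hash_value % num_shards

# Hypothetical shard keys taken from one column of the table.
placement = {key: shard_id(key) for key in ["student-1001", "student-1002", "student-1003"]}
```

Because the function is deterministic, the same shard key always resolves to the same shard, which is why the key should not change after the row is written.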

Range Based Sharding

The range-based approach involves sharding the data according to a specified range of values for an attribute. This strategy is simple to understand and implement.

Consider a database that stores customer information, along with the number of products each customer purchased from the store. Suppose we define two ranges for the number of products purchased: 1-25 and 26-50. This results in two shards, and each row is placed in the shard whose range contains its quantity.
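
A small sketch of this routing, assuming the two ranges are treated as non-overlapping (1 to 25 on the first shard, 26 to 50 on the second). The customer names and quantities are made up for illustration.

```python
from bisect import bisect_left

# Inclusive upper bound of each range: shard 0 holds 1..25, shard 1 holds 26..50.
UPPER_BOUNDS = [25, 50]

def shard_for_quantity(quantity: int) -> int:
    """Return the index of the shard whose range contains the quantity."""
    if quantity < 1 or quantity > UPPER_BOUNDS[-1]:
        raise ValueError(f"quantity {quantity} is outside every defined range")
    # bisect_left finds the first upper bound that is >= quantity.
    return bisect_left(UPPER_BOUNDS, quantity)

# Hypothetical customers and the number of products each one purchased.
customers = {"alice": 3, "bob": 25, "carol": 26, "dave": 50}

shards = {0: [], 1: []}
for name, quantity in customers.items():
    shards[shard_for_quantity(quantity)].append(name)
```

Here `shards[0]` ends up holding alice and bob, and `shards[1]` holds carol and dave. A value outside both ranges raises an error, which reflects the need to decide up front what every possible value maps to.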

Directory-Based Sharding

Directory-based sharding utilizes a lookup table that keeps track of which shard holds what data. In other words, it maps each shard key to the shard that stores the corresponding data.

For example, a column from the original table is selected as the shard key (just as we did for key-based sharding). Each shard key value is then assigned a specific shard ID in the lookup table, which tells which shard holds the rows with that shard key. In this way, all the rows in the original table are divided among the different shards.
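
The lookup table can be pictured as a plain dictionary from shard key to shard ID. The sketch below assumes the institution name is the shard key; the institution names, the shard assignments, and the `insert`/`find` helpers are hypothetical.

```python
# Lookup table (the "directory"): shard key -> shard ID. Assignments are arbitrary.
directory = {
    "north-college": 0,
    "south-academy": 1,
    "east-institute": 0,
}

# Each shard is modeled as a simple list of rows.
shards = {0: [], 1: []}

def insert(row: dict) -> None:
    # Consult the directory to find the shard for this row's shard key.
    shard = directory[row["institution"]]
    shards[shard].append(row)

def find(institution: str) -> list:
    # Only the shard named by the directory needs to be searched.
    shard = directory[institution]
    return [row for row in shards[shard] if row["institution"] == institution]

insert({"institution": "north-college", "student": "Ada"})
insert({"institution": "south-academy", "student": "Linus"})
insert({"institution": "east-institute", "student": "Grace"})

north_students = [row["student"] for row in find("north-college")]
```

Unlike key-based sharding, nothing here is computed from the key itself: placement is whatever the directory says, so every read and write depends on the directory being available and correct.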

Let's see if you understand sharding!

Let's test your knowledge. Fill in the missing part by typing it in.

A hash function is used in ____.

Let's test your knowledge. Is this statement true or false?

The shard key determines which row in the database table goes to which shard.

Which sharding strategy should you use?

By now you might be wondering why multiple strategies are needed for the same problem. Is one of them the best? The truth is that every strategy has its own merits and drawbacks, and one must be chosen carefully depending on the situation you face.

Simplicity

In terms of simplicity, range-based sharding works best. It does not need complicated application code and can be implemented easily, unlike key-based or directory-based sharding.

Dynamic Addition/Removal of Servers

Directory-based sharding is beneficial when servers need to be added or removed dynamically. With range-based sharding, the ranges are specified up front and must be redefined. Key-based sharding is restricted by its hash function: adding or removing a server would require rehashing the keys and moving the affected data, which could result in server downtime. Directory-based sharding provides the most flexibility in this situation, because each key is looked up individually and only the lookup-table entries for the data being moved need to change.
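
To make the cost of rehashing concrete, the sketch below counts how many keys change shards when a fifth server is added. It assumes key-based placement is a plain "hash modulo shard count" rule (an assumption, as the lesson treats the hash function as a black box), and it assumes the directory-based setup chooses to rebalance by reassigning every fifth key to the new server. The keys are synthetic.

```python
import hashlib

def stable_hash(key: str) -> int:
    # Stable integer hash derived from SHA-256 (illustrative choice).
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

keys = [f"student-{i}" for i in range(1000)]

# Key-based sharding: placement is recomputed for every key when the count changes.
before = {key: stable_hash(key) % 4 for key in keys}
after = {key: stable_hash(key) % 5 for key in keys}
moved_by_rehash = sum(before[key] != after[key] for key in keys)

# Directory-based sharding: start from the same placement, then edit only the
# lookup-table entries for the keys we decide to move to the new shard (ID 4).
directory = dict(before)
new_directory = dict(directory)
for key in keys[::5]:
    new_directory[key] = 4
moved_by_directory = sum(directory[key] != new_directory[key] for key in keys)
```

With the modulo rule, most keys land on a different shard after the change, whereas the directory moves exactly the 200 keys that were explicitly reassigned.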

Distributed Data Storage

Key-based sharding is better than the other two strategies when data needs to be distributed without maintaining a mapping, because the hash function distributes data algorithmically. In range-based or directory-based sharding, a mapping to the shards needs to be maintained. Directory-based sharding in particular relies heavily on the lookup table: if it gets corrupted, the shards holding the data can no longer be located reliably.

Even Distribution of Data

Key-based and directory-based sharding have an advantage when data needs to be distributed evenly. Range-based sharding does not guarantee this, because the data may be skewed so that most values fall within the same range. This would result in a disproportionate amount of data and reads for some shards.
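
The skew problem can be illustrated with synthetic data. In the sketch below, 90 of 100 made-up customers bought between 1 and 25 products, and the two ranges are assumed to be 1 to 25 and 26 to 50. The same rows are then placed by hashing the customer ID instead (SHA-256 modulo 2, an illustrative choice).

```python
import hashlib

# Synthetic, deliberately skewed data: (customer_id, products_purchased).
customers = [(f"customer-{i}", 1 + i % 25) for i in range(90)]        # quantities 1..25
customers += [(f"customer-{i}", 26 + i % 25) for i in range(90, 100)]  # quantities 41..50

def range_shard(quantity: int) -> int:
    # Range-based placement: shard 0 for 1..25, shard 1 for 26..50.
    return 0 if quantity <= 25 else 1

def hash_shard(customer_id: str) -> int:
    # Key-based placement on the customer ID.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 2

range_counts = [0, 0]
hash_counts = [0, 0]
for customer_id, quantity in customers:
    range_counts[range_shard(quantity)] += 1
    hash_counts[hash_shard(customer_id)] += 1
```

Range-based placement puts 90 rows on the first shard and 10 on the second, while the hash-based counts come out much closer to an even split.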

Summary

In summary, different sharding strategies suit different conditions, and it is important to choose the strategy according to the situation you face. Choosing whether to shard your database at all is also an important decision, as undoing such a huge change to the database is a tough task. It is best to try other optimization strategies first before sharding a database.

One Pager Cheat Sheet

We will learn about database sharding and its scalability benefits through a focus on sharding strategies, their merits, and their drawbacks.
By scaling your database, for example by sharding it, you can maintain performance and customer satisfaction when your application has a large number of users.
Sharding, or horizontal partitioning, is the process of breaking up large tables into multiple shards on different servers, with a sharding key specifying which data belongs where, providing improved data management and more efficient query handling.
Databases should generally only be sharded if all other methods of optimization, such as caching and upgrading servers, have been exhausted, as it adds significant complexity to perform operations on the database.
The four CRUD operations - Create, Read, Update and Delete - enable users to interact with and modify data stored in a database.
Sharding can be employed using various strategies, depending on the type of data and type of application.
Key-based sharding uses values from a database table as shard keys to plug into a hash function, which outputs a discrete value as the shard ID to determine which shard the data should be stored on.
Range-based sharding divides data into shards based on specified ranges of values for an attribute, providing a simple and effective strategy for data distribution.
Directory-based sharding utilizes a lookup table to map the data with a shard ID that tells which shard holds the data corresponding to the shard key selected from the original table.
A hash function is used in key-based sharding: it converts a shard key into a hash value that determines the shard ID, with no lookup table required.
The shard key, together with the sharding strategy applied to it (for example, a hash function), determines which shard a given row of the database table is stored in.
The most suitable sharding strategy depends on factors such as application requirements and performance needs, so careful consideration is needed when deciding which one to use.
Range-based sharding is the simplest approach to implement compared to the key-based and directory-based alternatives.
Directory-based sharding provides the most flexibility for dynamic addition or removal of servers, since it avoids the rehashing (and possible server downtime) that key-based sharding would need.
Key-based sharding is preferred for Distributed Data Storage, as it does not rely heavily on lookup tables and allows for algorithmic distribution of data.
Key-based and directory-based sharding tend to distribute data more evenly than range-based sharding, which may lead to a disproportionate number of reads on some shards.
Choosing the right sharding strategy according to the situation and carefully assessing the need to shard the database is important for successful optimization.