Introduction

Designing a storage layer is a challenging task. Fortunately, data management is a mature discipline with a variety of solutions available to accommodate various use cases. Unfortunately, no single solution fits all use cases well, so we need to understand why certain solutions are better than others at dealing with particular data management problems.

The two types of storage layers discussed in this tutorial are:

  • Ephemeral: meaning that data will not be available in case of system restart. Data that is stored in an ephemeral layer is the data that an organisation can afford to lose - typically cache for the data stored in a persistent storage layer.
  • Persistent: meaning that the data must always be available or recoverable. All data that has a value for an organisation is stored in a persistent layer.

By the end of this tutorial you will learn about the following:

  • The challenges of storage layer design and how to address them.
  • The types of workloads and the solutions that are arguably a good fit for each.

Challenges

  • Ensuring correct operation - We need to ensure that the storage layer deals correctly with concurrent operations, and takes the necessary steps to ensure that the data is consistent and durable during restart scenarios.
  • Performance - Later in the tutorial you will learn about key performance indicators, at this stage it is enough to understand that storage layer performance depends on the types of data stores that are being used and the configuration that is being applied.
  • Accessing data - Often our use cases would involve requirements for retrieving data that matches specific characteristics. For example: a client would like to know how much time on average customers spend using a platform, hence we need a way to provide such functionality.
  • Dealing with large data volumes - Today businesses deal with hundreds of terabytes of data, which cannot be handled by a single server. Large scale storage systems are available for such use cases, however these systems require compromises. Such data stores use the concept of sharding to distribute data across a number of servers.
  • Ensuring business continuity - What happens to our storage layer if the server that is hosting the primary data store breaks down? What happens if the data centre gets flooded? When we design a storage layer we need to make sure that our data can be accessed in case of disaster.

Prerequisites

Types of data stores

There are 4 types of data stores that can solve pretty much any data management problem. Make sure you understand what they are:

  • In memory key value: Super fast key value lookup and update operations; can scale up to the capacity of RAM on a single server.
  • Relational: Good performance and a genuinely good tool for pretty much anything; can scale up to several terabytes.
  • Large scale (NoSQL) key value: Reasonably fast lookups; can scale up to petabytes of data.
  • Distributed file system: Unstructured data that is cheap to store; searching for an individual row in a distributed file system is like looking for a needle in a haystack.

The list is ordered by the performance characteristics of each type of data store. The list is not complete; other notable data store types include search engines and graph data stores. These are not in our list since they are designed to solve slightly different problems.

Try this exercise. Click the correct answer from the options.

Which data store has the best performance (Lowest latency and highest throughput) for a single key lookup?

Click the option that best answers the question.

  • Large scale NoSQL
  • Relational
  • In memory key value
  • Distributed file system

Transactions

The best way to start looking at this question is through the prism of transactions - atomic operations performed over data. Transactions have the following properties: Atomicity, Consistency, Isolation and Durability (ACID). These properties allow us to reason about the correctness of our storage layer design. Understanding the following is of the highest importance:

  • Atomicity means an all or nothing approach to a transaction. Think of a money transfer - we want this operation to complete in full: Money deducted from the source account and added to the destination account. If this operation is interrupted by a power outage we want this operation to either be completed in full or rolled back.
  • Consistency means that the system goes from one consistent state to another consistent state. An operation should not leave the system in an inconsistent state.
  • Isolation means that transactions that execute in parallel must not interfere with each other.
  • Durability means that once an operation is executed, its effects remain durable across restarts.
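The atomicity property can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table and account names here are hypothetical; the point is only that both updates of the transfer commit together or not at all:

```python
import sqlite3

# In-memory database with two hypothetical accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('source', 100), ('destination', 0)")
conn.commit()

def transfer(amount):
    """Move money between accounts atomically: both updates apply, or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'source'",
                (amount,))
            balance = conn.execute(
                "SELECT balance FROM accounts WHERE name = 'source'").fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'destination'",
                (amount,))
    except ValueError:
        pass  # transaction was rolled back; balances are unchanged

transfer(150)  # fails: rolled back, source stays at 100
transfer(50)   # succeeds: source 50, destination 50
```

The `with conn:` block is what provides the all-or-nothing behaviour: the failed transfer leaves no partial deduction behind.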

Performance indicators

  • Latency: how fast a data store can process a single request.
  • Throughput: How many requests can a data store process per unit of time.

Typically, you will use data store benchmarking software to test the performance of your data store by generating the particular type of workload that you are planning to run. For example, the workload can on average consist of 90% read transactions and 10% write transactions. Low latency and high throughput would both be great, if only they could be achieved at the same time. There is always a compromise between the two: throughput can be increased by sacrificing latency and vice versa. Later in the tutorial we will look at how we can improve both.
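A minimal sketch of how these two indicators can be measured for any request function. The "data store" here is a stand-in in-memory dictionary, not a real benchmark tool:

```python
import time
import statistics

store = {i: str(i) for i in range(1000)}  # stand-in for a real data store

def run_benchmark(request, n=10_000):
    """Issue n requests, recording per-request latency and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for i in range(n):
        t0 = time.perf_counter()
        request(i % len(store))                 # a single "transaction"
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_s": statistics.mean(latencies),   # how fast one request is
        "p99_latency_s": sorted(latencies)[int(n * 0.99)],
        "throughput_rps": n / elapsed,                 # requests per second
    }

result = run_benchmark(store.get)
```

Running the same harness before and after a configuration change is exactly the comparison described in the performance tuning section below.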

The most common use case

A typical system consists of the following 3 tiers: presentation, business logic and storage. Dealing with data presents its own set of non-trivial challenges, which we have discussed above - the storage layer encapsulates all of them. It is expected that you understand the reasons behind the tiering approach at this point.

A relational data store is the go-to solution for the vast majority of use cases and can comfortably support data volumes of up to several terabytes. Relational data stores offer strong ACID guarantees, support for Structured Query Language (SQL) and great performance for certain types of workloads. This architecture addresses the correctness, performance and data access challenges.

Let's test your knowledge. Is this statement true or false?

High latency means low throughput?

Press true if you believe the statement is correct, or false otherwise.

Performance tuning

Now let’s take the next step and configure the system according to the workload that is expected to run. Default settings are okay to get started, however as the workload on the server grows, you will need to adjust settings so the system performs best for your particular workload.

Before making changes it is a good idea to run a few benchmarks to test the performance of the current configuration so it can later be compared with the new configuration. This allows us to verify that the tuning steps have indeed had the expected effect. For the specific settings you will need to refer to the product documentation and follow its guidance on how the data store can be tuned for the best performance under a particular workload.

Ephemeral Storage Layer

Although the basic 3 tier architecture described above addresses the performance challenge, performance is a characteristic that can almost always be improved. Ephemeral storage is used for caching data that is frequently accessed. Source data is always kept in persistent storage such as a relational data store. An ephemeral storage layer needs to support the Atomicity, Consistency and Isolation properties without having to take care of Durability.

Let’s consider the following scenario: We have a relational data store. We notice that certain read-only transactions are executed frequently. These transactions require that the server computes the result over and over, even if the resulting data set has not been changed.

How can we improve this situation? We can deploy a server that stores the result of this query in memory and returns it without the need to run the query again. There are several in-memory data stores designed specifically for such use cases, the most notable being Redis and Memcached. This strategy is often used in modern storage layer design. A well designed cache server integration can result in an order of magnitude performance improvement.

Bear in mind that relational data stores are in general well optimised in how they use caches internally, so using a cache server for single key lookup transactions would not necessarily produce the desired performance improvement. The real goal is to store the results of frequently run read transactions and update the cached versions only when the result set changes.

Caching reduces latency, since the response is already available to be retrieved from the cache server. It also improves throughput, since the data store no longer carries the overhead of computing the query result, which allows it to execute more transactions per unit of time. This is a clear win!

Persistent Storage layer

When designing persistent storage layers we need to pay more attention towards the following problems: dealing with large volumes of data and service reliability.

Suppose the requirement is to accommodate storage for several hundred terabytes. Such volumes of data cannot fit on a single server. Think about the kind of volumes large internet companies are working with. How can we deal with such volumes? Fortunately a number of solutions are already available for us to use.

Two tools help here: distributed file systems and large scale data stores. Sharding is the process of splitting a large data store into smaller portions that can each be hosted on a separate server. Sharding can be vertical or horizontal.

In vertical sharding we move whole tables to separate hosts. This approach is often applied in relational data stores. Key value data stores apply horizontal sharding: a single table is partitioned across the servers in the cluster.
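Horizontal sharding can be sketched as routing each key to a server by hashing. The server names below are hypothetical, and real systems typically use consistent hashing rather than a plain modulo so that adding a server moves less data:

```python
import hashlib

SERVERS = ["kv-node-0", "kv-node-1", "kv-node-2"]  # hypothetical cluster

def shard_for(key):
    """Map a key to one server deterministically, so every client
    computes the same placement without any coordination."""
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]

# Each row of the logical table lands on exactly one server.
placement = {k: shard_for(k) for k in ["user:1", "user:2", "order:99"]}
```

Because placement is a pure function of the key, a single key lookup touches exactly one server, which is what lets these stores scale to petabytes.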

A distributed file system allows us to store any type of data. Internally such systems provide some level of ACID guarantees, however there are caveats depending on the specific distribution you are working with. Typically such storage systems are used to store a mix of structured and unstructured data.

The next question you should be asking is: How can I find the data in the distributed file system quickly? The answer is large scale NoSQL key value data stores. You can think of them as an index on top of the distributed file system.
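The index idea can be sketched as a key value table mapping each record key to the file (and byte offset) holding it in the distributed file system. The file paths below are hypothetical:

```python
# Hypothetical index: record key -> (file path on the DFS, byte offset).
# In practice this table would live in a large scale key value store.
index = {
    "order:1001": ("/warehouse/orders/part-0000.log", 0),
    "order:1002": ("/warehouse/orders/part-0000.log", 512),
    "order:1003": ("/warehouse/orders/part-0001.log", 0),
}

def locate(key):
    """One key value lookup replaces a scan of the whole file system."""
    return index.get(key)  # None means the record does not exist

path, offset = locate("order:1002")
```

Without the index, finding one record means scanning every file; with it, a single lookup tells us exactly which file and offset to read.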

Putting everything together

Overall, a persistent storage layer consists of the following systems: a relational data store for data that requires various ways of querying; a distributed file system for structured and unstructured data that needs to be preserved and occasionally analysed; and a large scale key value data store that can be set up to serve as an index for the data stored in the distributed file system. We can also add a cache layer to accelerate the performance of either data store, applying the principles described in the ephemeral storage layer section.

Large scale data stores seem like a great tool, why can’t we use them for everything? As mentioned at the beginning of this tutorial there is no single solution to fit all use cases. The following are the exact reasons why this would not be a good option:

  • Lack of schema
  • Lack of support for multi row transactions
  • Higher latency with lower throughput compared to a relational database.

The first 2 reasons also apply to in-memory key value stores - quite often you will need multiple ways to query your data, and looking up a value by key will not fit every use case.

Try this exercise. Click the correct answer from the options.

Sharding method applied for key value stores is:

Click the option that best answers the question.

  • Vertical
  • Horizontal

Ensuring business continuity

Servers fail, data centres catch fire, and businesses need to continue their operations. Depending on the specific purpose of the storage layer you are designing, one of the methods described below will need to be applied for disaster recovery.

Backups

Regular backups allow recovery of data. Recovery takes time proportional to the volume of data that needs to be recovered and the storage class used to store the backup. Cloud providers offer a range of storage tier solutions for backups at different price points.

Replication

Replication is the process of incrementally copying content from the primary server to a backup server. This allows an instant failover to the backup server in case of primary server failure. A benefit of the replication approach is that the backup server can be used for read only workloads in certain cases. Replication can be enabled within a single data centre or across separate data centres. In order to maintain reasonable performance across data centres, the closest suitable data centre is normally selected for hosting backup replicas.

Geographical distribution

Geographical distribution is used to solve the latency issue by duplicating data between different geographical regions. This problem is far from trivial, often involving data sets which are completely disjoint from each other. It is also used as part of a business continuity strategy, as we saw above.

Try this exercise. Click the correct answer from the options.

Primary use of replication is:

Click the option that best answers the question.

  • To Improve data store throughput
  • To be able to quickly restore data in case of failure

One Pager Cheat Sheet

  • By understanding the different types of storage layers (Ephemeral & Persistent) and their respective use cases, this tutorial will help you to address the challenges of storage layer design and choose the best solutions for various workloads.
  • Ensuring business continuity and performance, along with accessing large data volumes and ensuring correct operation are all concerns when designing a storage layer.
  • All 4 types of data stores: In Memory Key Value, Relational, Large Scale (NoSQL) Key Value and Distributed File System provide different performance characteristics and have the potential to scale up to different capacities in order to solve any data management problem.
  • In Memory Key Value stores provide the fastest data lookups because the data is held entirely in RAM, giving them the lowest latency of the four data store types.
  • Transactions are atomic, consistent, isolated, and durable operations which ensure that data is always kept in a consistent state across restarts.
  • The performance of a data store is usually measured by benchmarking, by testing its latency (how fast it can process a request) and throughput (how many requests can be processed per unit of time), though it is difficult to have both low latency and high throughput at the same time.
  • The most common use case is to use a relational data store, with strong ACID guarantees, SQL support and performance, to address correctness, performance and data access challenges.
  • High latency and throughput are not necessarily inversely proportional, and the throughput can still be high even with high latency if the number of operations is high enough.
  • Optimize system performance for a particular workload by running benchmarks and adjusting settings according to the product's documentation.
  • A well-implemented Ephemeral Storage Layer can dramatically improve performance by allowing responses to be retrieved from a cache server and reducing latency and increasing throughput.
  • Designing a persistent storage layer requires careful consideration of the different solutions available to serve large volumes of data, such as sharding, distributed file systems, and Large scale NoSQL data stores, while meeting the requirements of ACID, scalability, and low latency.
  • The sharding method applied for key value stores is horizontal, allowing different sub-sets of the data store to be retrieved and managed separately, making them more efficient for data retrieval.
  • Ensuring business continuity by regular backups or through replication across multiple data centers can minimize downtime in the event of disasters.
  • Geographical distribution is used to solve latency issues, as well as form a business continuity strategy using disjoint data sets.
  • Data replication provides a way to restore data quickly in case of system or hardware failures, reducing the risk of data loss by creating multiple copies of the data stored in different locations.