
Sharding

Develop Intuition

Let’s say you are going on a camping trip with your friends, and you are responsible for packing 4 fruits into 4 boxes (each box has 4 quadrants). You need to ensure that even if some of the boxes are somehow lost on the trip, everyone still gets to eat all the fruits. You do this by packing one of each of the 4 fruits into a separate quadrant of every box. The same analogy can be used to understand sharding. Each of those fruits can be compared to a file and the boxes to servers. Just as the fruits are distributed across boxes, files can be spread across servers.

Additionally, there is always a chance that some of the boxes may get lost on the trip, but you still want your friends to eat all the fruits. Similarly, even if some of the servers die, we want to prevent data loss. This is the benefit of sharding, which is further enhanced by replication; we discuss this in the latter part of this article.

Fig 0: Fruit Analogy for sharded vs un-sharded data

Introduction

In recent decades, we have seen exponential growth in the amount of data to be stored per server on the internet. Sharding is a mechanism for splitting a large amount of data into several smaller datasets. In a distributed data-storage system, sharding distributes the data across multiple servers, as a result of which each server acts as the source for only a subset of the data. Sharding is often referred to as horizontal partitioning because rows of the same database table are split across multiple database nodes.

A shard is identified using a primary key, which comprises a partition key. The partition key determines the database node where a record will be stored. Often, records sharing the same partition key are stored on the same database node. This has a significant impact on the time taken to fetch records while querying these shards. Sharding is used extensively in applications such as distributed file systems (e.g., Amazon S3), where a single file is sharded into chunks and spread across multiple servers. In this article, we will focus on the applications of sharding in distributed data-stores.

Structuring Shards

A shard is typically a collection of attributes, also known as an aggregate, which determines its size and boundary. It’s imperative to think through the decisions taken to determine the shard boundary. There are three significant factors which should be taken into account while structuring shards.

  • Cross-Partition Operations: These are operations that span different partition keys on separate database nodes. They result in additional network calls, which directly impacts the latency of such operations. Ideally, shards should be structured in such a way that such operations are avoided.
  • Hot Partitions: Hot partitions occur when the majority of the traffic is served by only a few of the partitions. They are caused by sub-optimal shard boundaries, which result in an uneven distribution of traffic. We have shown hot partitions in Fig 1, where Node1 becomes the hot partition in the cluster due to the unequal traffic distribution.

Fig 1: Node1 becomes a hot partition due to uneven traffic

  • Handling Data Loss: Sharding doesn’t guarantee high availability when used alone. Although the data is on different nodes in a cluster, there is always a possibility that if a node fails, the sharded data on that node may be lost as well. However, by replicating the sharded data on multiple nodes in a cluster, the data-loss issue can be addressed. We cover the usage of sharding with replication later in this article.

Sharding Strategies

Often, choosing the boundary of a shard is a challenging decision that requires developers to think through all possible scenarios. If the shard size is larger than what’s needed, the records sharing the same partition may not fit on the same node. On the other hand, choosing a minimal shard boundary may lead to cross-partition operations and increase complexity as well. We have provided a brief overview of some commonly used sharding strategies below, along with some brain exercises.

  • Algorithmic Sharding: In this approach, the logic for distributing data across a cluster of servers is captured in a sharding function. The choice of an effective sharding function can significantly reduce the amount of data that gets transferred in re-sharding scenarios. Consistent Hashing is one of the commonly used sharding functions. An example of a system using this strategy is Memcached. A minimal sketch of this approach follows Fig 2.

Fig 2: Algorithmic Sharding Strategy
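To make the idea concrete, below is a minimal sketch of an algorithmic sharding function based on consistent hashing. The node names, the MD5 hash, and the absence of virtual nodes are simplifications for illustration; production rings typically place many virtual points per node to even out the key distribution.

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    """Map a key (or node name) to a point on the hash ring. MD5 is illustrative."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hashing ring: a key is owned by the first node found
    clockwise from the key's position, so adding or removing a node only
    re-shards the keys on the neighbouring arc."""

    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, ring_position(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node1", "node2", "node3"])
for key in ["UID001", "UID002", "UID003"]:
    print(key, "->", ring.node_for(key))
```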

  • Dynamic Sharding: This method uses an external locator service that determines the location of entries in the cluster. The locator service can be implemented in several ways; one approach is to shard the traffic based on ranges of partition keys. This strategy is more resilient than algorithmic sharding for non-uniform distributions of data. However, it comes at the cost of the locator service being a single point of failure. This strategy is used in MongoDB, Apache HBase, and HDFS. A minimal sketch of a range-based locator follows Fig 3.

Fig 3: Dynamic Sharding Strategy
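Below is a minimal sketch of a range-based locator service, which is only one of the possible implementations mentioned above. The key ranges and node names are illustrative assumptions; real systems also support splitting and moving ranges between nodes at runtime.

```python
import bisect

class RangeLocator:
    """Minimal locator service for dynamic sharding: each node owns a contiguous
    range of partition keys, identified by an exclusive upper bound. Because the
    mapping is data (not a fixed function), it can be changed when ranges are
    split or moved."""

    def __init__(self, range_map):
        # range_map: list of (upper_bound_exclusive, node), sorted by bound
        self._bounds = [bound for bound, _ in range_map]
        self._nodes = [node for _, node in range_map]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._bounds, key)
        if idx == len(self._nodes):
            raise KeyError(f"no range covers key {key!r}")
        return self._nodes[idx]

# Illustrative ranges: keys below "H" -> node1, "H" up to (excluding) "P" -> node2, the rest -> node3
locator = RangeLocator([("H", "node1"), ("P", "node2"), ("\uffff", "node3")])
print(locator.node_for("Alice"), "|", locator.node_for("Mallory"), "|", locator.node_for("Zoe"))
```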

  • Entity Group Sharding: This strategy is most commonly used in relational databases, where related data entities are stored as an entity group within the same partition. Often, this method is implemented dynamically, given that the size of entity groups can vary to a great extent. In Fig 4, we have shown an order data entity which forms an entity group along with payments and lineItem data entities and is distributed across multiple nodes in a cluster. A minimal sketch of this idea follows Fig 4.

Fig 4: Entity Group Sharding
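The sketch below illustrates the co-location property of entity group sharding using the order example from Fig 4. The entity shapes, the orderId field, and the hash-mod placement are illustrative assumptions, not a specific database’s implementation.

```python
import hashlib

NUM_NODES = 3  # illustrative cluster size

def node_for_group(order_id: str) -> int:
    """Place the whole entity group keyed by the order id on one node."""
    return int(hashlib.md5(order_id.encode()).hexdigest(), 16) % NUM_NODES

def route(entity: dict) -> int:
    # Payments and line items carry the id of the order they belong to,
    # so every entity in the group shares the same partition key.
    return node_for_group(entity["orderId"])

order     = {"type": "order",    "orderId": "O-42", "total": 99.0}
payment   = {"type": "payment",  "orderId": "O-42", "amount": 99.0}
line_item = {"type": "lineItem", "orderId": "O-42", "sku": "SKU-1"}

# All entities of the group land on the same node, so an "order details"
# query never becomes a cross-partition operation.
assert route(order) == route(payment) == route(line_item)
print("entity group O-42 lives on node", route(order))
```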

It should be apparent that the right choice of shard boundaries leads to minimal joins across partitions, which in turn reduces latency. Let’s get a better understanding of this using a brain exercise.

Brain Exercise

Suppose you need to design a data-store for an application which stores the timestamps at which users logged in. This data-store will mostly be queried to fetch the number of times a user (having a unique identifier) logged in during a given time interval. The data will be stored in a key-value database. Which attributes would you choose as the primary key in the database for storing the user information?

We highly recommend that you think about the solution to the brain exercise before reading further.

The primary key for each of the records in the data-store will consist of a partition key (PK) and a range key (RK). This is also called the hierarchical-keys sharding strategy. In such a data model, multiple rows share the same partition key but have different range keys, which makes the model well suited for range-based queries. Amazon DynamoDB provides similar functionality, where the range key is called the sort key.

In Table 1, we have shown what the data model in the key-value data-store will look like. The primary key will comprise the userId as the partition key and the timestamp at which the user logged in as the range key. Such a data model enables us to efficiently run queries related to user login behavior, for example, the number of times a user logged in during a given time frame. A minimal sketch of this data model follows Table 1.

UserId (PK)   LoggedInTime (RK)     Username   Email
UID001        05/19/2018 12:54:55   Alice      alice@email.com
UID001        05/19/2018 15:34:45   Alice      alice@email.com
UID002        05/19/2018 09:23:54   Bob        bob@email.com

Table 1: Hierarchical Sharding Strategy
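The sketch below models this hierarchical-key layout in memory: rows for a user (the partition key) are kept sorted by the login timestamp (the range key), so a time-interval count is a pair of binary searches rather than a scan. The class name, the ISO-formatted timestamps, and the in-memory dictionaries are illustrative assumptions; a real key-value store such as DynamoDB exposes the equivalent query through its partition-key plus sort-key API.

```python
import bisect
from collections import defaultdict

class UserLoginStore:
    """Hierarchical-key sketch of Table 1: UserId is the partition key,
    LoggedInTime is the range key, and rows within a partition stay sorted
    by the range key so range queries are cheap."""

    def __init__(self):
        # partition key -> sorted list of (range_key, attributes)
        self._partitions = defaultdict(list)

    def put(self, user_id: str, logged_in_time: str, attrs: dict) -> None:
        bisect.insort(self._partitions[user_id], (logged_in_time, attrs))

    def count_logins(self, user_id: str, start: str, end: str) -> int:
        """Number of logins for user_id with start <= LoggedInTime <= end."""
        keys = [rk for rk, _ in self._partitions[user_id]]
        return bisect.bisect_right(keys, end) - bisect.bisect_left(keys, start)

store = UserLoginStore()
# Timestamps stored in a lexicographically sortable format (illustrative)
store.put("UID001", "2018-05-19T12:54:55", {"Username": "Alice", "Email": "alice@email.com"})
store.put("UID001", "2018-05-19T15:34:45", {"Username": "Alice", "Email": "alice@email.com"})
store.put("UID002", "2018-05-19T09:23:54", {"Username": "Bob", "Email": "bob@email.com"})
print(store.count_logins("UID001", "2018-05-19T00:00:00", "2018-05-19T23:59:59"))  # -> 2
```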

Sharding with Replication

We need an effective replication mechanism to prevent the loss of sharded data. There are two major replication schemes used along with sharding: master-slave replication and peer-to-peer replication.

Master-Slave Replication

This replication strategy shards data across multiple nodes, one of which is the master node while the remaining are secondary nodes (aka slaves). The master node serves as the ground truth, and the data gets replicated to the slave nodes. This is especially effective in read-intensive applications, where read performance can be improved by adding more slaves. There are two significant limitations of this strategy: because updates are propagated asynchronously, slaves can be temporarily inconsistent with the master (and with each other), and updates made to the master that haven’t yet been propagated can be lost if the master fails.

Fig 5: Master Slave Replication: Read Scalability, Single point of failure
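Here is a minimal sketch of the read/write routing described above: writes go only to the master, reads are spread across the slaves, and replication is modelled as an explicit sync step so the lag (and the resulting stale reads) is visible. The class and the round-robin read policy are illustrative assumptions.

```python
import itertools

class MasterSlaveStore:
    """Minimal master-slave replication sketch: all writes go to the master,
    reads are spread over the slaves, and replication happens in a separate
    (in practice, asynchronous) step."""

    def __init__(self, num_slaves: int = 2):
        self._master = {}
        self._slaves = [dict() for _ in range(num_slaves)]
        self._read_cycle = itertools.cycle(range(num_slaves))

    def write(self, key, value):
        self._master[key] = value        # only the master accepts writes

    def replicate(self):
        for slave in self._slaves:       # slaves lag until this sync runs
            slave.update(self._master)

    def read(self, key):
        slave = self._slaves[next(self._read_cycle)]  # round-robin over slaves
        return slave.get(key)            # may be stale if replication lags

store = MasterSlaveStore()
store.write("balance:alice", 100)
print(store.read("balance:alice"))  # None -> the write hasn't been replicated yet
store.replicate()
print(store.read("balance:alice"))  # 100
```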

As Fig 5 shows, the master node acts as a single point of failure. Nevertheless, there are several ways in which we can handle this issue.

Brain Exercise

What approach would you take to solve the problem associated with the master node becoming the single point of failure?

We highly recommend that you think about the solution to the brain exercise before moving forward.

One way of approaching this problem is to promote one of the slaves to be the hot backup of the master node when it dies. However, there is still a possibility of data loss when updates made only to the master haven’t yet been propagated to the hot backup. This can happen when the master node dies before the sync from the master to the slave nodes is complete. This risk is tolerated by the majority of highly-available applications which use master-slave replication.

Peer-to-Peer Replication

In this replication scheme, there is no single point of failure, and writes are accepted by all the nodes. All the nodes have the same share of responsibilities (unlike master-slave), and the loss of any of them does not prevent us from accessing the data. However, the primary issue here is write-write inconsistency, where different nodes can accept conflicting writes for the same record. Such discrepancies are resolved by the nodes syncing with each other to reconcile the conflicting writes.

There are several ways of handling such conflicts; one popular approach is to use version stamps. However, it comes with the trade-off of increased write latency, as extra network calls are needed for the quorum of nodes which decides whether a write was successful. A minimal sketch of quorum-based writes with version stamps follows Fig 6.

Fig 6: Peer to Peer Replication: Read & Write Scalability, Write-write inconsistencies
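The sketch below illustrates peer-to-peer replication with version stamps and quorums. The replica count, the write/read quorum sizes (chosen so that W + R > N and quorums overlap), and the way a version stamp is obtained are illustrative simplifications; real systems typically use coordinator-assigned stamps or vector clocks rather than peeking at every replica.

```python
import random

class Replica:
    def __init__(self):
        self.data = {}  # key -> (version_stamp, value)

class PeerToPeerStore:
    """Any replica can accept a write, but an operation only counts once a
    quorum of replicas took part; readers keep the value with the highest
    version stamp to resolve conflicting writes."""

    def __init__(self, num_replicas=3, write_quorum=2, read_quorum=2):
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.w, self.r = write_quorum, read_quorum

    def write(self, key, value):
        # Simplification: derive the next version stamp by inspecting all
        # replicas; real systems use vector clocks or coordinator stamps.
        version = 1 + max(rep.data.get(key, (0, None))[0] for rep in self.replicas)
        for rep in random.sample(self.replicas, self.w):  # any W replicas accept it
            rep.data[key] = (version, value)
        return version

    def read(self, key):
        candidates = [rep.data.get(key, (0, None))
                      for rep in random.sample(self.replicas, self.r)]
        return max(candidates)[1]  # highest version stamp wins

store = PeerToPeerStore()
store.write("post:1", "hello")
store.write("post:1", "hello, world")
print(store.read("post:1"))  # -> "hello, world" because W + R > N
```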

Brain Exercise

Now that you are aware of both replication schemes, which one seems to be better? Can you think of a more optimal way of solving this problem?

We highly recommend that you think about the solution to the brain exercise before moving forward.

The answer to the first question is that it depends on the use-case. Based on the business requirements, we can make trade-offs between the two replication schemes. Master-slave replication can be used in an application handling financial transactions, where data integrity is of utmost importance and an erroneous credit/debit can have serious consequences. On the other hand, peer-to-peer replication is a better fit for social-network posts, where some delay in data propagation can be tolerated. For the second question, the standard practice is to use a hybrid of the master-slave and peer-to-peer replication schemes by having masters and slaves on separate fleets along with some replication factor.
