• Separates large databases into smaller, manageable parts called shards
  • Each shard shares the same schema.
  • Actual data on each shard is unique to the shard

Example

The hash function user_id % 4 is used to find the server with the relevant user data

Choosing the right sharding key

  • Sharding key (Partition key) consists of one or more columns that determine how data is distributed. In the example above, user_id is the sharding key.
  • It allows you to determine the correct database for a query
  • Choose a key that can ensure that the data is evenly distributed

Challenges

  • Resharding data:
    • Needed when
      • A single shard can no longer hold more data
      • Certain shards are exhausted due to uneven data distribution
    • Requires updating the sharding function and moving data around to comply with the new function
  • Celebrity problem:
    • Also called the hotspot key problem
    • Excessive access to a specific shard could cause server overload
    • Example: Data for a celebrity such as Lionel Messi is on a particular shard. In case of a social application, that shard will be overwhelmed with read operations
    • Solution:
      • Allocate a shard for each celebrity
      • Increase the number of shards
  • Join and de-normalization:
    • Hard to perform join operations across database shards
    • Common workaround is to de-normalize the database so that queries can be performed in a single table.

Sources