
Distributed System


Develop Intuition

In a pizza outlet, we often observe separate stations for the various tasks involved in preparing a pizza: getting the pizza base, adding the toppings, adding cheese, and cutting slices. In this example, the pizza outlet can be considered a distributed system consisting of separate sub-systems (e.g., getPizzaBase(), addToppings(), etc.), as shown in the image below.

Fig 0: Pizza outlet as a distributed system

A distributed system can be defined as a level of abstraction in which multiple nodes (hardware or software components) coordinate their actions by communicating with each other over a network while still acting as a single unit. We have listed below definitions of a distributed system from a few well-known computer scientists.

Leslie Lamport’s definition: ‘A distributed system is one that prevents you from working because of the failure of a machine that you had never heard of.’

Andrew Tanenbaum’s definition: ‘A collection of independent computers that appears to its users as one computer.’

These definitions suggest that a vast range of applications — from databases like Cassandra to streaming platforms like Kafka, and the entire ecosystems built around them — are different flavors of distributed systems. Nevertheless, it’s essential to understand how distributed systems evolved to their current state.

Evolution of Distributed Systems

Over the last few decades, the need to make services available to a wide range of people in real time drove the growth of distributed systems. The evolution focused on the availability, reliability, and reusability of the system, and also led to the growth of standard communication interfaces between systems. In Fig. 1, we show how the distributed systems paradigm evolved based on the need of the hour.

Fig 1: Distributed Systems Taxonomy

The client-server model has been around since the 1960s and 1970s, when computer scientists building ARPANET (at the Stanford Research Institute) used the terms server-host (or serving host) and user-host (or using host). Nevertheless, in the 1990s, other paradigms such as Mobile Agents technology came into the picture, geared towards overcoming slow networks by enabling software and data to migrate from one computer to another autonomously. Due to the growth of fast, widely available network connections and lightweight communication protocols (APIs), the client-server model gained more attention than its contemporaries. After that, Service-Oriented Architecture (SOA) came into the picture to provide integration points between different organizations, relying on remote procedure calls such as the Simple Object Access Protocol (SOAP). However, the need for more lightweight and modular systems led to the growth of Microservices.

Common Applications of Distributed Systems

The usage of distributed systems is widely prevalent in day-to-day applications. Some of these applications are listed below.


Microservices

It’s an architectural pattern where an application is structured as a collection of small autonomous services modeled around a business domain, each providing a specific business capability. For instance, designing a streaming platform such as Spotify (see the talk by their VP of Engineering if you’re curious) may involve creating separate web services for different sub-domains of the business: user-identity management, user playback history, audio content management, and so forth.

Distributed Caching

With the growth of the internet, it was no longer possible to cache information for quick look-up on a single host. Distributed systems such as Distributed Hash Tables (DHTs) solve this problem by using efficient hashing mechanisms (e.g., Consistent Hashing) and communication protocols (e.g., Chord).
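To build intuition for consistent hashing, here is a minimal sketch of a hash ring in Python. The class name, the node names, and the use of MD5 are illustrative assumptions; production DHTs add virtual-node tuning and replication on top of this idea.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Toy consistent-hash ring: maps keys to nodes so that adding or
    removing a node only remaps the keys that fall in its arc."""

    def __init__(self, nodes, vnodes=3):
        # Each physical node gets `vnodes` points on the ring to
        # spread keys more evenly.
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first ring point at or after the key's hash.
        h = self._hash(key)
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))  # always the same node for the same key
```

Because lookups are deterministic, every client that agrees on the node list routes a given key to the same cache host without any coordination.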

Distributed Data Storage

The growth of NoSQL over the past decade led to innovation in the field of distributed data storage systems, which can run over large clusters of commodity hardware. These systems cater to a range of query patterns: from simple key-value data stores like Amazon DynamoDB and graph-based stores like Neo4j to time-series databases such as InfluxDB.

Caveats of Distributed Systems

There’s no doubt that building distributed systems is more complicated than building centralized systems. However, with the increasing demand for computing and data-storage power, more complex systems had to be developed under the purview of distributed systems. Let’s try to understand the complexities involved in a distributed system by doing a quick brain exercise.

Brain Exercise

There is a contest to vote for your favorite TV series. Ten popular TV series (Friends, GOT, The Big Bang Theory, Breaking Bad and so forth) are part of this contest. The contest will last for a few hours. We need to build an application to support this voting contest. The major requirements which we need to support in this application are:

  1. Users should be able to use mobile and desktop platforms to cast their vote.
  2. Administrators of the voting application should be able to access the current votes count.
  3. Since the voting will last for only a couple of hours, we expect the frequency of votes (transactions per second, TPS) to be significantly high (i.e., tens of thousands).

We recommend that you think about a solution to this problem before reading further.

Naive Approach

One possible approach to the brain exercise is to build an application where each server/host maintains its own count of votes. The total vote count is then obtained by fetching the counts from the individual servers and summing them up. However, a major issue with this approach is that if a host goes down, the votes stored on that particular node are lost as well.
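The flaw in the naive approach is easy to demonstrate with a sketch. The classes and the round-robin distribution below are hypothetical stand-ins, not part of the actual design:

```python
class VoteServer:
    """Naive design: each server keeps vote counts only in local memory."""
    def __init__(self):
        self.counts = {}
        self.alive = True

    def cast_vote(self, series):
        self.counts[series] = self.counts.get(series, 0) + 1

def total_votes(servers, series):
    # Votes held by a crashed server simply vanish from the total.
    return sum(s.counts.get(series, 0) for s in servers if s.alive)

servers = [VoteServer() for _ in range(3)]
for i in range(99):
    servers[i % 3].cast_vote("Friends")  # round-robin load balancing

assert total_votes(servers, "Friends") == 99
servers[0].alive = False  # simulate a host failure
print(total_votes(servers, "Friends"))  # 66: server 0's 33 votes are gone
```

Persisting counts on the serving hosts ties data durability to host uptime, which is exactly what the optimal approach below avoids.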

Optimal Approach

We will cover the detailed design of this system in one of our later chapters; here we give a high-level overview to explain the complexities of distributed systems using an example. The voting application can be broken down into two separate components: the online component (aka user plane or data plane) and the offline component (aka control plane). The online part handles interfacing with voters in real time. For example, users can cast their vote and get a confirmation once the vote is securely persisted in the database. In contrast, the control plane supervises the background operations of the system (e.g., tallying up the votes).

  • Online Component/Data Plane

In Fig 2, we have shown the high-level design of the data plane of this voting application. The voters will use mobile and desktop platforms to cast their vote by issuing HTTP requests. Those requests are forwarded to load balancers using DNS lookups. The load balancer uses scheduling algorithms to direct the traffic to application servers, which save the votes in the data store.

Fig 2: Voting Application: Online component

  • Data Model

We may use any aggregate-oriented NoSQL data store (a key-value store such as Riak or Amazon DynamoDB), as they are cheaper and more scalable than traditional relational databases. One way of modeling the data is to keep a separate vote count for each TV series (using the TV series name as the key), updated whenever a vote is cast. However, a skewed distribution of votes (as shown in Fig 3) can lead to a scenario where the disk partition storing the TV series with the most votes gets overwhelmed by writes.

Fig 3: Skewed distribution of cast votes

The skewed-distribution issue can be solved using write sharding: attaching a random number to the TV series name, as shown in Fig 4 below. With this approach, the writes for a particular TV series get distributed across multiple shards (more details in upcoming chapters).

Fig 4: Write sharding to solve skewed distribution problem
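The write-sharding idea can be sketched in a few lines. The in-memory dictionary is a stand-in for the key-value data store, and the shard count and the `series.N` key format are illustrative assumptions:

```python
import random
from collections import defaultdict

NUM_SHARDS = 10
store = defaultdict(int)  # stand-in for the key-value data store

def cast_vote(series):
    # Spread writes for a hot key across NUM_SHARDS sharded keys,
    # e.g. "GOT.0" .. "GOT.9", so no single partition absorbs all writes.
    shard_key = f"{series}.{random.randrange(NUM_SHARDS)}"
    store[shard_key] += 1

def read_votes(series):
    # The trade-off: reads must now gather and sum all shards of the key.
    return sum(store[f"{series}.{i}"] for i in range(NUM_SHARDS))

for _ in range(1000):
    cast_vote("GOT")
print(read_votes("GOT"))  # 1000
```

Note the design trade-off: write sharding multiplies write throughput for hot keys at the cost of a scatter-gather read, which is why the tallying is pushed to the offline component below.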

  • Offline Component

In the offline component, we can use a cron job that triggers a map-reduce operation, which scans the sharded records, counts the votes for all TV series, and persists the final counts in a separate data store, as shown in Fig 5. We will discuss map-reduce and its applications in more detail later.

After that, the administrator’s request to fetch the current vote counts is served from the database table populated by the offline operation (the Votes Count table in Fig 5).

Fig 5: Offline Operation to compute votes count
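The offline tally reduces the sharded records back to per-series totals. A minimal single-process sketch of that map-reduce step follows; the function name and the sample records are hypothetical, and a real job would run the map and reduce phases across a cluster:

```python
from collections import Counter

def tally_votes(sharded_records):
    """Map: split each sharded key "series.N" into its series name.
    Reduce: sum the counts per series."""
    totals = Counter()
    for key, count in sharded_records.items():
        series, _, _shard = key.rpartition(".")
        totals[series] += count
    return dict(totals)

# Sharded records as written by the data plane (hypothetical sample).
records = {"GOT.0": 40, "GOT.1": 35, "Friends.0": 20, "Friends.1": 25}
print(tally_votes(records))  # {'GOT': 75, 'Friends': 45}
```

The output of this job is what gets persisted to the separate Votes Count table that serves the administrators' reads.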

Caveats Uncovered

Let’s use the voting application to gain in-depth insights into some of the gotchas that can help software development teams design more reliable, robust, and scalable systems.

  • Business Requirements: System design depends a lot on the business domain, which drives the trade-off between reliability (consistency) and performance (availability). For instance, building a financial system requires data to be consistent even at the cost of reduced performance. In the voting application, however, we may prefer availability over consistency (see the CAP theorem), as it’s more important that users can cast votes, even if administrators get the exact vote counts after a small delay.

  • Data Modeling: The choice of data store and data schema depends a lot on users’ requirements and query patterns. Social media applications can use graph-based data stores like Neo4j, while applications similar to Fitbit, which deal with time-series data, may use databases like InfluxDB. In the voting app above, the data is stored in a key-value data store using the write-sharding mechanism to tackle the problem of skewed vote distribution.

  • Node Failure: Failure of nodes is unavoidable, and developers should take this into account while building a robust system. Under the purview of distributed systems, we employ techniques such as sharding and replication to minimize the impact of such node failures. In the voting application, we account for the fact that the application-server hosts can go down, so the data is persisted in a separate data store rather than on individual servers (unlike the naive approach).

  • Network Failure: Similar to node failure, network failure is hard to prevent, so developers should handle networking errors explicitly. One way of handling such error scenarios is a retry policy, which varies a lot depending on business requirements (e.g., latency and availability SLAs). For instance, the voting application can retry persisting a vote to the data store to cope with transient network failures.

  • Network Security: Often, complacency about network security leaves a system vulnerable to malicious users and programs that keep adapting to existing security measures. Using secure protocols such as HTTPS in the voting application helps prevent malicious actors from intercepting or tampering with votes in transit.

  • Infrastructure Cost: We should be mindful of the infrastructure cost required for running the system. It’s quite common in the industry to see developers blindsided by the cost factor in their designs. The usage of a NoSQL data store in the voting application helps minimize infrastructure cost, as NoSQL data stores run efficiently on large clusters of commodity hardware.

  • Communication Interfaces: It’s imperative to follow best practices while designing APIs for web services using lightweight communication protocols. This allows the creation of secure and extensible integration points for those services. In the voting application, using lightweight REST APIs in the data plane provides ease of access from different platforms such as desktop and mobile.
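The retry policy mentioned under Network Failure above can be sketched with exponential backoff and jitter. The function names, the backoff parameters, and the flaky write stub are illustrative assumptions, not part of the actual application:

```python
import time
import random

def persist_with_retry(write_fn, vote, max_attempts=4, base_delay=0.05):
    """Retry a flaky write with exponential backoff plus jitter.
    How many attempts and how long to wait depend on the latency SLA."""
    for attempt in range(max_attempts):
        try:
            return write_fn(vote)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up; let the caller surface the failure
            # Jitter avoids synchronized retry storms across clients.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Hypothetical flaky data-store write: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_write(vote):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("network blip")
    return "persisted"

print(persist_with_retry(flaky_write, {"series": "GOT"}))  # persisted
```

Note that blindly retrying a write that may have already succeeded can double-count votes, so real systems pair retries with idempotent writes (e.g., a unique vote ID).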


In this chapter, we have given our readers a high-level overview of distributed systems, along with their applications and key lessons from our experience. We have covered the main concepts of distributed systems here and will discuss them in detail in later chapters.