What's a distributed system?
It's not microservices exchanging messages and storing data in a DB
I was talking to a candidate the other day when he asked me the following question:
Candidate: In your job description you mention experience with building distributed systems. What do you mean by a distributed system?
He then quipped that the present-day definition of a distributed system is microservices or applications exchanging messages with each other over gRPC, a RESTful API, or similar mechanisms. He posited that that isn’t a distributed system from his perspective. He’s right, as I’ll elaborate in this post.
Consider this very simple application comprised of two services that rely on some backend store for state persistence. These two services exchange messages, perhaps over gRPC or some other form of communication.
If you have worked on a system like the one depicted above, you did not work on a distributed system - at least not according to our definition, the candidate’s and mine. Yes, the system is distributed in the sense that it is comprised of multiple services coordinating over a network, but that doesn’t meet (my) definition of a distributed system. In fact, the only component in the above diagram that could be described as a distributed system is the database, especially if it can run on multiple servers, which most modern scalable databases and file systems do.
So what then is my definition of a distributed system?
A distributed system is one that runs on multiple computers yet appears as a single coherent system to the end user or client.
In lay terms, it’s a system comprised of many parts that appear as a single unified whole. Distributed systems are common among databases and file systems, and more broadly in storage products.
Distributed systems often exhibit the following characteristics:
Concurrency: Distributed systems allow multiple transactions to run concurrently, either on the same machine or across different machines in the network. This enables the system to perform multiple tasks at the same time, improving efficiency and throughput.
Scalability: They can scale horizontally or vertically to accommodate increased loads. Horizontal scaling involves adding more machines to the pool of resources, while vertical scaling involves adding more power (CPU, RAM) to an existing machine. Distributed systems are designed to easily integrate new resources without significant downtime or performance degradation. Some distributed systems are able to scale out or in (horizontally) without incurring any downtime.
Fault Tolerance: Distributed systems are designed to continue operating despite the failure of one or more servers (nodes). Through redundancy and replication, these systems ensure that data is not lost and services remain available even when parts of the system fail.
Decentralization: Instead of relying on a central control unit, distributed systems manage resources, state and tasks across multiple nodes (i.e., machines), which can lead to improved resilience and scalability.
Distributed systems also share some common problems that they need to solve. Below are just a few - much simplified - examples that highlight the challenges and intricacies of building such systems.
Fault tolerance, recovery & durability
Fault tolerance, recovery, and durability are critical aspects of distributed file systems, ensuring data integrity, availability, and persistence even in the face of failures. These concepts are intertwined, each playing a crucial role in the system's overall resilience and reliability.
Fault tolerance in a distributed file system refers to its ability to continue operating in the event of failures of its components, such as servers, storage devices, or network connections. This is achieved through various mechanisms (two of which are sketched in code after this list):
Replication: Data is copied across multiple nodes or geographical locations. If one node fails, the data can be accessed from another replica.
Erasure coding: A form of data protection in which data is divided into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media. It offers a more storage-efficient way to achieve fault tolerance compared to traditional replication methods.
Automatic failover: In case of a failure, the system automatically switches to a redundant or standby system, network, or component without requiring human intervention.
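To make replication and automatic failover concrete, here is a minimal Go sketch of a store that copies every write to all replicas and, on reads, fails over to the next healthy replica. The replica and replicatedStore types are hypothetical stand-ins for real storage nodes; a production system would also handle partial write failures, timeouts, re-replication, and consistency between replicas.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a hypothetical in-memory stand-in for a single storage node.
type replica struct {
	data map[string]string
	down bool // simulates a node failure
}

func (r *replica) put(key, val string) error {
	if r.down {
		return errors.New("node unreachable")
	}
	r.data[key] = val
	return nil
}

func (r *replica) get(key string) (string, error) {
	if r.down {
		return "", errors.New("node unreachable")
	}
	v, ok := r.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}

// replicatedStore copies every write to all replicas and, on reads,
// automatically fails over to the next replica when one is unreachable.
type replicatedStore struct{ replicas []*replica }

func (s *replicatedStore) put(key, val string) error {
	for _, r := range s.replicas {
		if err := r.put(key, val); err != nil {
			return fmt.Errorf("replication failed: %w", err)
		}
	}
	return nil
}

func (s *replicatedStore) get(key string) (string, error) {
	var lastErr error
	for _, r := range s.replicas {
		v, err := r.get(key)
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("all replicas failed: %w", lastErr)
}

func main() {
	store := &replicatedStore{replicas: []*replica{
		{data: map[string]string{}},
		{data: map[string]string{}},
		{data: map[string]string{}},
	}}
	if err := store.put("config", "v1"); err != nil {
		panic(err)
	}
	store.replicas[0].down = true // one node fails
	v, err := store.get("config") // the read transparently fails over to a healthy replica
	fmt.Println(v, err)           // "v1 <nil>"
}
```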
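And here is a sketch of erasure coding in its simplest possible form: a single XOR parity shard computed over the data shards, which tolerates the loss of any one shard at a storage overhead of (k+1)/k instead of the 2x or 3x cost of full replication. The xorParity function is purely illustrative; real systems use Reed-Solomon style codes that survive multiple simultaneous losses.

```go
package main

import "fmt"

// xorParity computes a single parity shard as the XOR of equally sized shards.
// With k data shards plus this one parity shard, any single lost shard can be
// reconstructed by XORing the survivors together.
func xorParity(shards [][]byte) []byte {
	parity := make([]byte, len(shards[0]))
	for _, s := range shards {
		for i, b := range s {
			parity[i] ^= b
		}
	}
	return parity
}

func main() {
	data := [][]byte{[]byte("dist"), []byte("ribu"), []byte("ted!")} // 3 data shards
	parity := xorParity(data)

	// Simulate losing shard 1: XOR the surviving data shards with the
	// parity shard to reconstruct it.
	recovered := xorParity([][]byte{data[0], data[2], parity})
	fmt.Println(string(recovered)) // "ribu"
}
```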
Recovery mechanisms ensure that a distributed file system can return to a correct state after a failure. This involves restoring data from backups or redundant copies to replace lost or corrupted data. Recovery could also mean restoring the system to a known good state, which may involve replaying log files to reconstruct the state immediately before the failure. Consistency checks, checkpointing, and distributing state across multiple nodes are all common techniques that allow a distributed system to recover from failures.
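Here is a minimal sketch of the checkpoint-plus-replay idea, assuming a hypothetical logEntry and checkpoint representation: on recovery, the node loads the most recent checkpoint and replays only the log entries recorded after it.

```go
package main

import "fmt"

// logEntry is a hypothetical record of a single state change.
type logEntry struct {
	Index int
	Key   string
	Value string
}

// checkpoint is a periodic snapshot of the full state plus the index of
// the last log entry it reflects.
type checkpoint struct {
	LastIndex int
	State     map[string]string
}

// recoverState rebuilds state after a crash: start from the most recent
// checkpoint, then replay only the log entries written after it.
func recoverState(cp checkpoint, log []logEntry) map[string]string {
	state := make(map[string]string, len(cp.State))
	for k, v := range cp.State {
		state[k] = v
	}
	for _, e := range log {
		if e.Index > cp.LastIndex { // entries already covered by the checkpoint are skipped
			state[e.Key] = e.Value
		}
	}
	return state
}

func main() {
	cp := checkpoint{LastIndex: 2, State: map[string]string{"a": "1", "b": "2"}}
	log := []logEntry{
		{Index: 1, Key: "a", Value: "1"},
		{Index: 2, Key: "b", Value: "2"},
		{Index: 3, Key: "b", Value: "3"}, // written after the checkpoint, must be replayed
	}
	fmt.Println(recoverState(cp, log)) // map[a:1 b:3]
}
```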
Durability guarantees that once a write operation has been confirmed to the user, the data will not be lost even in the case of a system crash. In distributed file systems, durability is ensured by the following techniques, the first of which is sketched in code after the list:
Write-ahead logging: Recording changes to data in a log before the changes are actually made. If the system crashes before the changes are applied, the log entries can be used to replay and complete the operations.
Synchronous replication: Ensuring that data is written to multiple nodes before a write operation is considered successful. This can impact performance and cost but increases data durability.
Data versioning: Keeping multiple versions of data objects, which can help in recovering from data corruption or accidental deletion. Snapshots are often used to track versions and support back-in-time capabilities.
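Here is a minimal sketch of the write-ahead logging discipline: the change is appended to the log and forced to stable storage before the in-memory state is mutated and the write is acknowledged. The walStore type and the key=value log format are illustrative assumptions; real WALs use checksummed, binary, segment-based records.

```go
package main

import (
	"fmt"
	"os"
)

// walStore applies a write only after the intent has been durably logged.
type walStore struct {
	wal   *os.File
	state map[string]string
}

// put records the change in the write-ahead log and forces it to stable
// storage (fsync) before mutating in-memory state and acknowledging the
// caller. If we crash after the fsync, replaying the log restores the write.
func (s *walStore) put(key, value string) error {
	if _, err := fmt.Fprintf(s.wal, "%s=%s\n", key, value); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil { // durability point: the entry survives a crash
		return err
	}
	s.state[key] = value
	return nil // only now is the write acknowledged
}

func main() {
	f, err := os.OpenFile("store.wal", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := &walStore{wal: f, state: map[string]string{}}
	if err := s.put("balance:alice", "100"); err != nil {
		panic(err)
	}
	fmt.Println("write acknowledged only after it is durable on disk")
}
```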
In short, fault tolerance, recovery, and durability mechanisms work together to protect against data loss and ensure that the distributed file system remains available and reliable, even in the face of hardware failures, network partitions, or other unforeseen challenges.
The CAP theorem
The CAP theorem is a fundamental principle that applies to distributed systems, especially in the context of distributed databases and file systems. It states that, in the presence of a network partition (P), a distributed system can guarantee only one of consistency or availability; more generally, it can provide only two out of the following three guarantees:
Consistency (C): Every read receives the most recent write or an error. In other words, consistency ensures that all nodes see the same data at the same time. It's equivalent to having a single up-to-date copy of the data.
Availability (A): Every request receives a (non-error) response, without the guarantee that it contains the most recent write. The system remains operational and accessible, even in the face of network partitions or failures.
Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes. Essentially, the system can sustain any amount of network failure that doesn't result in a failure of the entire network or loss of quorum. We’ll talk about quorums in the next section.
In the context of a distributed file system, the CAP theorem manifests in the design and operational trade-offs:
Consistency and Partition Tolerance (CP): If a distributed file system prioritizes consistency and partition tolerance, it might choose to make data unavailable until it can guarantee consistency across partitions. This could mean that if there is a network split, some parts of the file system might not be accessible to ensure that no stale data is read or written.
Availability and Partition Tolerance (AP): If a system prioritizes availability and partition tolerance, it will ensure that the file system remains accessible even if it cannot immediately update all copies of the data across partitions. This could lead to scenarios where different nodes might have different versions of a file until the network partition is resolved.
Consistency and Availability (CA): Without partition tolerance, a system might choose to focus on keeping data consistent and available as long as there is no network failure. However, this scenario is less relevant for distributed systems since network partitioning is a common issue that needs to be addressed.
In practice, most distributed file systems need to manage partition tolerance due to the inherent unreliability of network communications. Thus, the choice often lies between focusing on consistency or availability.
For example, a distributed file system designed for a banking application might prioritize consistency over availability to ensure that all financial transactions are accurately reflected across all nodes, even if it means some data might not be accessible during a partition.
On the other hand, a distributed file system for a social media platform might prioritize availability over consistency to ensure that users can always post and read content, even if some users might see slightly outdated data during a network partition. This is known as eventual consistency.
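Here is a toy sketch of that trade-off, assuming a hypothetical node that knows whether it can currently reach a quorum of its peers: during a partition, a CP-leaning node refuses to answer rather than risk serving stale data, while an AP-leaning node answers from its local, possibly out-of-date copy.

```go
package main

import (
	"errors"
	"fmt"
)

// mode selects which CAP trade-off the node makes when it is partitioned
// away from a majority of its peers.
type mode int

const (
	cp mode = iota // prefer consistency: refuse to answer rather than risk stale data
	ap             // prefer availability: answer from local state, possibly stale
)

// node is a hypothetical replica holding a local copy of the data.
type node struct {
	mode        mode
	local       map[string]string
	hasMajority bool // false while a network partition cuts us off from a quorum
}

func (n *node) get(key string) (string, error) {
	if !n.hasMajority && n.mode == cp {
		// CP choice: during the partition the data is simply unavailable.
		return "", errors.New("unavailable: cannot reach a quorum")
	}
	// AP choice (or a healthy cluster): serve the local copy, which may be
	// behind until the partition heals and replicas reconverge.
	return n.local[key], nil
}

func main() {
	partitioned := node{mode: ap, local: map[string]string{"post:1": "hello"}, hasMajority: false}
	v, err := partitioned.get("post:1")
	fmt.Println(v, err) // "hello <nil>": available, but possibly stale

	partitioned.mode = cp
	_, err = partitioned.get("post:1")
	fmt.Println(err) // "unavailable: cannot reach a quorum"
}
```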
Quorums and consensus
In the context of distributed systems, consensus and quorum are fundamental concepts used to achieve reliability, consistency, and coordination among multiple nodes, especially in the presence of failures or network partitions.
Consensus refers to the process of agreeing on one result among a group of nodes. It's crucial in distributed systems for operations like committing transactions, electing leaders, and synchronizing state across the system. Achieving consensus ensures that despite failures or message losses, the system can agree on a single course of action or a single value, maintaining the integrity and consistency of the system.
Common consensus algorithms include:
Paxos: Introduced by Leslie Lamport, Paxos is known for its correctness proof but is often considered complex to implement. It involves multiple phases (prepare/promise and propose/accept) to ensure that a majority of nodes agree on a proposed value.
Raft: Designed as a more understandable alternative to Paxos, Raft breaks down the consensus process into leader election, log replication, and safety. It ensures that all changes are made in a consistent order across the cluster.
Quorum is a concept related to achieving consensus, referring to the minimum number of nodes that must agree or participate in a decision to ensure that the decision is valid and reliable. It's a way to ensure that any operation or change has the support of a majority of nodes, which is critical for maintaining consistency and avoiding split-brain scenarios, where different parts of the system make conflicting decisions.
The typical formula for a simple majority quorum in a system with N nodes is ⌊N/2⌋ + 1, meaning more than half of the nodes must agree for a decision to be made. This ensures that there is always majority agreement on any decision, preventing conflicting operations from being considered valid by different parts of the system. Establishing a quorum involves several steps (a small vote-counting sketch follows the list):
Defining the quorum size: Determine the minimum number of nodes that must participate in a decision. This is typically more than half of the total number of nodes in the system to ensure a majority.
Node participation: Each node in the system must have a mechanism to participate in decision-making processes, typically through voting on proposals or participating in elections. This can be established via a consensus protocol like Paxos or Raft.
Communication: Nodes communicate their votes or decisions to each other using a reliable communication mechanism to ensure that all participating nodes can receive and acknowledge the information.
Agreement Checking: Once votes are cast, the system checks if the number of votes meets or exceeds the quorum size. If it does, the decision is considered made, and the system proceeds accordingly.
Handling Failures: The system must also handle cases where a quorum cannot be established (e.g., due to node failures or network partitions) by implementing fallback mechanisms, such as retrying the operation or electing new leaders. Additionally, distributed systems are usually tolerant to losing one or more nodes, up to the point where a quorum can no longer be established. At that point, the system might elect to go offline rather than continue to operate; the choice depends on the CAP trade-offs, specifically between A and C.
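As referenced above, here is a small sketch of the majority-quorum arithmetic: with N = 5 nodes the quorum size is 3, the cluster keeps deciding after losing two nodes, and can no longer decide anything once a third is lost.

```go
package main

import "fmt"

// quorumSize returns the minimum number of votes needed for a majority
// in a cluster of n nodes: floor(n/2) + 1.
func quorumSize(n int) int {
	return n/2 + 1
}

// decided reports whether a proposal gathered enough votes to be accepted.
func decided(votes, clusterSize int) bool {
	return votes >= quorumSize(clusterSize)
}

func main() {
	const clusterSize = 5
	fmt.Println(quorumSize(clusterSize)) // 3: any two majorities must overlap in at least one node

	// Two nodes are down or unreachable; the remaining three can still
	// form a quorum and keep making decisions.
	fmt.Println(decided(3, clusterSize)) // true

	// Lose one more node and no quorum is possible: a CP-leaning system
	// would stop accepting writes rather than risk a split brain.
	fmt.Println(decided(2, clusterSize)) // false
}
```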
In practice, establishing a quorum and achieving consensus in a distributed system involves dealing with various challenges, including network latency, node failures, and the CAP theorem's implications. The choice of consensus algorithm and quorum size can significantly affect the system's performance, reliability, and consistency.
This post might come across as an attempt to split hairs. After all, the system illustrated in my diagram is a distributed one, comprised of multiple services running on different machines and coordinating amongst each other. However, I doubt that the services in my diagram are concerned with any of the challenges and constraints that distributed databases or storage products have to contend with. Hence the distinction.
A good litmus test: if you find yourself all too familiar with the concepts in this book, then you’re experienced with distributed systems - at least according to my definition! And if you are, and want to work on a very novel one that solves hard security problems, please reach out to me. I’m hiring software engineers who have experience building distributed systems.