In this post, I attempt to relate Vertica's distributed system properties to the well-known CAP theorem and compare its fault tolerance with the HDFS block storage mechanism.
The CAP theorem, as originally presented by Brewer @ PODC 2000, reads:
It is impossible for a web service to provide the following three guarantees: consistency, availability, and partition tolerance.
The CAP theorem is useful from a system engineering perspective because distributed systems must pick two of the three properties to provide and give up the third. A system that “gives up” on a particular property makes a best effort but cannot provide solid guarantees. Different systems choose to give up on different properties, resulting in different behavior when failures occur. However, there is a fair amount of confusion about what the C, A, and P actually mean for a system.
- Partition-tolerance – A network partition results in some node A being unable to exchange messages with another node B; more generally, it is the inability of nodes to communicate. Systems that give up on P assume that all messages are reliably delivered and that nodes never go down. In pretty much any context in which the CAP theorem is invoked, the system in question supports P.
- Consistency – For these types of distributed systems, consistency means that all operations submitted to the system are executed as if in some sequential order on a single node. For example, if a write is executed, a subsequent read will observe the new data. Systems that give up on C can return inconsistent answers when nodes fail (or are partitioned). For example, two clients can read the same item and each receive a different value.
- Availability – A system is unavailable when a client does not receive an answer to a request. Systems that give up on A will return no answer rather than a potentially incorrect (or inconsistent) answer. For example, unless a quorum of nodes is up, a write will fail.
Vertica is a stateful distributed system and thus worthy of consideration under the CAP theorem:
- Partition-tolerance – Vertica supports partitions. That is, nodes can fail or messages can fail to be delivered and Vertica can continue functioning.
- Consistency – Vertica is consistent. All operations on Vertica are strongly ordered – i.e., there is a singular truth about what data is in the system and it can be observed by querying the database.
- Availability – Vertica is willing to sacrifice availability in pursuit of consistency when failures occur. Without a quorum of nodes (over half), Vertica will shut down since no modification may safely be made to the system state. The choice to give up availability for consistency is a very deliberate one and represents cultural expectations for a relational database as well as a belief that a database component should make the overall system design simpler. Developers can more easily reason about the database component being up or down than about it giving inconsistent (dare I say … “wrong”) answers. One reason for this belief is that a lack of availability is much more obvious than a lack of consistency. The more obvious and simple a failure mode is, the easier integration testing will be with other components, resulting in a higher quality overall system.
In addition to requiring a quorum of up nodes, each row value must be available from some up node, otherwise the full state of the database is no longer observable by queries. If Vertica fully replicated every row on every node, the database could function any time it had quorum: any node can service any query. Since full replication significantly limits scale-out, most users employ a replication scheme which stores some small number of copies of each row – in Vertica parlance, K-Safety. To be assured of surviving any K node failures, Vertica will store K+1 copies of each row. However, it’s not necessary for Vertica to shut down the instant more than K nodes fail. For larger clusters, it’s likely that all the row data is still available. Data (or Smart) K-Safety is the Vertica feature that tracks inter-node data dependencies and only shuts down the cluster when node failure actually makes data unavailable. This feature achieves a significant reliability improvement over basic K-Safety, as shown in the graph below.
The key reason Data K-Safety scales better is that Vertica is careful about how it arranges the replicas to ensure that nodes are not too interdependent. Internally, Vertica arranges the nodes in a ring and adjacent nodes serve as replicas. For K=1, if node i fails, then nodes i-1 and i+1 become critical: failure of either one will bring down the cluster. The key takeaway is that for each node that fails, a constant number (2) of new nodes become critical, whereas in the basic K-Safety mechanism, failure of the Kth node makes all N-K remaining nodes critical! While basic K=2 safety initially provides better fault tolerance, the superior scalability of Data K=1 Safety eventually dominates as the cluster grows in size.
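To make the criticality argument concrete, here is a minimal Monte Carlo sketch of the two shutdown rules. The node counts, trial counts, and function names are illustrative rather than Vertica internals; Data K-Safety is modeled with the ring placement described above plus the quorum requirement.

```python
import random

def survives_basic(n, k, failed):
    # Basic K-Safety: the cluster shuts down once more than K nodes are down.
    return len(failed) <= k

def survives_data(n, k, failed):
    # Data K-Safety with ring placement: node i's data also lives on nodes
    # i+1 .. i+K (mod n). The cluster stays up as long as quorum holds and
    # no node has lost every copy of its data.
    down = set(failed)
    if len(down) >= (n + 1) // 2:  # more than half the nodes must stay up
        return False
    return not any({(i + j) % n for j in range(k + 1)} <= down for i in range(n))

def survival_probability(n, k, failures, survives, trials=20_000):
    # Monte Carlo estimate of P(cluster stays up) after `failures` random node failures.
    ok = sum(survives(n, k, random.sample(range(n), failures)) for _ in range(trials))
    return ok / trials

# Example: 100 nodes, 3 simultaneous random failures.
print("basic K=2:", survival_probability(100, 2, 3, survives_basic))  # 0.0
print("data  K=1:", survival_probability(100, 1, 3, survives_data))   # ~0.94
```

With 3 random failures on a 100-node ring, basic K=2 always shuts down, while Data K=1 keeps all data available roughly 94% of the time (it only fails when two of the failed nodes happen to be ring neighbors).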
Here we can draw an interesting comparison to HDFS, which also provides high availability access to data blocks in a distributed system. Each HDFS block is replicated and by default stored on three different nodes, which would correspond to a K of 2. HDFS provides no coordination between the replicas of each block: the nodes are chosen randomly (modulo rack awareness) for each individual block. By contrast, Vertica storing data on node i at K=2 would replicate that data on nodes i+1 and i+2 every time. If nodes 3, 6, and 27 fail, there is no chance that this brings down a Vertica cluster. What is the chance that it impacts HDFS? Well, it depends on how much data is stored – the typical block size is 64MB. The graph below presents the results of simulated block allocation on a 100 node cluster with replication factor of 3, computing the probability of a random 3-node failure making at least one block unavailable.
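That simulation is easy to reproduce. The sketch below is my rough reconstruction under the same assumptions (100 nodes, replication factor 3, uniformly random replica placement, no rack awareness; the function name and block count are mine): a 3-node failure makes a block unavailable exactly when the failed set matches that block's replica set.

```python
import random
from math import comb

def p_block_unavailable(nodes=100, replication=3, num_blocks=819_200, seed=0):
    # Place each block's replicas on `replication` distinct nodes chosen
    # uniformly at random (no rack awareness), then ask: for a uniformly
    # random failure of `replication` nodes, what is the chance it wipes out
    # every copy of at least one block? That happens exactly when the failed
    # set equals some block's replica set.
    rng = random.Random(seed)
    used = {frozenset(rng.sample(range(nodes), replication)) for _ in range(num_blocks)}
    return len(used) / comb(nodes, replication)

# 50 TB in 64 MB blocks is roughly 819,200 blocks (before replication).
print(p_block_unavailable())  # ~0.99 on a 100-node cluster
```

With that many blocks, nearly every one of the C(100, 3) = 161,700 possible replica sets is in use, so almost any 3-node failure hits some block.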
Assuming that you’re storing 50TB of data on your 100-node cluster, the fault tolerance of HDFS is effectively the same as a basic K=2 Vertica cluster – namely, if any 3 nodes fail, some block is highly likely to be unavailable. Data K-Safety with K=1 provides better fault tolerance in this situation. And here’s the real kicker: at K=1, we can fit 50% more data on the cluster because we store two copies of each row instead of three!
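As a back-of-envelope check of both claims (the 64 MB block size and 50 TB figure come from above; the closed-form approximation treats each block's placement as independent):

```python
from math import comb

nodes, block_mb, data_tb = 100, 64, 50
num_blocks   = data_tb * 1024**2 // block_mb   # ~819,200 blocks of user data
replica_sets = comb(nodes, 3)                  # 161,700 possible 3-node replica sets

# Each block occupies one random replica set, so a random 3-node failure
# misses every block with probability (1 - 1/161,700)^num_blocks.
p_loss = 1 - (1 - 1 / replica_sets) ** num_blocks
print(f"P(some block unavailable) ~ {p_loss:.2f}")  # ~0.99

# K=1 keeps 2 copies of each row; basic K=2 (and default HDFS) keep 3 copies,
# so the same raw disk holds 3/2 = 1.5x as much user data at K=1.
print("usable-capacity gain at K=1:", 3 / 2)
```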
This comparison is worth a couple extra comments. First, HDFS does not become unavailable if you lose a single block – unless it’s the block your application really needs to run. Second, nodes experience correlated failures, which is why HDFS is careful to place replicas on different racks. We’ve been working on making Vertica rack-aware and have seen good progress. Third, these models assume that the mean-time-to-repair (MTTR) is short relative to the mean-time-to-failure (MTTF). In the case of a non-transient failure, HDFS re-replicates the blocks of the failed node to any node that has space. Since Vertica aggressively co-locates data for increased query performance, it uses a more significant rebalance operation to carefully redistribute the failed node’s data to the other nodes. In practice, the recovery or rebalance operation is timely relative to the MTTF.
In conclusion, Vertica uses a combination of careful implementation and careful data placement to provide a consistent, fault-tolerant distributed database system. The comparisons above show that these design choices yield a system that is both highly fault tolerant and very resource efficient.
- The CAP theorem was proved by Gilbert and Lynch in 2002 in the context of stateful distributed systems on an asynchronous network.