Some Notes on the DynamoDB 2022 Paper

It’s been a long time since DynamoDB published a paper. A few days ago, I read its newly published one “Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service”^[1], and I think it’s one of the most practical papers in recent years for large-scale distributed systems from an engineering perspective. It doesn’t even explain much about the architecture because it doesn’t need to. Shared-nothing systems look similar, and the authors know exactly who the readers of this paper are ^_^. What’s more, the paper is written simply, and there’s no math!

After all, DynamoDB doesn’t need to try to “prove” anything. It has fully proven itself in the past 10 years, both in terms of scale and stability. I found that the paper is relatively new, and not too many blogs talk about it, so why don’t I write something about it?

Predictable Performance > High Performance

This is the first point I put in this post, and it is also the point that touches me most deeply when we recently built TiDB Cloud. I always believe that stable slow is better than unstable fast. 99 latency is more reflective of system design skills than average latency.

I don’t know if this is intentional, but it is also presented in the first section of the DynamoDB 2022 paper, so you can see the importance.

DynamoDB wants to provide predictable performance. The first step is to abstract the workload, which introduces the concept of read capacity unit (RCU) and write capacity unit (WCU). In fact, RCU and WCU are very close to queries per second (QPS) in the traditional sense: only the size of the target item is added, so that you can do relatively accurate workload planning. For example, 1 WCU = 1 KB item’s 1 QPS. When the user can describe the workload in terms of RCU and WCU, the first step to predictability is complete. DynamoDB’s scheduler can do a lot of things like pre-partitioning and pre-allocating resources, because the hardware capabilities for different models are simply abstracted into a combination of WCUs and RCUs.

Once you know the quota of each partition, it is probably a backpack problem for the scheduling. DynamoDB considers the sum of the quotas of the partitions of the same machine. The sum should be less than the total RCUs or WCUs that the machine can provide when allocating the quota of partitions, which is about 20%-30% from the example given in the paper. In the real world, inexperienced developers or system administrators will usually squeeze the last bit of CPU and I/O out of the machine to pursue the “ultimate” performance. They must see 100% CPU usage before they are satisfied. However, in such cases the machine is already in a very unhealthy state. The long tail latency of requests will become very high, even though there may be an increase in throughput, but because of this unstable long tail, the observed performance from the user’s perspective will be “unpredictable.” In a production environment, we recommend over-provisioning about 30% of the machines for the same reason.

Burst is a simple idea. When allocating a quota, DynamoDB reserves some capacity for each partition. When the traffic spikes in the short term, it uses the reserved capacity. Adaptive capacity dynamically adjusts the quotas of different partitions after the user’s workload is skewed (but the total amount cannot exceed the total quota of the table).

It is important to note that burst and adaptive capacity are based on an assumption that the user’s workload does not change much, and also that the flow control is focused on the partition level (almost at the storage node level), that is, local scheduling.

In a large-scale storage system, flow control is actually a global problem. Using TiDB as an example, TiDB’s SQL layer is stateless. The requests are forwarded to TiKV, the storage server of TiDB. The TiDB SQL layer is a kind of “request router” (using the term from the paper), but if multiple TiDB SQL nodes are deployed, flow control is only done on TiKV (the storage layer). In extreme cases, TiDB SQL nodes might still keep hitting the overloaded TiKV node with requests. To solve this problem, you need to do flow control in the TiDB layer, return errors directly to the client on the TiDB layer, and not penetrate the overloaded TiKV nodes.

This part in DynamoDB’s paper is a bit vague. My understanding of the strategy is that the request router periodically applies a request quota to global admission control (GAC). GAC maintains a global allocation policy and, if a partition has been overloaded, the corresponding request router can directly deny service to customers to protect the rest of the system. The flow control at the partition level is also kept as the last line of defense at the node level.

For a shared-nothing storage system, the key to scale-out is the sharding strategy. DynamoDB chooses a dynamic sharding strategy, just like TiKV. It also uses range split, but the difference is that TiKV’s Region (similar to DynamoDB’s partition concept) uses size as the default splitting threshold. TiDB 6.0 introduced the Load Based Split strategy, and DynamoDB directly adopts Load-based Split. Partitions have a default throughput threshold, which is split when it exceeds this value. Once you start monitoring the state of the load distribution over the key range, it is easy to get the optimal splitting point (which is not always the midpoint).

Another important splitting concept is when to avoid it. The paper mentions:

Single row hotspot, which is well understood.
The access pattern is the sequential order of keys (similar to iteration on Key). In this case, DynamoDB avoids splitting.

To summarize, the core of DynamoDB’s predictable performance:

Uses a more accurate abstraction of workloads (WCU and RCU), instead of simple transactions per second (TPS), QPS, and data size.
Pre-allocates quotas to partitions and strictly limits the flow.
Leaves a margin on the node level as a temporary resource pool for unexpected scenarios (Burst).
Uses global information for traffic scheduling and does flow control at all levels.

Failover with as little observable impact as possible

Let’s first talk about write-ahead logging (WAL). DynamoDB, like TiKV, replicates logs to multiple replicas (the paper implies that the default is 3) via a distributed consensus algorithm (Multi-Paxos for DynamoDB and Raft for TiKV). However, DynamoDB also synchronizes WALs (it’s not a DB snapshot) to S3 periodically for higher durability. I understand that this is not only for higher durability, but also for point-in-time recovery (PITR).

Another interesting detail about failover is that when a node in DynamoDB’s replication group fails, for example, if one of the 3 copies fails, the group leader immediately adds a log replica to the replication group. The log replica is actually what we often call a witness. I will also use witness below instead of log replica. This is a node that only stores logs and does not provide services. In fact, this is a very smart approach, since in the above case, although it also meets the majority, the system is very fragile at this time. To completely restore a new member, the time is usually longer (first copy DB snapshot and then apply the recent logs), especially if the partition snapshot is relatively large. The process of copying the snapshot may introduce extra pressure to existing peers.

Adding a witness is low cost. The time mentioned in the paper is in seconds. It can at least ensure the security of the logs during data recovery. What’s more, for cross Availability Zone (AZ) deployment scenarios, this optimization can also reduce the write latency in the failover phase. For example, we have two AZs, one of which is the primary. We call it AZ-A, and it carries the main write traffic. Another AZ, AZ-B, is a standby for disaster recovery (or serves some local read traffic). When a node in AZ-A hangs, the Raft group of the data in this node does not stop writing (after leader reelection). However, according to the classical Raft, to meet the requirements of the majority, it must ask the peers in AZ-B and make sure the log persists successfully. It then could return the success to the client, which means the performance jitter is observed from the client (until a new peer in AZ-A is added). When we detect a node failure, immediately find a healthy node in AZ-A as a witness, and add it to this unhealthy group, the write to AZ-A can still achieve majority, which saves the latency of synchronizing log to AZ-B’s replica. From the client’s point of view, the system does not show significant jitter.

How DynamoDB deals with failover

To reduce observable impact during failover, DynamoDB also improves Leader election for replication groups. For large-scale systems, network jitter or network partitioning are common. For example, let’s say there’s a peer in a replication group called peer X. Once X cannot connect to the group leader, according to the traditional election protocol, it rashly initiates a new election with a bigger voting term. Other peers stop voting for it. Observed from the user side, this part of the data is unavailable (service will be stopped during the election for data consistency), but the old leader might still be alive.

This is a common problem, but the solution is very straightforward. TiKV 2.1 introduced an optimization called Prevote. When a peer wants to initiate an election, it first asks other peers whether they will vote for it, which essentially confirms whether the older leader is alive or available. If not, then the peer can launch a new election. The DynamoDB paper mentions a similar mechanism. Before a peer initiates a new election, it asks other peers whether they also think the old leader is disconnected, and if not, it means that it is the candidate’s own problem, and it does not affect the normal nodes.

DynamoDB ‘s leader election process

The worst failure of a large system is a cascading failure. The DynamoDB failure in 2015^[2]is a typical one. The improvements mentioned in this paper for the metadata service remind me of this case. (I guess it was probably because of this case, those improvements were made.) The solution is very intelligent, so I’ll paraphrase it a little.

DynamoDB observed that one of the root causes of cascading failures is traffic mutation in a short time. One of the common factors that cause traffic mutation is cache failure. Although we mostly think that the higher the cache hit rate, the better (the paper mentions that the cache hit rate of the partition router table is about 99.75%), such a high cache hit rate means that when there is a cache failure (or a cache warm-up phase when a large number of new nodes join), the metadata service must be able to carry 400 times the traffic surge (worst case, 0.25% → 100%). DynamoDB solves this problem in following ways:

DynamoDB adds a level of distributed memory cache MemDS in the middle of the request router and the metadata service. After the local cache of the request router is missed, it does not access the meta service directly but accesses MemDS first. Then MemDS accesses the metadata service in the background to fill in the data. By adding a layer of cache for peak-shaving, it is equivalent to adding another layer of insurance, which is a common way.
The second way is very smart. I just mentioned that the request router gets metadata through MemDS when the request did not hit the cache in MemDS. This is easy to understand. But what is really smart is that even if the cache hits, MemDS will also asynchronously access metadata service. The reasons are:

It ensures that the existing cache in MemDS is updated as soon as possible.
It brings “stable” traffic to the metadata service (although it may be larger).

The ‘stable’ but larger traffic is, for example, the equivalent of playing in the water so that you can have good confidence when a flood comes. 🙂

3. GAC based on limited token (capacity) reduces the impact of cascading failures.

In addition, for services on the cloud, one big advantage is that updates are released faster than traditional enterprise software. However, the deployment of a new release is usually the most vulnerable time of the system, and DynamoDB, as a large-scale system, is unlikely to do offline updates. For rolling updates, the old and new versions of nodes will coexist during the update process. Therefore, the new version needs to communicate with the nodes running in the old version and then switch to the new protocol after all nodes have deployed the new version.

DynamoDB’s work on stability and failover can be summarized in one phrase: minimizing the observable client-side impact, which I think is part of the “predictable” performance.

Database ≠ Database as a Service

I would say the Dynamo described in the paper a decade ago is more like a DB (laugh). This DynamoDB is actually a real Database as a Service (DBaaS). You may wonder what the difference is. I think building a DBaaS is not simply deploying multiple database instances to the cloud and hosting them. From the user’s point of view, the endpoint provided by a DBaaS may act like a database instance, but under the hood, the implementation may not be so straightforward. Take an extreme example: let’s say a DBaaS provides SQLite as a service. I think it is unlikely that it will really create a new container for every user, provision the environment, and then start an SQLite process to expose it publicly. It is likely to be a shared service and just behave the same as SQLite externally to better utilize the resources.

So to build a DBaaS, the first thing to consider is multi-tenancy. The reason DynamoDB has to be redesigned is that the old Dynamo does not support multi-tenancy. This is unacceptable for cloud services. When TiDB was transitioning to the cloud, we learned that cloud-native is not simply moving a database to the cloud and deploying it for different users. It requires a deep transformation from the kernel to the control platform for the capabilities and environment provided by the cloud.

Another significant difference between DBaaS and a database is that DBaaS is often difficult to deploy locally. In fact, modern DBaaS is built with many microservices or heavily dependent on features provided by cloud vendors (especially storage and security-related services). You can also see in the DynamoDB paper that the request router is a service, in charge of connections from different tenants; GAC is a service; the authentication system is a service; metadata is a service; and storage is a service. This is not to mention the dependencies on other services like S3/IAM and so on. It’s interesting that the paper doesn’t mention any EC2 or EBS, which makes me guess that DynamoDB’s hardware infra is probably maintained by itself; that is, it runs on a bare-metal machine.

Takeaways

For TiDB, the problem is a little bit more complex than DynamoDB. After all, TiDB Cloud provides SQL service. For example, if a user types a SELECT * FROM table, it’s challenging to calculate the RCU (especially since TiDB pushes down computing). However, this calculation is not impossible; maybe I can write about this topic in the future. TiDB Cloud has recently separated the control platform as an independent service, and the session management service (similar to DynamoDB’s request router) has been recently split from TiDB’s kernel codebase. So, the DynamoDB paper reinforces our judgment about the importance of the path of unbundling and microservices transformation for building a cloud database.

Finally, here are my takeaways from this paper:

The more you understand the abstraction of workloads, the better it is to build predictable systems. The more granular the measurement of workloads, the more room you have to make money (or save costs).
Consider multi-tenancy for your system from the beginning, and from a global perspective.
On the cloud, the access layer is very important. There are many things we can do in the access layer to improve the predictability of the system, including flow control, high availability, tenant isolation, and non-stop updates.
For abstract scheduling, do flow control at different layers.
Use microservices to build a platform that achieves multitenancy.
The use of cloud infrastructure will save a lot of work, such as S3

After reading this paper, I feel that there are many things that it didn’t address, including serverless architectures, GC strategy, and distributed transactions (ACID), but these don’t stop this paper from being a classic. I learned a lot from this paper. If any of the readers of this blog are on the DynamoDB team, please let me know, I’d be happy to buy you a beer. (If you’re in the Bay Area. 😉)

Keep reading:
Real-World HTAP: A Look at TiDB and SingleStore and Their Architectures
Speeding up a SaaS Platform by 5x in a Day without Changing the Code
Build a Better Github Insight Tool in a Week? A True Story

References:

Book a Demo