TiDB Auto Scaling: Why Distributed SQL for Cloud-Native Apps

The ability to scale quickly and efficiently in response to varying workloads is a pivotal feature for any database system. Auto scaling is a capability that allows databases to adjust their computational resources automatically. Benefits include improved performance during sudden workload surges and cost-effectiveness during periods of lower demand.

Developed by PingCAP, TiDB is an advanced, open-source, distributed SQL database that provides horizontal scalability, strong consistency, and high availability. Its cloud-native architecture separates compute from storage for efficient and seamless auto scaling, regardless of cloud platform.

In this post, we’ll explore the architecture and operational details of TiDB’s auto-scaling capabilities, demonstrated through a real-world setup on Amazon Web Services (AWS).

Exploring the auto-scaling architecture

TiDB embraces a cloud-native architecture with automatic horizontal scaling on different cloud platforms. For this demo, we built TiDB’s auto-scaling solution using various AWS services, including Auto Scaling Group, CloudWatch, EventBridge, Lambda, Simple Queue Service, and Network Load Balancer.

Figure 1. Auto-scaling architecture

Here are the key components of our demo:

Auto Scaling Group: Each TiDB component is encapsulated within its own EC2 Auto Scaling Group. This allows for independent configuration adjustments per component, including EC2 instance type and EBS family.
Resource monitoring and management: AWS CloudWatch monitors resource usage in real time while the EC2 Auto Scaling Group executes scaling policies for EC2 instances. This combination ensures automatic and responsive scaling adjustments.
Custom event listener: A custom event listener program interacts with TiUP, the component management utility of TiDB, to automate the management and scaling of the different components within the TiDB cluster.
Front-end network load balancer: To enhance client interactions, a network load balancer mitigates the impact of the SQL layer’s scaling operations on the clients. This keeps TiKV server’s horizontal scaling operations isolated from the client, enabling uninterrupted client application functionality during auto-scaling processes.
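
The event listener's core job, turning a scaling event into a TiUP invocation, can be sketched roughly as follows. This is a minimal illustration and not the demo's actual listener: the function names, the topology file path, and the default ports (4000 for TiDB, 20160 for TiKV) are assumptions. Only the `tiup cluster scale-out` and `tiup cluster scale-in --node` subcommands and the `tidb-demo` cluster name are taken from TiUP and from this demo.

```python
from __future__ import annotations

# Hypothetical sketch: map a scaling action to a TiUP command line.
# Function names, file paths, and ports are illustrative assumptions.

DEFAULT_PORTS = {"tidb": 4000, "tikv": 20160}  # assumed default service ports


def scale_out_topology(component: str, host: str) -> str:
    """Return a minimal scale-out topology (YAML text) for one new node."""
    if component not in DEFAULT_PORTS:
        raise ValueError(f"unsupported component: {component}")
    return f"{component}_servers:\n  - host: {host}\n"


def tiup_command(action: str, component: str, host: str,
                 cluster: str = "tidb-demo",
                 topology_path: str = "scale-out.yaml") -> list[str]:
    """Build the argv for a TiUP scale-out or scale-in call."""
    if component not in DEFAULT_PORTS:
        raise ValueError(f"unsupported component: {component}")
    if action == "scale-out":
        # topology_path is expected to contain scale_out_topology(...) output.
        return ["tiup", "cluster", "scale-out", cluster, topology_path, "--yes"]
    if action == "scale-in":
        # TiUP identifies a node to remove by its host:port pair.
        node = f"{host}:{DEFAULT_PORTS[component]}"
        return ["tiup", "cluster", "scale-in", cluster, "--node", node, "--yes"]
    raise ValueError(f"unsupported action: {action}")
```

A real listener would also need to resolve the private IP of the launched or terminated EC2 instance and write the topology file before running the command. Those steps are left out here.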

TiDB auto scaling in action

Now that we’ve outlined the architecture, let’s dive into a real-time demo of TiDB’s auto scaling capability.

Initial cluster setup and demo design

We begin our demonstration with a cluster consisting of three TiKV nodes (storage layer) and two TiDB nodes (stateless SQL layer).

We then initiate a client program that continuously inserts data into our TiDB cluster. At the same time, we keep a terminal window dedicated to querying the data inserted into TiDB, allowing us to observe the system behavior in real-time.

Figure 2. Initial cluster setup

To trigger auto scaling in the TiDB Auto Scaling Group, we use a straightforward metric: average CPU utilization. We aim to keep average CPU utilization around 30 percent. If it stays above this threshold for a sustained period, a scale-out is triggered and EC2 instances are added. Conversely, if it falls below this threshold, the number of EC2 instances drops accordingly. This is known as scaling in.
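
To make the idea of holding a metric near a target concrete, here is a small sketch of a proportional sizing rule of the kind many target-tracking autoscalers use. It is an assumption for illustration, not AWS's published algorithm or the demo's configuration. The `min_nodes` and `max_nodes` bounds are hypothetical values.

```python
import math


def desired_capacity(current_nodes: int, avg_cpu_percent: float,
                     target_percent: float = 30.0,
                     min_nodes: int = 2, max_nodes: int = 6) -> int:
    """Estimate how many nodes keep average CPU near the target.

    Assumes load spreads evenly, so total work is proportional to
    current_nodes * avg_cpu_percent. Bounds are hypothetical.
    """
    if current_nodes < 1 or target_percent <= 0:
        raise ValueError("need at least one node and a positive target")
    raw = math.ceil(current_nodes * avg_cpu_percent / target_percent)
    # Clamp to the group's configured size limits.
    return max(min_nodes, min(max_nodes, raw))
```

Under this rule, two nodes averaging 60 percent CPU would call for four nodes, and four nodes averaging 10 percent would fall back to the minimum of two.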

Note: While we use CPU utilization as the auto-scaling trigger, you can use any appropriate metrics as long as they can be collected and aggregated in the actual environment.

To ensure the stability of the client application, we adopt an aggressive strategy for scale-out, and a conservative one for scale-in.
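
One way to express "aggressive out, conservative in" is to require only a few consecutive high readings before scaling out but many consecutive low readings before scaling in. The sketch below illustrates that idea. The class name and the period counts are assumptions, not the values used in the demo.

```python
class AsymmetricScaler:
    """Scale out after few high readings, scale in only after many low ones."""

    def __init__(self, target: float = 30.0,
                 out_periods: int = 2, in_periods: int = 10) -> None:
        self.target = target
        self.out_periods = out_periods  # assumed: short wait before scale-out
        self.in_periods = in_periods    # assumed: long wait before scale-in
        self._high = 0
        self._low = 0

    def observe(self, avg_cpu_percent: float) -> str:
        """Feed one metric datapoint; return 'scale-out', 'scale-in', or 'hold'."""
        if avg_cpu_percent > self.target:
            self._high += 1
            self._low = 0   # any high reading resets the scale-in streak
        else:
            self._low += 1
            self._high = 0  # any low reading resets the scale-out streak
        if self._high >= self.out_periods:
            self._high = 0
            return "scale-out"
        if self._low >= self.in_periods:
            self._low = 0
            return "scale-in"
        return "hold"
```

The longer scale-in streak means a brief dip in load does not remove capacity that a client may need again moments later.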

Auto scaling the SQL layer

To begin, in the EC2 console, we trigger an auto-scaling event by running a script that hogs the CPU of the existing TiDB servers in the SQL layer.

[ec2-user@ip-10-90-4-224 scripts]$ ./sd-002-csp-demo-hog-db.sh
tiup is checking updates for component cluster
Starting component 'cluster': /home/ec2-user/.tiup/components/cluster/v1.12.1/tiup-cluster display tidb-demo
hog 10.90.1.70
hog 10.90.2.73
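
The contents of the hog script are not shown here, but the general technique for generating CPU pressure is a busy loop on every core. The following is a hypothetical stand-in written for illustration. It is not the demo's script, and the function names are our own.

```python
import multiprocessing
import os
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def hog_all_cores(seconds: float) -> None:
    """Run one busy loop per CPU core, then wait for all of them to finish."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(os.cpu_count() or 1)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Calling `hog_all_cores(300)` on a server would keep its cores busy for about five minutes. Killing the worker processes releases the pressure early.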

The auto-scaling process starts when CPU utilization stays above 30 percent for a specified duration. The event listener program then captures the event and uses the TiUP utility to prepare a new EC2 instance and add it to the SQL layer of the TiDB cluster.

With the auto-scaling process in motion, we can observe the changes in real-time. The Auto Scaling Group adds new EC2 instances to alleviate the CPU pressure on the existing nodes. As a result, the number of TiDB servers increases from two to four, effectively scaling out the SQL layer.

Figure 3. Auto scale-out of TiDB

It is important to note that the automatic scalability of TiDB is intelligently designed to avoid disrupting client applications. The network load balancer plays a crucial role in mitigating interference between the SQL layer and client operations, ensuring a seamless experience.

From the load balancer’s perspective, it may take some time for the newly added TiDB servers to pass the health check. However, once all servers are up and running, they are seamlessly integrated into the cluster, providing increased capacity and performance.
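
The delay comes from the way load balancer health checks typically work: a target must pass several consecutive probes before it receives traffic. The sketch below imitates a TCP-style check against a server port. The threshold, interval, and function names are illustrative assumptions, not the demo's load balancer settings.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_healthy(host: str, port: int, healthy_threshold: int = 3,
                       interval: float = 0.5, max_checks: int = 20) -> bool:
    """Return True once `healthy_threshold` consecutive checks succeed."""
    streak = 0
    for _ in range(max_checks):
        # A failed probe resets the streak, as with a consecutive-success rule.
        streak = streak + 1 if port_is_open(host, port) else 0
        if streak >= healthy_threshold:
            return True
        time.sleep(interval)
    return False
```

For a new TiDB server, a call such as `wait_until_healthy("10.90.1.70", 4000)` would return only after the SQL port has accepted three probes in a row, assuming the server listens on port 4000.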

Figure 4. TiDB auto-scale status

Auto scale-in of TiDB and scale-out of TiKV

Now, let’s shift our focus to the TiKV layer. It is essential to understand that the horizontal scalability of the SQL layer and the TiKV layer are decoupled from each other. This decoupled design enables independent scalability based on specific requirements.

To demonstrate the independent scalability of the TiKV layer, we release the CPU pressure on the TiDB servers before we hog the CPU of the TiKV nodes. However, due to the conservative scale-in strategy we set initially, it will take a few minutes before the scale-in occurs.

[root@ip-10-90-1-70 ~]# kill -9 29629
[root@ip-10-90-1-70 ~]# kill -9 29561
[ec2-user@ip-10-90-4-224 scripts]$ ./sd-002-csp-demo-hog-kv.sh
tiup is checking updates for component cluster ...
Starting component 'cluster': /home/ec2-user/.tiup/components/cluster/v1.12.1/tiup-cluster display tidb-demo
hog 10.90.1.146
hog 10.90.2.51
hog 10.90.3.222

As the auto-scaling strategy for the TiKV layer is also set at a threshold of 30 percent, the combined changes result in an automatic scale-in for the SQL layer and a scale-out for the TiKV layer.

During this process, the event listener program detects the auto scale-in event for the SQL layer and removes a TiDB server from the cluster. Meanwhile, the TiKV layer scales out by adding nodes to meet the increased workload. As we can observe, the number of TiKV nodes has now increased to five.

Figure 5. TiKV scale-out

Throughout this demo, our sample workload representing real-world business operations continues to run smoothly without any noticeable impact. TiDB’s auto-scaling capabilities ensure optimal resource utilization while keeping the system stable and efficient.

Auto scale-in of TiKV

Similar to the scale-in of the SQL layer, we first stop the CPU-hogging processes on the TiKV nodes. Following a conservative approach, we then allow a cool-down period to ensure stability. The event listener program identifies the reduced CPU utilization and triggers the removal of excess resources. In the end, the TiKV layer scales in from five nodes to three.

Figure 6. TiKV auto scale-in

Summary of the auto-scaling demo

Component	Initial nodes	Nodes after scale-out	Nodes after scale-in
SQL Layer (TiDB)	2	4	2
Storage Layer (TiKV)	3	5	3

During periods of high demand, the SQL layer (TiDB) seamlessly scaled from 2 to 4 nodes, accommodating the increased workload. Once the surge subsided, the system automatically scaled back to its original configuration, maintaining efficiency and cost-effectiveness. Similarly, when the storage layer (TiKV) experienced overload, the number of nodes expanded from 3 to 5, and later reduced to 3 as the workload normalized.

Throughout this demo, there was no negative impact on the running workload. This demonstrates TiDB’s ability to handle dynamic scaling without disrupting business operations.

Conclusion

With TiDB’s auto-scaling solution, we observed how the system intelligently adapts to workload fluctuations, ensuring optimal performance and resource utilization.

Compared to traditional databases like MySQL, TiDB’s auto-scaling capabilities streamline the scaling process and eliminate the need for manual configuration changes or complex sharding techniques. Similarly, while managed database services like Amazon Aurora offer auto-scaling features, TiDB’s decoupled scaling sets it apart, providing granular control over the SQL and storage layers.

Want to reproduce this demo? You can find all the related scripts, the AWS CloudFormation template, and a step-by-step guide on GitHub.

If you’d like to dive deeper into auto scaling or any of the concepts mentioned in this post, sign up for our free, on-demand course, Introduction to TiDB.
