
Backup and restore are critical for ensuring business continuity, with the Recovery Time Objective (RTO) serving as a key metric for assessing restore performance. As TiDB continues to grow in popularity for its scalability, many users now operate datasets that reach hundreds of terabytes (TB), making a fast RTO for such large clusters increasingly difficult to guarantee. Starting with TiDB 8.1 LTS, however, recovering clusters of this magnitude in under an hour has become a reality. This blog dives into the performance improvements, challenges overcome, and innovations that make TiDB 8.1 a leader in large-scale cluster recovery.

Revolutionizing Cluster Recovery Performance with TiDB 8.1

TiDB 8.1 takes full cluster restore performance to the next level by fully leveraging hardware capabilities. Each TiKV node consistently achieves disk throughput limits of ~1.2 GB/s, ensuring smooth, linear scalability when adding more nodes. This innovation removes common bottlenecks in disk and network throughput, allowing enterprises to handle even the largest datasets with confidence.

Cluster Recovery Performance at a Glance

  • Restore speeds are calculated as total data size divided by total restore time, encompassing all stages of the process.
  • During the critical download and ingestion phase, the system consistently hits hardware throughput limits of ~1.2 GB/s per TiKV node, showcasing maximum achievable performance with current hardware.
  • Starting with TiDB 8.1, full cluster restores exhibit exceptional scalability, both horizontally and vertically, with average restore throughput improving by 3–4x or more compared to TiDB 7.5 in some cases.

| Dataset | TiKV Nodes | TiDB 7.5 Performance | TiDB 8.1 Performance |
|---------|------------|--------------------------|--------------------------|
| 110 TB  | 50         | 100 MB/s per TiKV (6.4h) | 1.15 GB/s per TiKV (41m) |
| 300 TB  | 90         | 274 MB/s per TiKV (3.5h) | 1.02 GB/s per TiKV (56m) |

Cluster recovery performance at a glance for TiDB.

Key Enhancements Driving These Results

  • Hardware Utilization: TiKV nodes now consistently reach disk throughput limits of ~1.2 GB/s, ensuring optimal scaling when adding more nodes.
  • Restore Phase Innovations: Efficient region splitting and enhanced ingestion workflows maximize hardware and network utilization.
  • Scalability Improvements: TiDB 8.1 overcomes common bottlenecks, ensuring that restore performance scales predictably even for massive datasets.
  • Future Optimization Potential: While small datasets per TiKV node can still experience underutilization, this points to opportunities for additional refinements.

These advancements highlight TiDB 8.1’s ability to deliver reliable scalability and predictable performance for even the most demanding workloads.

Breaking Barriers in Cluster Recovery Scalability

Before TiDB 8.1, the restore process followed a rigid pipeline approach, interweaving three key stages to maximize resource utilization:

  1. Table Creation: Identifying SST files and creating tables was an excruciatingly slow process. Creating 300,000 tables could take up to 60 hours, stalling the overall restore timeline.
  2. Region Splitting & Scattering: Each batch of newly created tables required its regions to be split and scattered, and because these tasks often concentrated on a single node, they frequently became inefficient.
  3. Data Download & Ingestion: Data files were downloaded and ingested into the respective regions, completing the pipeline for each batch.

This pipeline was originally necessary to keep resources busy while waiting for slower stages, such as table creation. However, after TiDB 7.6 introduced faster table creation (reducing the time for 300,000 tables to just 30 minutes), the rigid pipeline became a bottleneck. The pipeline’s single concurrency parameter limited the system in two key ways, as the sketch after this list illustrates:

  • Low Concurrency (e.g., 128 or fewer): TiKV nodes were underutilized as data arrived too slowly, leading to inefficient CPU and network usage.
  • High Concurrency (e.g., 2048 or more): Region splitting and scattering couldn’t keep up, creating a single point of contention. This led to excessive time spent balancing regions rather than ingesting data, diminishing the expected performance gains from additional resources.
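
To make the contention concrete, here is a minimal Go sketch (not BR’s actual code; names and timings are illustrative) of a pipeline in which a single semaphore gates every batch, coupling split/scatter and download/ingest to the same knob:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batch stands in for one batch of newly created tables flowing through the
// old, rigid restore pipeline (illustrative only).
type batch struct{ id int }

// restoreWithSinglePipeline admits at most `concurrency` batches at a time.
// The same knob gates both split/scatter (centralized, saturates when the
// knob is large) and download/ingest (per-TiKV, starves when it is small).
func restoreWithSinglePipeline(batches []batch, concurrency int) {
	sem := make(chan struct{}, concurrency) // the single concurrency parameter
	var wg sync.WaitGroup
	for _, b := range batches {
		sem <- struct{}{} // wait for a free pipeline slot
		wg.Add(1)
		go func(b batch) {
			defer wg.Done()
			defer func() { <-sem }()
			splitAndScatterRegions(b) // stage 2 of the pipeline
			downloadAndIngest(b)      // stage 3 of the pipeline
		}(b)
	}
	wg.Wait()
}

// Placeholder stage implementations.
func splitAndScatterRegions(b batch) { time.Sleep(5 * time.Millisecond) }
func downloadAndIngest(b batch)      { time.Sleep(5 * time.Millisecond) }

func main() {
	batches := make([]batch, 16)
	for i := range batches {
		batches[i] = batch{id: i}
	}
	restoreWithSinglePipeline(batches, 4)
	fmt.Println("restored", len(batches), "batches (toy example)")
}
```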

Innovations in TiDB 8.1

TiDB 8.1 introduces a phased restoration model that tackles these bottlenecks:

  • Rapid Table Creation: Optimizations reduced table creation time to just 30 minutes for 300,000 tables, eliminating this as a bottleneck.
  • Efficient Region Splitting and Scattering:
    • Coarse Splitting: Selects a subset of boundary keys to split regions and scatters them across TiKV nodes for early workload balancing.
    • Detailed Splitting: Further refines region distribution, ensuring even resource utilization and preventing hotspots.
  • Balanced Data Restoration:
    • Token Bucket Mechanism: Dynamically controls gRPC request rates to prevent node overload and maintain steady performance.
    • Independent Streams for Download and Ingestion: Decouples these operations, allowing better resource allocation and improved scalability.

These advancements enable TiDB to achieve linear scaling during restores, making it possible to restore a 100 TB cluster in just one hour.
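
The overall ordering of the phased model can be summarized with a short Go sketch. The phase names mirror the list above, but the functions are placeholders rather than BR’s real API:

```go
package main

import "fmt"

// phase names follow the blog's description; the run functions are
// placeholders, not BR's real API.
type phase struct {
	name string
	run  func()
}

func main() {
	phases := []phase{
		{"create all tables", func() { /* batched DDL, ~30 min for 300k tables */ }},
		{"coarse split + scatter", func() { /* sampled boundary keys, spread across TiKV */ }},
		{"detailed split", func() { /* remaining boundaries, refined per node */ }},
		{"download + ingest", func() { /* token-bucket paced, independent streams */ }},
	}
	// Each phase completes cluster-wide before the next begins, so split and
	// scatter never compete with ingestion for a shared concurrency slot.
	for _, p := range phases {
		fmt.Println("running phase:", p.name)
		p.run()
	}
}
```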

Technical Deep Dive: Cluster Recovery Optimizations

To streamline region management, TiDB 8.1 introduces a two-step approach: coarse splitting and detailed splitting. Imagine managing a massive dataset divided into smaller chunks, each assigned to specific storage nodes. Coarse splitting acts as the initial pass, dividing the dataset into broad sections based on sorted keys and evenly distributing them across TiKV nodes. This step reduces preparation overhead and ensures that no single node is overloaded. Once the data is roughly balanced, detailed splitting refines the distribution by breaking these broad sections into smaller, more precise chunks, keeping resources fully utilized and preventing hotspots that could slow down operations.

  • Coarse Splitting: Divides data into broad sections and evenly distributes them across TiKV nodes, reducing preparation overhead and balancing workloads.
  • Detailed Splitting: Refines the distribution by breaking broad sections into smaller, more precise chunks, ensuring fully-utilized resources and no hotspots.
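
To illustrate the idea (not TiKV’s actual implementation), the sketch below samples every Nth sorted boundary key for the coarse pass, leaving the full set of boundaries for the detailed pass:

```go
package main

import (
	"fmt"
	"sort"
)

// sampleCoarseKeys picks every step-th boundary key from the sorted list,
// producing a small set of split points for the initial (coarse) pass.
func sampleCoarseKeys(sortedKeys []string, step int) []string {
	var coarse []string
	for i := step; i < len(sortedKeys); i += step {
		coarse = append(coarse, sortedKeys[i])
	}
	return coarse
}

func main() {
	// Boundary keys derived from backup SST files (toy data).
	keys := []string{"a", "c", "e", "g", "i", "k", "m", "o", "q", "s"}
	sort.Strings(keys)

	// Pass 1: coarse split on a sampled subset, then scatter the resulting
	// regions across TiKV nodes so every node has work early on.
	coarse := sampleCoarseKeys(keys, 4)
	fmt.Println("coarse split keys:", coarse) // e.g. [i q]

	// Pass 2: detailed split applies the remaining boundaries inside each
	// coarse range, now executed in parallel on the nodes that own them.
	fmt.Println("detailed split keys:", keys)
}
```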

Workload Balancing with Token Buckets

TiDB 8.1 manages workload distribution using a token bucket mechanism, a method commonly used in databases to regulate the flow of requests. Think of each node as a worker handling incoming tasks. Tokens act as permits, allowing tasks to proceed at a controlled pace. As each request is processed, a token is consumed. Tokens automatically replenish at regular intervals, ensuring that the system continues to operate smoothly. This approach prevents resource bottlenecks and ensures a steady and predictable throughput.

  • Dynamic Rate Control: Limits gRPC requests to prevent resource bottlenecks.
  • Automatic Replenishment: Tokens are refilled periodically, ensuring smooth processing without manual intervention.
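
A minimal, generic token bucket in Go makes the idea concrete. This is a simplified sketch, not TiKV’s implementation, and the capacity and refill interval are arbitrary:

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal illustration: a buffered channel holds the
// available tokens, and a background goroutine refills it at a fixed rate.
type tokenBucket struct {
	tokens chan struct{}
}

func newTokenBucket(capacity int, refillEvery time.Duration) *tokenBucket {
	tb := &tokenBucket{tokens: make(chan struct{}, capacity)}
	for i := 0; i < capacity; i++ {
		tb.tokens <- struct{}{} // start with a full bucket
	}
	go func() {
		for range time.Tick(refillEvery) {
			select {
			case tb.tokens <- struct{}{}: // add a token if below capacity
			default: // bucket is full; drop this refill
			}
		}
	}()
	return tb
}

// take blocks until a token is available, pacing the caller's requests.
func (tb *tokenBucket) take() { <-tb.tokens }

func main() {
	bucket := newTokenBucket(4, 50*time.Millisecond)
	for i := 0; i < 8; i++ {
		bucket.take() // each (mock) gRPC request consumes one token
		fmt.Println("sending request", i)
	}
}
```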

Independent Request Streams

In TiDB 8.1, download and ingest operations are handled separately to optimize resource usage. Picture a system where data is first retrieved from storage (download) and then processed and stored in its final location (ingest). By separating these tasks, the system can maximize efficiency. Download streams evenly distribute data fetching across all nodes, ensuring that network resources are fully utilized. Meanwhile, ingest streams focus on processing this data within leader peers, avoiding overload on any specific node and ensuring smooth and efficient operations throughout the cluster.

  • Download Streams: Evenly distribute data fetching across nodes to maximize network utilization.
  • Ingest Streams: Handle data ingestion on leader peers, avoiding overload on specific nodes.
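
Conceptually, the decoupling looks like two worker pools joined by a queue, each sized independently. The Go sketch below is an assumption-level illustration rather than BR’s code:

```go
package main

import (
	"fmt"
	"sync"
)

// sstFile stands in for a backed-up SST file to restore (illustrative type).
type sstFile struct{ name string }

// restore runs downloads and ingestion as independent worker pools joined by
// a channel, so each side can be sized to its own bottleneck (network for
// downloads, leader-side I/O for ingestion).
func restore(files []sstFile, downloadWorkers, ingestWorkers int) {
	toDownload := make(chan sstFile)
	downloaded := make(chan sstFile, len(files))

	// Download stream: spread fetches across all nodes and connections.
	var dl sync.WaitGroup
	for w := 0; w < downloadWorkers; w++ {
		dl.Add(1)
		go func() {
			defer dl.Done()
			for f := range toDownload {
				fmt.Println("downloaded", f.name) // placeholder for the real fetch
				downloaded <- f
			}
		}()
	}

	// Ingest stream: a separately sized pool applies files on leader peers.
	var ing sync.WaitGroup
	for w := 0; w < ingestWorkers; w++ {
		ing.Add(1)
		go func() {
			defer ing.Done()
			for f := range downloaded {
				fmt.Println("ingested", f.name) // placeholder for the real ingest
			}
		}()
	}

	for _, f := range files {
		toDownload <- f
	}
	close(toDownload)
	dl.Wait() // all downloads finished
	close(downloaded)
	ing.Wait() // all ingestions finished
}

func main() {
	files := []sstFile{{"a.sst"}, {"b.sst"}, {"c.sst"}, {"d.sst"}}
	restore(files, 4, 2)
}
```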

Best Practices for Large-Scale Restores

To maximize restore performance with TiDB 8.1, follow these best practices:

  • Allocate Sufficient vCPU Resources: Set import.num-threads to about 60% of TiKV vCPU cores to fully utilize compute capacity. For instance, allocate 10 threads for a 16-core TiKV instance (see the example configuration after this list).
  • Avoid Hardware Bottlenecks: Operate at 90% of hardware throughput to maintain stability and consistency.
  • Use Ample BR Memory: Allocate enough memory to minimize garbage collection during snapshot restores.
  • Start Fresh: Use a brand-new cluster instead of repurposed clusters with dropped databases to avoid performance issues during table ID rewriting.
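
As a concrete example of the first bullet, import.num-threads is set in the TiKV configuration file; on a 16-vCPU node the 60% guidance works out to roughly 10 threads (the value below is illustrative, so tune it to your own hardware):

```toml
# tikv.toml (excerpt) -- illustrative value: roughly 60% of a 16-vCPU node
[import]
num-threads = 10
```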

Future Enhancements

TiDB’s roadmap includes exciting features to further improve restore performance:

  • PITR Optimization: A 3x throughput improvement for Point-in-Time Recovery (PITR), enabling rapid, event-based recovery.
  • Granular Restore Control: Applying cluster-level optimizations to table-level restores, allowing parallel tasks with better resource isolation and SLA adherence.

Conclusion

TiDB 8.1 redefines large-scale data restoration with innovations that deliver unparalleled performance and scalability. By addressing bottlenecks in table creation, region splitting, and data ingestion, TiDB enables enterprises to restore massive datasets in record time. With future enhancements on the horizon, TiDB continues to lead the way in database innovation, empowering businesses to meet the most demanding recovery objectives with confidence.

If you have any questions about TiDB’s cluster recovery capabilities, please feel free to connect with us on Twitter, LinkedIn, or through our Slack Channel.

