
Atlassian, a global leader in project management and team collaboration, offers widely adopted products like Jira, Confluence, and Forge. As Atlassian’s SaaS platforms scale, so too do their data infrastructure needs, such as scaling millions of database tables efficiently and cost-effectively. That’s where TiDB comes in.

This post explores how TiDB empowers Atlassian’s Forge platform to support a massive, multi-tenant architecture with over 3 million tables—all within a single cluster. We’ll dive into the technical challenges Atlassian faced, the innovations TiDB introduced, and the performance breakthroughs that resulted.

The Challenge: Scaling Millions of Tables

Atlassian employs a one-schema-per-tenant (database = schema) model for its SaaS offerings. This means each customer gets their own schema with dozens to hundreds of tables. As the customer base expands, the number of tables can balloon into the tens of millions, something traditional single-node databases struggle to handle. Some applications also grow 30–40% annually, requiring support for 1.5 to 30 million tables per app.

Figure 1. A diagram showing a typical representation of a one-schema-per-tenant (database = schema) model for SaaS.
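To make the model concrete, here is a minimal SQL sketch of what one-schema-per-tenant looks like in practice; the tenant and table names are hypothetical and are not Atlassian's actual schema.

```sql
-- Illustrative only: one database (schema) per tenant, each with the same
-- set of application tables. Names are hypothetical.
CREATE DATABASE tenant_0001;
CREATE DATABASE tenant_0002;

CREATE TABLE tenant_0001.issues (
    id         BIGINT PRIMARY KEY AUTO_INCREMENT,
    title      VARCHAR(255) NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE tenant_0001.comments (
    id       BIGINT PRIMARY KEY AUTO_INCREMENT,
    issue_id BIGINT NOT NULL,
    body     TEXT
);

-- With tens of thousands of tenants and dozens of tables each, the total
-- table count quickly climbs into the millions.
```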

Atlassian faced the following core issues during business expansion:

  • Skyrocketing costs: Scaling millions of tables using traditional single-node databases meant deploying thousands of instances. This was both expensive and inefficient.
  • Operational complexity: Managing this sprawl introduced difficulties in configuration, scaling, backup, and upgrades. This made it hard to maintain SLAs and ensure uptime.

To overcome these issues, Atlassian turned to TiDB, beginning with its Forge platform.

Putting TiDB to the Test: Scaling Millions of Tables in a Single Cluster

Atlassian Forge became the first application to implement TiDB at scale. The goal: Host three million tables in a single TiDB cluster — no small feat.

The following sections will outline the major challenges and optimization practices across several key areas.

DDL Performance: Turning Days Into Minutes

Schema changes were painfully slow—creating 1 million tables took two days, and altering 100,000 tables took over six hours.

Root causes and optimizations included:

  • Inefficient DDL task scheduling: DDL tasks were processed sequentially, incurring unnecessary scheduling overhead.
  • Slow database/table existence checks: Schema validation sometimes relied on slower fallback mechanisms.
  • Underutilized computing resources: TiDB nodes were not fully leveraged for concurrent execution.
  • Inefficient broadcasting mechanisms: Schema changes propagated across nodes inefficiently, causing delays.
  • Refactored DDL framework: the DDL framework was redesigned to remove its previous scalability ceiling, so DDL throughput can grow with the cluster. A client-side sketch follows this list.
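One client-visible way to benefit from faster table creation in recent TiDB releases is the tidb_enable_fast_create_table system variable. It is not named in this post, so treat the sketch below as an assumption about the relevant knob; the table name is hypothetical and defaults may differ by version.

```sql
-- Assumption: tidb_enable_fast_create_table is available in the TiDB version
-- in use; it batches concurrent CREATE TABLE statements for higher DDL throughput.
SET GLOBAL tidb_enable_fast_create_table = ON;

-- Tenant tables can then be provisioned concurrently from many sessions
-- (names are illustrative).
CREATE TABLE tenant_0001.audit_log (
    id         BIGINT PRIMARY KEY AUTO_INCREMENT,
    actor      VARCHAR(128),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Long-running schema changes can be monitored with:
ADMIN SHOW DDL JOBS;
```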

Atlassian enjoyed the following performance gains after implementing the optimizations:

| Comparison Item | Before Optimization | After Optimization |
| --- | --- | --- |
| Creating 100,000 tables | 3h49m | 4m (50X faster) |
| Creating 3 million tables | More than 6 days | 4h30m (50X faster) |
| Creating 100,000 databases | 8h27m | 15m (32X faster) |
| Adding columns to 100,000 tables | 6h11m | 32m (11X faster) |
| Adding a single-column index to a table with 10,000 rows in a 3-million-table scenario | 20m | 3s (400X faster) |

For more details on DDL performance, please refer to our latest blog on the topic.

DML Efficiency: Faster Reads and Writes

Inefficient metadata caching forced DML statements to repeatedly fetch database and table info during execution. As the number of databases and tables grew, performance dipped, reducing queries per second (QPS) and increasing latency.

Additionally, large table counts led to many empty Regions during import or backup recovery. Merging these Regions consumed resources and impacted online QPS.

Root causes included:

  • The time complexity for tidb-server to look up metadata for databases and tables in the metadata cache was O(n). This led to longer lookup times as the number of databases and tables increased.
  • The existing logic for merging empty Regions followed the TiKV Ingest process, which required significant resources when merging a large number of empty Regions.

As a result, the following optimizations were made:

  • Optimize the metadata cache storage structure to reduce lookup time complexity to O(1), improving metadata query efficiency during DML operations in scenarios with millions of tables.
  • Write KV data directly when merging empty Regions instead of using the Ingest process, making the merges far more efficient. The effect is visible from the SQL layer, as sketched below.
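Both changes are internal to TiDB and TiKV, but their effect can be observed from the SQL layer. Below is a rough way to count empty Regions, assuming the information_schema.TIKV_REGION_STATUS table and treating Regions with zero approximate keys as empty.

```sql
-- Rough count of empty Regions; APPROXIMATE_KEYS is an estimate, so this is
-- an approximation rather than an exact figure.
SELECT COUNT(*) AS empty_regions
FROM information_schema.TIKV_REGION_STATUS
WHERE APPROXIMATE_KEYS = 0;
```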

Atlassian saw significant performance gains after the optimizations. These included:

| Comparison Item | Before Optimization | After Optimization |
| --- | --- | --- |
| QPS | 5,000 | 40,000 |
| P99 latency | 62.5ms | 3.91ms |
| Time to merge 2.5 million Regions | 20h | 2h |

Metadata Management: Keeping Memory in Check

TiDB originally loaded all schema metadata into memory at startup, leading to long startup times and, at the scale of millions of tables, high memory usage and potential OOM issues.

Root causes included:

  • TiDB stored metadata in TiKV, its row storage layer, as “Meta Maps”, while each TiDB instance maintained an in-memory Infoschema Cache. DDL operations updated metadata in TiKV and pushed a new schema version to PD. Other TiDB nodes fetched this version and applied the corresponding Schema Diff from TiKV to update their cache.
Figure 2. How TiDB stored metadata in TiKV as “Meta Maps”, while each TiDB instance maintained an in-memory Infoschema Cache.
  • DML, statistics, foreign keys (FK), TTL, and TiFlash all relied on the Infoschema Cache. Many components called ListTables to load all metadata even though only a few tables were actually accessed. In large-scale scenarios, this full loading wasted memory and increased the risk of OOM.

This led to the following optimizations:

  • Introduced the tidb_schema_cache_size parameter to limit memory usage, only caching metadata for accessed objects. This parameter uses LRU to evict unused metadata.
  • Avoided loading all schema metadata at once, reducing TiDB startup time and OOM risk. A hedged usage sketch of tidb_schema_cache_size follows this list.
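A minimal sketch of how this cache limit can be inspected and adjusted, assuming the tidb_schema_cache_size system variable described above; the 1 GiB value is purely illustrative.

```sql
-- Inspect the current schema cache limit (in bytes).
SELECT @@global.tidb_schema_cache_size;

-- Illustrative only: cap the per-instance schema metadata cache at 1 GiB.
-- Metadata for tables that are not accessed is evicted via LRU.
SET GLOBAL tidb_schema_cache_size = 1073741824;
```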

As a result of these optimizations, in a scenario with 1 million databases and 3 million tables, Atlassian was able to:

  • Run a simulated read/write load on 300,000 active tables (a 1:10 active-tenant ratio) without prepared statements.
  • Achieve a 99.5% cache hit rate and a P99 latency of 6.3 ms with the default tidb_schema_cache_size (512 MiB), while significantly reducing memory usage without sacrificing performance.

Query Optimizer: Smarter, Leaner, Faster

With millions of tables, initializing stats took over 10 minutes, delaying per-table collection. Even worse, stats and plan caches consumed significant memory, increasing out of memory (OOM) risk.

Root causes included:

  • Low concurrency collection: Versions before TiDB 8.4 only supported a single goroutine for automatic statistical information collection, resulting in low throughput.
  • Time-consuming priority queue construction: Prior to TiDB 8.5, each stats task required building a new priority queue, often requiring large metadata loads from TiKV—adding significant latency.
  • Excessive statistical information loading: SQL execution triggered loading of full stats for all table columns and indexes, driving up memory usage.

These root causes led to the following optimizations:

  • Increased concurrency: TiDB 8.4 introduced the tidb_auto_analyze_concurrency system variable, allowing users to raise concurrency and improve statistics collection throughput.
  • Optimized priority queue construction: stats queues are now built once at startup and updated incrementally, cutting repeated metadata loads.
  • Reduced statistics loading: only relevant columns and indexes are loaded during SQL execution. TiDB 8.5 also reduced the default stats cache memory quota from 50% to 20%.
  • Instance-level shared execution plan cache: the tidb_enable_instance_plan_cache system variable allows all sessions within a TiDB instance to share the execution plan cache, improving memory utilization. A hedged sketch of these settings follows this list.
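A minimal sketch of how these knobs might be applied, assuming the tidb_auto_analyze_concurrency and tidb_enable_instance_plan_cache system variables described above (TiDB 8.4+); the concurrency value and table name are illustrative only.

```sql
-- Assumption: both variables exist in the TiDB version in use (8.4+).
SET GLOBAL tidb_auto_analyze_concurrency = 8;    -- more parallel auto-analyze workers
SET GLOBAL tidb_enable_instance_plan_cache = ON; -- share the plan cache across sessions

-- Statistics for a specific hot table can still be refreshed on demand
-- (table name is hypothetical).
ANALYZE TABLE tenant_0001.issues;
```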

As a result of these many optimizations, Atlassian experienced:

  • Collection efficiency improvement: Stats throughput improved 80x (from 20 to 1,600 tables/minute).
  • Reduced memory usage: after optimization, a simple SQL query involving the IndexRangeScan operator across 100,000 tables consumes approximately 4 GB of memory, a significant reduction.
  • Reduced resource consumption: In an idle cluster with 1 million tables, CPU consumption for automatic statistical information collection dropped from 230% to 130%, and memory consumption decreased from 2.4 GB to 0.

Placement Driver (PD): Handling Millions of Heartbeats

Millions of databases and tables generated a large number of Regions. These Regions periodically report heartbeats to TiDB’s placement driver (PD), creating a massive number of heartbeat requests that put pressure on PD services. This can cause the Region routing module to become unavailable.

Root causes included:

  • Region heartbeats include metadata (e.g., Region location and Peer information) and statistical information (e.g., access traffic). The processing logic for both was mixed together and synchronous: updating metadata required holding a write lock, and reading routing information required the same lock. This led to inefficient processing of large heartbeat volumes, severe lock contention, and potential unavailability of the Region routing module or even the entire PD service.

This led to the following optimization:

  • Decouple the storage structures for Region metadata and statistical information, splitting their processing logic into multiple asynchronous tasks to reduce lock holding time.

After this optimization, the P99 for Region heartbeat processing dropped from 100ms to 2ms, easily supporting 10 million Regions.

Faster Cluster Maintenance

In scenarios with 3 million tables, TiDB node restarts took 30 minutes. What’s more, in scenarios with millions of databases and tables, TiDB nodes could run out of memory; OOM occurred, for example, even in an idle cluster with 500,000 tables.

Root causes included:

  • TiDB nodes cached metadata and statistical information for all databases and tables. During restarts, this data needed to be rebuilt and reloaded, which was time-consuming. For example, loading statistical information for 230,000 tables took 6 minutes.

As a result, the following optimizations were made:

  • Introduced the lite-init-stats configuration item to load only the necessary statistical information during TiDB restarts, skipping histograms, TopN, and Count-Min Sketch data; detailed statistics load asynchronously when needed.
  • A new metadata management mechanism enables TiDB nodes to cache only part of the database/table metadata, loading just what is necessary during restarts.
  • TiDB 8.1 added the concurrently-init-stats option for concurrent initialization of statistical data, greatly improving startup speed. A hedged way to inspect these settings follows this list.
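Both startup-related switches are TiDB configuration-file items rather than session variables; one way to verify how a running cluster is configured, assuming the item names used in recent TiDB releases, is sketched below.

```sql
-- The corresponding entries in the TiDB configuration file would look roughly like:
--   [performance]
--   lite-init-stats = true
--   concurrently-init-stats = true
-- Assumption: item names match recent TiDB releases; verify on a live cluster with:
SHOW CONFIG WHERE type = 'tidb' AND name LIKE '%init-stats%';
```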

Atlassian saw significant performance gains after implementing these optimizations, as illustrated in the table below:

| Comparison Item | Before Optimization | After Optimization |
| --- | --- | --- |
| TiDB metadata management | In a cluster with 700,000 tables, creating a table loaded the full InfoSchema, consuming 64 GB of memory and causing OOM. | Creating a table no longer triggers full InfoSchema loading, ensuring stable resource usage and preventing TiDB OOM. |
| TiDB restart time | With 3 million tables, a TiDB restart took approximately 30 minutes. | Restart time dropped from 30 minutes to 3 minutes, a 90% reduction. |

Backup and Restore (BR): Efficiency at Scale

A full backup of 1 million tables originally took 4 hours, and a full restore took 12 hours. BR full restore also generated millions of Regions, which TiDB then had to spend additional resources maintaining, and merging those millions of empty Regions took over 10 hours.

Root causes included:

  • During each round of backup, BR distributed backup tasks to TiKV based on table-level concurrency. Due to uneven task distribution, the time for each round of backup was limited by the slowest TiKV execution time, leaving other TiKV CPU resources idle and prolonging the overall backup time. This was particularly evident in scenarios with millions of tables.
  • During BR restore, a large number of Regions were pre-split based on tables and scattered across TiKV nodes. In many SaaS scenarios, a large number of tables were small, with data sizes much smaller than the TiDB Region size. In such cases, multiple tables can be merged into a single Region.

To best solve these issues, the following optimizations were implemented:

  • Send backup requests for all ranges to all TiKV nodes, fully utilizing TiKV backup threads to pace the backup. This significantly reduces long-tail latency and overall backup time.
  • During BR pre-split, merge small tables into a single Region based on table data size, avoiding the creation of a huge number of Regions in scenarios with millions of tables. An illustrative SQL-level sketch follows this list.
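The optimizations above live inside BR and TiKV; purely as an illustration of the same operations at the SQL layer, TiDB also exposes BACKUP and RESTORE statements. The S3 path below is hypothetical.

```sql
-- Hypothetical storage path; the BR tool (br backup / br restore) is the usual
-- choice at this scale, with the same underlying mechanism.
BACKUP DATABASE * TO 's3://example-bucket/tidb-backup-2025-05-01';

-- Restoring the same snapshot:
RESTORE DATABASE * FROM 's3://example-bucket/tidb-backup-2025-05-01';
```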

After these optimizations, Atlassian enjoyed significant improvements, as shown in the table below:

| Comparison Item | Before Optimization | After Optimization | Improvement |
| --- | --- | --- | --- |
| Time to back up 1 million tables | 4 hours | 1 hour | 75% reduction in time |
| Time to restore 1 million tables | 12 hours | 9 hours | 25% reduction in time |
| Empty Regions generated during backup/restore | Several million | Fewer than 100 | Over 99% reduction in empty Regions |

Scaling Millions of Tables: The Business Value for Atlassian

Beyond technical performance, TiDB delivers substantial business value for Atlassian. By solving the scalability and operational bottlenecks of legacy systems, TiDB has enabled Atlassian to streamline its SaaS architecture, reduce costs, and improve service reliability. Here are the key benefits realized:

  • Significantly reduced costs: Atlassian has already adopted TiDB to support Forge. As more applications migrate to TiDB, the number of database instances can be reduced to 1/100 of the originally projected count, significantly lowering infrastructure and operational costs.
  • Simplified development and operations: A single TiDB cluster can support 3 million tables, allowing a medium-sized SaaS application to be supported by just one TiDB cluster without the need for complex sharding based on tenant IDs. Additionally, TiDB’s lock-free Online DDL feature eliminates the need for third-party tools or low-load windows to perform schema changes, further reducing development costs.
  • Rapid business growth with zero downtime: TiDB’s distributed architecture supports seamless horizontal scaling, easily handling high QPS and large data volumes. Its online scaling capabilities allow flexible responses to traffic spikes, while rolling restarts and fast backup recovery for millions of tables significantly improve SaaS service availability and SLAs.

Together, these advantages underscore how TiDB goes beyond being just a performant database and becomes a strategic enabler for Atlassian’s SaaS growth. By consolidating infrastructure, reducing complexity, and boosting reliability, TiDB empowers Atlassian to scale faster, innovate more freely, and deliver a better experience to its customers worldwide.

Conclusion

Atlassian’s journey with TiDB illustrates how rethinking data infrastructure can unlock unprecedented scale, performance, and efficiency for modern SaaS platforms. By addressing the technical challenges of scaling millions of tables, TiDB has transformed Forge into a truly scalable, multi-tenant powerhouse.

More than a performance boost, TiDB enables Atlassian to simplify its architecture, reduce operational burdens, and future-proof its platform as customer demands grow. With TiDB at the core, Atlassian is not only keeping pace with growth but leading it.

As data volumes and application complexity continue to rise across the SaaS industry, Atlassian’s success with TiDB sets a powerful example of what’s possible when innovation meets the right technology foundation.

Ready to scale your SaaS platform without compromise? Get in touch with our database experts to see how we can help you build for the future.

