Introduction

You run a multi-tenant SaaS platform. Most days are quiet. But then one tenant surges and everyone else slows down. Latency SLOs slip, dashboards go red, and your team scrambles with throttles and hot fixes. That’s the noisy-neighbor problem in the database tier: a single burst monopolizes CPU and IO, leaders pile onto a few stores, and well-behaved tenants inherit the wait.

There are two common architectures to tackle this:

  • Shared-Schema: all tenants share the same tables, and their data is isolated by a tenant_id column. This approach typically relies on row-level security (RLS); a minimal sketch follows this list. PostgreSQL is usually the first choice for this design.
  • Schema/Database-per-Tenant: each tenant gets its own schema or database. This design stands or falls on how many schemas and databases the platform can scale to.
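To make the shared-schema approach concrete, here is a minimal sketch using PostgreSQL-style row-level security; the orders table, tenant_id column, and app.tenant_id session setting are illustrative stand-ins rather than anything prescribed by this playbook:

-- Illustrative shared-schema isolation via RLS (PostgreSQL-style syntax)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::int);
-- Each application connection sets its tenant before querying:
SET app.tenant_id = '42';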

This playbook focuses on how to stop that spiral using the schema/database-per-tenant model with TiDB: how to detect floods quickly, contain impact without downtime, and harden the system so the next surge is a non-event.

Who’s This Playbook For: Storage Leads & SREs

Storage leads own the database platform (i.e., capacity planning, cost, replication/backup strategy, schema governance, and upgrade cadence) and must keep a multi-tenant MySQL estate efficient as it grows.

On the other hand, Site Reliability Engineers (SREs) own reliability and performance in production, including SLOs/error budgets, observability, incident response, runbooks, and change management.

Top pains include:

  • Latency SLO misses that appear “out of nowhere”
  • Over-provisioning just to survive peak bursts
  • On-call fatigue from throttle scripts and fire drills

TiDB’s multi-tenant controls mute noisy neighbors while cutting costs: enforceable per-tenant budgets, data placement by heat, and least-privilege access without rewrites.

What We’re Solving Today: The Noisy-Neighbor Spiral

When one tenant surges, shared database resources concentrate on a few hot leaders and stores. Queues form, p95 and p99 climb, and well-behaved neighbors miss their SLOs. Teams often react with emergency throttles or by carving out per-tenant databases, which removes contention but destroys pooled efficiency and inflates cost.

Tenant A Floods the Cluster — SLAs Melt

A burst of small writes or range scans saturates CPU and IO on a narrow set of regions and leaders. Followers then spend extra time catching up, and background compactions contend with foreground work. Neighbor queries now wait on the same thread pools and the same hot ranges.

Symptoms to Recognize

  • Sudden p95 and p99 jumps for unrelated tenants during one tenant’s event.
  • Spikes in CPU, write amplification, and scheduler wait while other tenants’ QPS stays flat.
  • Growing DDL or compaction backlogs that delay otherwise routine work.

Typical Pressure Paths

  • CPU saturation: Hot leaders spend cycles on transaction scheduling, Raft messaging, and compactions. Neighbor requests queue behind them.
  • IO starvation: Write bursts trigger flushes and compactions. Background tasks delay reads for other tenants.
  • Hotspot amplification: Skewed keys or time-aligned workloads push many requests to a small set of regions. Rebalancing lags the surge.

Hard Isolation = Silo Sprawl & $$$

Why Costs Explode

  • N× fixed overhead: Every tenant needs nodes for HA, backups, and observability. Three nodes per tenant yields 300 nodes at 100 tenants before scaling for load.
  • Low average utilization: Dedicated capacity sits idle. Unused CPU and IO cannot be shared.
  • Operational sprawl: Patching, upgrades, schema changes, and backups multiply with tenant count. Every automation must target many clusters.

Functional Trade-Offs

  • Cross-tenant features become harder. Shared catalogs, analytics, and global search require fan-out across many databases or a new data pipeline.
  • Capacity planning fragments. You size for individual peaks instead of pooling headroom. The result is more over-provisioning and slower onboarding.

Fire-Drill Ops & On-Call Burnout

What It Feels Like in Practice

  • SREs are forced into manual throttling, cgroup tweaks, proxy limit changes, and ad-hoc scripts.
  • Incidents trigger cross-team blame and executive pressure to “fix it now.”
  • The node count only moves in one direction. Burnout follows.

DIY Isolation (Without TiDB): More Metal, More Headaches

When teams face noisy neighbors without TiDB, they usually stitch together “DIY isolation.” It works, until it doesn’t: per-tenant silos, OS/proxy throttles, app-level fan-out for reporting, and expensive over-provisioning just to survive spikes. In this section, we’ll walk through each of these stopgaps so you can clearly see the trade-offs.

Carve Siloed Databases for Every Tenant

Spin up siloed MySQL clusters or Kubernetes namespaces per tenant. Isolation rises. Cost and toil rise faster.

# Example anti-pattern: per-tenant clusters that do not share headroom
kubectl create ns tenant-a
helm install mysql-tenant-a bitnami/mysql -n tenant-a

Hand-Tune OS, VM, and Proxy Limits

Hand-tune cgroups, proxies, and throttle scripts to chase traffic spikes.

# OS-level CPU cap (illustrative)
cgcreate -g cpu:/tenantA && echo 40000 > /sys/fs/cgroup/cpu/tenantA/cpu.cfs_quota_us

Rewrite App Logic for Cross-Tenant Joins

Cross-tenant reporting and shared catalogs become application code problems.

-- Fan-out reporting pattern that grows with tenant count
-- (multiple connections and unions per tenant database)
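With one database per tenant, even a simple cross-tenant report becomes a query the application must assemble itself. Here is a hedged sketch of that growth pattern, assuming per-tenant databases named tenantA and tenantB that each contain an orders table:

-- Illustrative fan-out: one UNION branch (or separate connection) per tenant database
SELECT 'tenantA' AS tenant, COUNT(*) AS order_count FROM tenantA.orders
UNION ALL
SELECT 'tenantB' AS tenant, COUNT(*) AS order_count FROM tenantB.orders;
-- ...and another branch for every additional tenant

Every tenant you onboard means editing this query, or the pipeline that generates it.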

Over-Provision or Pray During Spikes

Provision for worst case. Watch utilization stay low off-peak and bills stay high.

TiDB’s Multi-Tenant Shield (With TiDB)

TiDB gives you isolation without silos. In this section, we’ll explore how three native controls turn shared hardware into predictable, per-tenant performance.

Resource Groups: CPU & IO Budgets per Tenant in Seconds

Resource Groups let TiDB enforce per-tenant compute budgets using Request Units (RUs). You define tiers (e.g., bronze/silver/gold), and the scheduler throttles or shapes each tenant’s CPU/IO so a burst from one tenant won’t degrade others. Budgets are soft-capped (with optional bursting) and can be changed instantly—no app changes required.

CREATE RESOURCE GROUP rg_bronze RU_PER_SEC=3000  BURSTABLE;
CREATE RESOURCE GROUP rg_silver RU_PER_SEC=8000  BURSTABLE;
CREATE RESOURCE GROUP rg_gold   RU_PER_SEC=20000 BURSTABLE;
CREATE USER 'tenantA' IDENTIFIED BY '***';
ALTER USER 'tenantA' RESOURCE GROUP 'rg_gold';

Placement Rules: Pin Hot Data, Keep Cold Cheap

Placement Rules control where data lives and replicates. You can pin hot partitions to low-latency nodes (NVMe/“fast” zones) and park cold/archival data on cost-optimized storage—while keeping replica counts and resilience policies intact. This improves tail latency and reduces cost without changing application logic.

CREATE PLACEMENT POLICY p_hot  CONSTRAINTS='+zone=fast'  FOLLOWERS=2;
CREATE PLACEMENT POLICY p_cold CONSTRAINTS='+zone=cheap' FOLLOWERS=2;
-- Apply at table or partition granularity
ALTER TABLE orders PARTITION p0 PLACEMENT POLICY = p_hot;   -- hot tenant
ALTER TABLE orders PARTITION p1 PLACEMENT POLICY = p_cold;  -- archive
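To confirm the policies took effect and scheduling has converged, recent TiDB versions let you inspect placement directly from SQL; a quick check against the objects above might look like:

-- Review all placement policies and where they apply
SHOW PLACEMENT;
-- Or inspect one object at a time
SHOW PLACEMENT FOR TABLE orders PARTITION p0;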

Fine-Grained Role-Based Access Control (RBAC): Least Privilege, Zero App Changes

Role-Based Access Control in TiDB lets you enforce per-tenant and per-object permissions at the database layer without touching application code.

CREATE ROLE r_tenantA;
GRANT SELECT,INSERT,UPDATE,DELETE ON orders TO r_tenantA;
GRANT r_tenantA TO 'tenantA';
SET DEFAULT ROLE r_tenantA TO 'tenantA';

You create roles that encapsulate the minimum privileges needed (least privilege), grant those roles on specific tables/schemas, then assign the roles to tenant principals (users or service accounts). By setting a default role, every new session inherits the right permissions automatically, so existing connection strings, ORMs, and query paths continue to work.

Shared Hardware, Isolated Performance

Budgets cap noisy tenants. Placement keeps hot data close to fast nodes. RBAC limits blast radius. Result: shared hardware with isolated performance and no more screaming neighbors.

Implementation Flight Plan

The following step-by-step plan walks you through assessing hotspots and tenant QoS, configuring Resource Groups and placement, and rolling out, monitoring, and tuning in phased cohorts with clear validation gates.

Step 1: Assess Hotspots & Tenant QoS

Profile tenants. Identify burst shapes and skew.

-- Store CPU snapshot (highest first)
SELECT address, cpu_usage
FROM information_schema.tikv_store_status
ORDER BY cpu_usage DESC;

-- Leader distribution (higher = more leadership load)
SELECT store_id,
       SUM(CASE WHEN is_leader = 1 THEN 1 ELSE 0 END) AS leaders
FROM information_schema.tikv_region_peers
GROUP BY store_id
ORDER BY leaders DESC;

Step 2: Configure Resource Groups & Placement Labels

Set budgets and data placement via SQL. Label stores by zone or storage class.

-- Budgets
CREATE RESOURCE GROUP rg_silver RU_PER_SEC=8000 BURSTABLE;
ALTER USER 'tenantB' RESOURCE GROUP 'rg_silver';

-- Placement labels and policies
-- (Store labels are configured in cluster; policies used here)
CREATE PLACEMENT POLICY p_fast CONSTRAINTS='+zone=fast' FOLLOWERS=2;
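Store labels themselves live in the TiKV/PD topology configuration rather than in SQL, but before writing policies against a label you can confirm the cluster actually exposes it; in recent TiDB versions a quick sanity check is:

-- List the label keys and values PD currently knows about (e.g., zone=fast)
SHOW PLACEMENT LABELS;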

Step 3: Rollout, Monitor, Optimize

Phase rollout by tenant cohorts. Validate SLOs. Tune limits and placement.

-- Validate normalization after changes
SELECT digest_text, SUM_LATENCY/1000000 AS ms  -- SUM_LATENCY is reported in nanoseconds
FROM information_schema.statements_summary
ORDER BY ms DESC LIMIT 10;

Operational Best Practices

This section turns guardrails into daily habits. Think of these as the routines that keep noisy neighbors quiet long after the initial rollout.

TiDB Dashboard: Heatmaps, Top SQL, Tenant Lens

Use heatmaps to spot drift and Top SQL to identify offenders quickly.

-- Quick Top SQL view by total latency
SELECT digest_text, SUM_LATENCY/1000000 AS ms, exec_count  -- SUM_LATENCY is in nanoseconds
FROM information_schema.statements_summary
ORDER BY ms DESC LIMIT 20;

Backups with Resource Throttling

Throttle BR and restores so maintenance respects tenant budgets.

tiup br backup full \
  --pd "http://pd:2379" \
  --storage "s3://bucket/prefix" \
  --ratelimit 120

Capacity Forecasts & Cost Guardrails

Track RU per tier, leader skew, and compaction backlog. Right-size nodes and set guardrails for automatic scale-out.

-- Simple RU consumption by user (illustrative if exposed)
-- SELECT user, SUM(ru_consumed) FROM mysql.resource_usage GROUP BY user;
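If your TiDB version exposes it (v7.x and later), information_schema.resource_groups is an easy, scriptable way to review the budgets you have configured during periodic capacity reviews:

-- Review configured per-tenant budgets
SELECT NAME, RU_PER_SEC, BURSTABLE
FROM information_schema.resource_groups
ORDER BY NAME;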

Proving Noisy-Neighbor Isolation (Tenants A & B)

We’ll now run two concurrent sysbench loads to show how Resource Groups (and optional Placement Policies) keep one tenant’s surge from degrading another.

1. Minimal Setup

-- Two tenant DBs and users
CREATE DATABASE tenantA; CREATE DATABASE tenantB;
CREATE USER 'tenantA' IDENTIFIED BY '***';
CREATE USER 'tenantB' IDENTIFIED BY '***';
GRANT ALL ON tenantA.* TO 'tenantA';
GRANT ALL ON tenantB.* TO 'tenantB';

-- Optional: isolate hot tenant to faster nodes; keep others on standard tier
CREATE PLACEMENT POLICY p_fast  CONSTRAINTS='+zone=fast'  FOLLOWERS=2;
CREATE PLACEMENT POLICY p_std   CONSTRAINTS='+zone=std'   FOLLOWERS=2;
ALTER DATABASE tenantA PLACEMENT POLICY = p_fast;   -- premium / hot
ALTER DATABASE tenantB PLACEMENT POLICY = p_std;    -- standard

2. Prepare Data with sysbench (Run Once Per Tenant)

# Common vars
HOST=127.0.0.1 PORT=4000 PSWD=***
TABLES=24 SIZE=10000

# Tenant A (hot, write-heavy)
sysbench oltp_write_only --mysql-host=$HOST --mysql-port=$PORT \
  --mysql-user=tenantA --mysql-password=$PSWD --mysql-db=tenantA \
  --tables=$TABLES --table-size=$SIZE prepare

# Tenant B (steady, read-mostly)
sysbench oltp_read_only --mysql-host=$HOST --mysql-port=$PORT \
  --mysql-user=tenantB --mysql-password=$PSWD --mysql-db=tenantB \
  --tables=$TABLES --table-size=$SIZE prepare

Baseline Run: No Isolation (Expect Interference)

Run both for 5 minutes at the same time:

# Terminal 1 – Tenant A (write-heavy surge)
sysbench oltp_write_only --time=300 --threads=64 \
  --mysql-host=$HOST --mysql-port=$PORT \
  --mysql-user=tenantA --mysql-password=$PSWD \
  --mysql-db=tenantA --tables=$TABLES run

# Terminal 2 – Tenant B (read-only, steady)
sysbench oltp_read_only --time=300 --threads=64 \
  --mysql-host=$HOST --mysql-port=$PORT \
  --mysql-user=tenantB --mysql-password=$PSWD \
  --mysql-db=tenantB --tables=$TABLES run

Illustrative outcome (no Resource Groups):

Metric (5m avg)      Tenant A (write)   Tenant B (read)
Throughput (tps)     18,900             22,400 → 15,300 during A’s spikes
Avg latency (ms)     6.1                3.8 → 8.5 during A’s spikes
Top SQL share (%)    72%                28%

As the table shows, Tenant A’s write burst monopolizes CPU and IO, and Tenant B’s latency more than doubles during the spikes: a classic noisy neighbor.

Isolation Run: Enable Resource Groups (Seconds to Apply)

-- Tiered RU budgets; allow short bursts
CREATE RESOURCE GROUP rg_premium RU_PER_SEC=20000 BURSTABLE;
CREATE RESOURCE GROUP rg_standard RU_PER_SEC=8000  BURSTABLE;

-- Attach tenants (no app changes)
ALTER USER 'tenantA' RESOURCE GROUP 'rg_premium';
ALTER USER 'tenantB' RESOURCE GROUP 'rg_standard';

Now re-run the same sysbench commands concurrently.

Illustrative outcome (with Resource Groups):

Metric (5m avg)      Tenant A (write)   Tenant B (read)
Throughput (tps)     17,600 (shaped)    22,100 → 21,700 (stable)
Avg latency (ms)     6.8 (shaped)       3.9 → 4.2 (minor ripple)
Top SQL share (%)    65%                35%

As you can see, Tenant A still gets high throughput (within premium budget), but Tenant B’s latency stays stable. The budget prevents A from starving B.

What to Watch During the Run

In the TiDB Dashboard, keep an eye on:

  • Heatmaps: look for flat tail latency under load after enforcing budgets.
  • Top SQL: confirm heavy digests are attributed to Tenant A during its surge.
  • KV/Regions: ensure leader distribution isn’t skewed and that rebalancing stays stable.

You can also run a couple quick SQL checks:

-- Tenant-scoped view (replace with your statement fingerprint or user filter)
SELECT sample_user, digest_text,
       SUM(exec_count) AS execs,
       ROUND(SUM(sum_latency)/1000000) AS total_ms  -- sum_latency is in nanoseconds
FROM information_schema.statements_summary
GROUP BY sample_user, digest_text
ORDER BY total_ms DESC LIMIT 10;

-- Leader distribution sanity check
SELECT store_id,
       SUM(CASE WHEN is_leader=1 THEN 1 ELSE 0 END) AS leaders
FROM information_schema.tikv_region_peers
GROUP BY store_id
ORDER BY leaders DESC;

Backups and Maintenance that Respect Budgets

Keep “background” work quiet so it doesn’t reintroduce noisy-neighbor effects:

# Throttled BR full backup (respects rate limit)
tiup br backup full \
  --pd "http://pd:2379" \
  --storage "s3://bucket/prefix" \
  --ratelimit 120   # MB/s – tune per tier

Capacity Forecasts & Guardrails

Track consumption and set caps before tenants get noisy.

-- (Illustrative) RU by user if exposed in your env
-- SELECT user, SUM(ru_consumed) AS ru_total
-- FROM mysql.resource_usage
-- GROUP BY user ORDER BY ru_total DESC;

Policy tips:

  • Premium tenants → higher RU, fast tier placement.
  • Standard tenants → moderate RU, standard tier.
  • When a tenant upgrades/downgrades, it’s a SQL change, not a migration (see the example below).
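For example, promoting a tenant between tiers reuses the groups defined earlier and takes effect for new sessions without touching the application:

-- Move a tenant up a tier (a single statement, no data migration)
ALTER USER 'tenantB' RESOURCE GROUP 'rg_premium';
-- Or adjust a tier's budget in place
ALTER RESOURCE GROUP rg_standard RU_PER_SEC=12000 BURSTABLE;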

With Resource Groups (and optional Placement Policies), Tenant B’s experience remains consistent, even while Tenant A surges. That’s the day-two habit that keeps noisy neighbors quiet long after rollout.

Why TiDB Handles Noisy Neighbors Well

Noisy-neighbor incidents happen when one tenant’s burst steals CPU and IO from everyone else. TiDB gives you performance isolation on shared hardware, so you keep multi-tenant efficiency without sacrificing SLAs or blowing up costs.

  • Performance isolation without silos. Use Resource Groups (RU budgets) to give each tenant a guaranteed compute and IO envelope, plus optional bursting. One tenant’s surge is shaped before it harms others.
  • Lower cost through shared hardware and visible spend. Decoupled compute and object storage let you scale up and down in seconds rather than through migrations. RU metering shows per-query and per-tenant cost so you can cap noisy workloads and reduce over-provisioning.
  • Operational simplicity. All controls are SQL-first: create or change budgets, move tenants between tiers, and throttle background jobs without app changes. Online DDL and rolling upgrades keep change inside SLOs, so you do not need maintenance windows.
  • Hot where it matters, cost-efficient where it doesn’t. Placement Policies pin hot partitions to fast nodes and park cold data on cost-efficient storage while preserving replication and availability.
  • One platform for mixed workloads. OLTP and analytics run on the same data with TiKV (row storage) and TiFlash (columnar storage). Heavy reads move off the transactional hot path, which protects tail latency during tenant spikes.
  • Clean multi-tenant patterns. Schema or database per tenant scales to very large counts, keeps blast radius small, and makes auditing, residency, and recovery straightforward.

With TiDB, p95 and p99 remain predictable during bursts, tenants stop tripping over each other, and your team manages budgets, placement, and change with a few SQL statements instead of fire drills.

Proof in Production

Real systems, real stakes. In this section, you’ll discover how real-world teams tamed noisy neighbors at massive scale, consolidating fleets, enforcing per-tenant budgets, and shipping changes without downtime.

Atlassian: From Hundreds of Postgres Clusters to 16 TiDB Clusters with 50× Efficiency

Atlassian needed to run millions of tenant schemas and hundreds of tables per product while supporting per-tenant features (BYOK, data residency, PITR) and a sprawling plugin ecosystem. Traditional siloed Postgres fleets multiplied operational toil and made fleet-wide schema changes painfully slow. After evaluating distributed SQL options, Atlassian moved to TiDB to get deterministic multi-tenant performance, far denser bin-packing, and online evolution at Jira scale without breaking the developer experience.

Outcomes

  • Successfully tested with 4,000+ schemas across 3M+ tables within a single TiDB cluster.
  • ~50× improvement in bin-packing efficiency (many more tenants per cluster).
  • Fleet consolidation from hundreds of Postgres clusters to about 16 TiDB clusters (12 global regions + 4 regulated).
  • Zero-downtime major-version upgrades.
  • 6–7× faster DDL pipeline, enabling 24-hour fleet-wide schema evolution.
  • Stronger multi-tenant controls (resource isolation, placement) while reducing infra and ops costs.

Plaid: Faster Upgrades, Higher Uptime, Non-Disruptive Scale

Plaid’s high-growth payments and data-network services outgrew Amazon Aurora MySQL operationally: 300+ clusters and 800+ servers made upgrades a months-long project and write scaling difficult. TiDB’s MySQL compatibility, horizontal scale, and online schema changes let Plaid migrate service by service with a blue-green-red cutover pattern and feature-flag rollouts, preserving developer velocity while raising availability.

Outcomes

  • 41 services migrated in roughly a year with minimal disruption.
  • Non-disruptive scaling (reads and writes) and faster upgrades vs. Amazon Aurora MySQL.
  • Higher uptime and lower maintenance burden through online DDL and elastic scale-out.
  • Lower operational complexity by retiring dozens of bespoke Amazon Aurora clusters and procedures.

Next Steps

Ready to put this playbook into practice? Here are a couple of ways to take the next step with TiDB.

 

Launch a Free TiDB Cluster

Spin up a free TiDB sandbox and prove that noisy neighbors don’t wreck your SLOs. In minutes, you’ll validate per-tenant budgets, pin hot data to fast nodes, and watch neighbor latency stay flat during a surge.

Here’s what you’ll get:

  1. Validate Resource Groups and Placement Rules with your traffic shape
  2. See predictable per-tenant p95 without per-tenant silos
  3. Walk away with screenshots you can share with leadership

Start for Free

Book a Multi-Tenant Architecture Workshop

Turn this playbook into a roadmap. In a 60–90 minute working session with TiDB experts, we’ll map your tenant tiers, set RU budgets, and design data placement so you can protect SLOs on shared hardware without exploding fleet size.

Here’s what you’ll get:

  1. A tailored tenant tiering model
  2. A draft placement plan for hot vs. cold datasets
  3. A phased rollout strategy and concrete runbook updates

Request a Workshop