Introduction
You didn’t choose to spend your career maintaining a maze of MySQL shards. It happened the way it always happens.
First it was a single MySQL cluster with replicas for MySQL high availability. Then growth hit. Write throughput climbed. Tables ballooned. Indexes got heavier. Replication lag started showing up in the wrong dashboards at the wrong times.
So you “did the responsible thing” and implemented horizontal sharding. It worked. Until it didn’t.
Now you are living with the full tax of MySQL sharding:
- “Simple” product features that require a cross-shard query.
- Cross-shard joins that turn into app-side fan-out and data stitching.
- Re-sharding events that consume weekends (and confidence).
- A routing layer that is now a production dependency you cannot delete.
- A fleet of shards that multiplies backups, upgrades, schema changes, and on-call load.
This playbook is a practical path out. You will see what “survival mode” looks like in a MySQL-only world, and what changes when you move to TiDB, an open source distributed SQL database that removes manual database sharding as an architectural requirement.
In short: MySQL sharding can scale write throughput, but it shifts complexity into the application layer: routing, re-sharding, and consistency management. Distributed SQL systems like TiDB remove the need for manual sharding by handling scale-out and transactions inside the database.
Who’s This Playbook For: Storage Leads & SREs
Storage leads own the database platform: capacity planning, cost, MySQL replication strategy, upgrade cadence, and the long-term shape of your database horizontal scaling plan.
SREs own production reliability: SLOs, paging, incident response, and the runbooks that keep sharding “stable enough” to survive another quarter.
If any of these sound familiar, you’re the audience:
- Your MySQL scale out strategy depends on shard-count math and luck.
- Your team debates shard key design more than product design.
- Your best engineers are maintaining routing and re-shard tooling instead of shipping.
- You have at least one table that cannot be re-sharded without downtime risk.
This playbook is designed to make the trade-offs explicit, then hand you a migration flight plan that you can actually execute.
What We’re Solving Today: Manual MySQL Sharding Chaos
Manual sharding is not one problem. It is three problems that compound.
1. Cross-Shard Joins — Data Silos in Disguise
Shards create silos. The first time your product needs to join data across silos, “SQL” stops being your interface. Your interface becomes orchestration code.
What it looks like in practice:
- Reporting becomes: query each shard → merge in app → sort/aggregate → cache → pray.
- Global constraints are “best effort”.
- Latency becomes a function of your slowest shard.
-- Example: cross shard query fan-out pattern (anti-pattern)
-- shard_01.user_orders, shard_02.user_orders, shard_03.user_orders...
SELECT * FROM shard_01.user_orders WHERE user_id = 42
UNION ALL
SELECT * FROM shard_02.user_orders WHERE user_id = 42
UNION ALL
SELECT * FROM shard_03.user_orders WHERE user_id = 42;
This works until shard count grows and every “simple” query becomes a distributed systems project.
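The fan-out above usually ends up living in application code. A minimal sketch of that orchestration, with in-memory lists standing in for per-shard MySQL connections (shard names and row fields are illustrative, not from any real system):

```python
# Illustrative app-side fan-out: query every shard, merge, sort in the app.
# Shards are simulated as in-memory lists; in production each entry would
# be a separate MySQL connection with its own latency and failure modes.

SHARDS = {
    "shard_01": [{"order_id": 7, "user_id": 42, "total": 19.99}],
    "shard_02": [{"order_id": 3, "user_id": 42, "total": 5.00}],
    "shard_03": [],  # this shard happens to hold no rows for user 42
}

def fetch_user_orders(user_id):
    """Fan out to every shard, then stitch and sort app-side."""
    results = []
    for shard_name, rows in SHARDS.items():  # N round trips, one per shard
        results.extend(r for r in rows if r["user_id"] == user_id)
    # The application is now the query planner: merge + sort happen here.
    return sorted(results, key=lambda r: r["order_id"])

orders = fetch_user_orders(42)
```

Every new feature that touches this path inherits the same fan-out, merge, and error-handling burden.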
2. Re-Sharding Events — The Weekend-Killer
Re-sharding is the moment your architecture reveals what it really costs.
When you outgrow a shard boundary, you have to:
- Pick a new sharding function.
- Move data (often terabytes).
- Keep the system consistent during the move.
- Cut traffic over safely.
- Clean up the old world.
Even if you do it “online,” you still pay in operational risk, performance cliffs, and engineering time. This is why re-sharding becomes a cultural event, not a routine task.
3. App-Side Routing — Code You Wish You Didn’t Own
Once you shard, your application becomes the query planner.
You now own:
- Routing rules.
- Shard maps.
- Consistent hashing, lookup tables, or both.
- Hot-spot detection and mitigation.
- “Special cases” for tenants, regions, or high-value customers.
# Example: app-side routing logic (illustrative)
def route(user_id: int) -> str:
    shard = user_id % 64
    return f"mysql-shard-{shard:02d}.prod.internal"
# Now every query path depends on routing being correct forever.
The routing layer is rarely “done.” It is a permanent tax that expands with every new use case.
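Consistent hashing is one of the techniques listed above; it exists precisely to limit how many keys move when the shard fleet changes. A minimal hash-ring sketch with virtual nodes (node names and vnode count are illustrative):

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes. Compared with plain
# modulo routing, adding a node only remaps keys adjacent to the new
# node's positions on the ring, instead of reshuffling almost every key.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self._ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, key):
        h = self._hash(str(key))
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = HashRing(["mysql-shard-00", "mysql-shard-01", "mysql-shard-02"])
```

Even this "better" routing is still app-owned infrastructure you must version, test, and keep consistent with every data move.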
The Math of the Sharding Tax
Manual sharding introduces a “tax” on your infrastructure and engineering velocity that grows much faster than linearly with your shard count.
- The Latency Floor: In a sharded environment, a cross-shard query’s response time is governed by the long tail (the P99 of your slowest shard). If one shard experiences a minor compaction or background task, your entire app-side “UNION ALL” operation stalls.
- The Complexity Equation: A standard join on a single database is O(1) in terms of orchestration complexity. A cross-shard join requires app-side fan-out, data stitching, and memory-intensive sorting, turning your application into a makeshift query engine.
- The “Zombie” Capacity Cost: Because shards are silos, you cannot easily share resources. If Shard A is at 90% capacity and Shard B is at 20%, you are forced to re-shard Shard A even though your total fleet has plenty of “zombie” headroom.
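The latency-floor effect is easy to demonstrate: because the app waits for the slowest shard, a rare per-shard stall becomes a common whole-query stall as shard count grows. A toy simulation (the 2% stall probability and 200 ms stall time are made-up numbers for illustration):

```python
import random

# Toy model of the fan-out latency floor: the app waits for the slowest
# shard, so a rare per-shard stall dominates the whole-query distribution.
random.seed(7)

STALL_MS = 200.0  # e.g. a compaction or checkpoint pause

def shard_latency_ms():
    # ~2% of shard responses hit a background stall; the rest are fast.
    return STALL_MS if random.random() < 0.02 else random.uniform(2.0, 8.0)

def fanout_latency_ms(shard_count):
    # A cross-shard query is only as fast as its slowest shard.
    return max(shard_latency_ms() for _ in range(shard_count))

def stall_rate(shard_count, trials=2000):
    hits = sum(fanout_latency_ms(shard_count) >= STALL_MS for _ in range(trials))
    return hits / trials

# With 64 shards, the chance that *no* shard stalls is 0.98**64 ≈ 27%,
# so roughly three out of four fan-out queries pay the full stall.
rate_4 = stall_rate(4)
rate_64 = stall_rate(64)
```

A 2% per-shard hiccup is invisible on one database and catastrophic across 64.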
DIY Sharding Survival: MySQL-Only Tactics
This section is intentionally blunt. These tactics are common because they are achievable. They are also why manual sharding becomes a long-term trap.
Design Shard Keys — Guess, Check, Pray
Shard keys are permanent until they aren’t, and when they aren’t, you re-shard.
Shard key design usually fails in predictable ways:
- You optimize for current access patterns, then the product evolves.
- One tenant / region / feature creates hot spots.
- Cardinality changes and shards skew.
- Secondary access paths demand cross-shard reads.
Use the following checklist in your shard key design review:
- Does the key distribute writes evenly today?
- Will it distribute writes evenly after 10× growth?
- Do your top 3 queries require joins on non-shard-key columns?
- Can you support new “global” features without fan-out?
- If you must re-shard, can you do it without downtime?
If you cannot answer these confidently, you are not “designing.” You are forecasting under uncertainty.
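One way to make the first two checklist questions less of a guess is to replay a sample of real write keys through each candidate sharding function and measure skew. A minimal sketch (the sample data and hot-tenant scenario are hypothetical):

```python
from collections import Counter

# Replay sample write keys through a candidate sharding function and
# measure skew: the busiest shard's share relative to the ideal 1/N share.

def shard_skew(keys, shard_fn, shard_count):
    counts = Counter(shard_fn(k) % shard_count for k in keys)
    ideal = len(keys) / shard_count
    return max(counts.values()) / ideal  # 1.0 == perfectly even

# Uniform user_ids distribute well under modulo...
uniform_keys = list(range(10_000))
even = shard_skew(uniform_keys, lambda k: k, 16)

# ...but one hot tenant (many writes against a single key) skews badly:
# the same key design that looked fine at launch fails after growth.
hot_keys = uniform_keys + [12345] * 5_000
skewed = shard_skew(hot_keys, lambda k: k, 16)
```

Running this against projected (not just current) traffic is the closest you can get to answering the “after 10× growth” question before committing.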
Build and Maintain a Routing Layer
A routing layer starts small, then becomes a platform:
- Service discovery for shard endpoints.
- Retry and failover behavior.
- Read/write splitting behavior.
- Connection pooling per shard.
- Versioning during re-shard migrations.
Most teams eventually rebuild this multiple times because the first version cannot survive growth.
Hand-Craft Re-Shard Pipelines
Re-sharding pipelines often include:
- Backfill jobs (dump + load).
- Incremental replication streams.
- Dual-write windows.
- Cutover tooling.
- Consistency validation (diffing, sampling, checksum tooling).
If you are using MySQL replication to “bridge” during re-sharding, you also inherit replication lag, replica drift risk, and operational complexity.
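The consistency-validation step in these pipelines is often chunked checksumming: digest each key range on both sides and only re-inspect chunks whose digests differ. A simplified in-memory illustration (real tooling runs this against live databases with proper chunking and retries):

```python
import hashlib

# Chunked checksum validation: compare per-chunk digests of source and
# target, and only inspect rows inside chunks whose digests differ.

def chunk_digest(rows):
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(f"{key}={rows[key]}".encode())
    return h.hexdigest()

def mismatched_chunks(source, target, chunk_size=100):
    keys = sorted(set(source) | set(target))
    bad = []
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        src = {k: source.get(k) for k in chunk}
        dst = {k: target.get(k) for k in chunk}
        if chunk_digest(src) != chunk_digest(dst):
            bad.append((chunk[0], chunk[-1]))  # key range to re-inspect
    return bad

source = {i: f"row-{i}" for i in range(1_000)}
target = dict(source)
target[250] = "row-corrupted"  # one divergent row
del target[876]                # one missing row
```

The point is not that this is hard to write once; it is that you must keep it correct across every schema change and every re-shard, forever.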
Chase Cross-Shard Consistency
Transactions are where sharding gets expensive.
Once operations span shards, you are forced into one of these:
- Application-level two-phase commit (complex and fragile).
- Saga patterns (eventual consistency + compensations).
- “Just don’t do cross-shard writes” (product constraints disguised as architecture).
This is the hidden cost behind “MySQL can scale out.” It can, but not without changing your correctness model.
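As a concrete look at what the saga option costs in code: every step must carry a compensating action, and a mid-sequence failure triggers rollback-by-compensation rather than a database rollback. A minimal sketch (the step names are hypothetical):

```python
# Minimal saga: each cross-shard step carries a compensating action.
# If any step fails, previously completed steps are compensated in
# reverse order -- the app, not the database, now owns atomicity.

def run_saga(steps):
    """steps: list of (action, compensate) pairs executed in order."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(done):  # undo in reverse order
                comp()
            return False
        done.append(compensate)
    return True

log = []

def fail_step():
    raise RuntimeError("shard C unavailable")

steps = [
    (lambda: log.append("debit shard A"), lambda: log.append("undo debit")),
    (lambda: log.append("credit shard B"), lambda: log.append("undo credit")),
    (fail_step, lambda: log.append("never runs")),
]
ok = run_saga(steps)
```

Note what this buys you: eventual consistency at best, plus a new failure mode (a compensation that itself fails) that a single ACID transaction simply never has.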
Make it Easier on Yourself with TiDB’s Anti-Sharding Architecture
TiDB is built to remove manual sharding from your application design. Instead of building a sharded MySQL cluster and pushing complexity into the app, TiDB provides a distributed SQL layer that handles scale-out and consistency inside the database.
Distributed SQL That Speaks MySQL
TiDB is MySQL compatible at the protocol and syntax level, which means most applications can migrate without rewriting their query layer.
What changes immediately:
- You keep SQL as your interface (not orchestration code).
- Cross-table joins remain joins.
- You reduce app-side routing and shard map dependencies.
-- Same intent, no cross-shard query fan-out:
SELECT o.*
FROM user_orders o
JOIN users u ON u.id = o.user_id
WHERE u.id = 42;
Online Auto-Scaling—No Re-Shard Required
With manual shards, “scale” means “create more shards, then move data.”
With TiDB, “scale” means adding capacity and letting the system rebalance, instead of initiating a re-sharding event. This is the core escape hatch from re-sharding downtime and weekend cutovers.
Global ACID Transactions Across Every Shard
Manual sharding forces you to choose between correctness and complexity.
TiDB keeps ACID semantics while distributing data across nodes, so cross-range updates do not automatically become app-managed sagas.
That shift alone often eliminates entire classes of correctness bugs that only exist because your database stopped being a single transactional system.
How TiDB Eliminates Manual Sharding
TiDB achieves “unsharded” scale by decoupling the SQL processing layer from the storage layer.
- Automatic Region Sharding: Unlike MySQL, where you manually define shard keys, TiDB automatically splits data into small, manageable “Regions” (configurable, 96 MiB by default).
- Dynamic Rebalancing: The Placement Driver (PD) acts as the “brain,” constantly monitoring load. If one node becomes a hotspot, TiDB automatically moves Regions to quieter nodes in the background—no manual re-sharding or weekend cutovers required.
- Raft-Based Consistency: TiDB uses the Raft consensus algorithm to ensure that every write is synchronously replicated, providing global ACID transactions across the entire cluster without the fragility of app-level two-phase commits.
Tooling for the Transition
| Tool | Function | Best For |
| --- | --- | --- |
| Dumpling | Consistent export from MySQL. | Snapshot each shard for initial load. |
| IMPORT INTO (TiDB) | Distributed bulk import (physical). | Fast initial load from CSV/SQL/Parquet (often replaces Lightning). |
| TiDB Lightning (Optional) | Bulk import tool (physical). | When you prefer the Lightning workflow/existing runbooks. |
| TiDB Data Migration (DM) | Binlog replication + shard merge. | Near-zero-downtime cutover into unified TiDB tables. |
Step-by-Step Migration Guide
This is a practical flight plan for moving from MySQL shards into TiDB with clear validation gates.
Map Your Shard Topology
Before executing your flight plan, evaluate your current MySQL environment against these critical markers.
| Category | Sharding Pain Signal | What TiDB Changes |
| --- | --- | --- |
| Schema | Schemas drift across shards. | Single logical schema. |
| Queries | UNION ALL/app-side routing. | Native distributed joins. |
| Scaling | Re-sharding causes downtime. | Online horizontal scaling. |
| Data Integrity | App-enforced consistency. | Built-in global transactions. |
| Ops | Engineers manage shards. | Engineers ship features. |
Move Data with Dumpling, IMPORT INTO, and TiDB Lightning
For large shard datasets, TiDB docs outline a common approach:
- Export each shard with Dumpling.
- Import into TiDB using IMPORT INTO (Physical Import Mode) or TiDB Lightning.
- Use DM to replicate incremental changes from multiple shards into TiDB (merge mode where needed).
# Export (illustrative)
dumpling -u root -p '***' -h mysql-shard-01 \
--output /data/export/shard01 \
--filetype sql
-- Import with IMPORT INTO (illustrative)
IMPORT INTO target_table
FROM 's3://bucket/prefix/*.csv'
WITH THREAD=16;
# Import with TiDB Lightning (illustrative)
tidb-lightning -config lightning.toml
If your migration includes shard-merge behavior (multiple upstream tables into one downstream table), DM has explicit shard-merge support and best practices, including guidance on task modes and limitations.
Cutover, Validate, Optimize
Cutover is where migrations succeed or fail. Keep it boring.
Common, low-risk cutover patterns include:
- Backfill historical data.
- Start incremental replication to keep TiDB current.
- Run validation checks continuously.
- Schedule a short write freeze.
- Drain writes from shards, let replication catch up.
- Flip application connections to TiDB.
- Monitor, then decommission shard infra in phases.
Validation gates (examples):
-- Estimated row counts (stats-based). Use for rough sizing, not correctness.
SELECT table_schema, table_name, table_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema')
ORDER BY table_rows DESC
LIMIT 20;
-- Top latency statements (post-cutover hotspot hunt);
-- sum_latency is reported in nanoseconds
SELECT digest_text, sum_latency/1000000 AS ms, exec_count
FROM information_schema.statements_summary
ORDER BY ms DESC
LIMIT 20;
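Checks like these can be wired into a simple go/no-go gate that compares per-table counts from both sides before the flip. A sketch with the fetchers stubbed out as plain dicts (in practice they would run COUNT(*) or checksum queries against the summed shards and the merged TiDB tables):

```python
# Go/no-go validation gate: compare per-table row counts from the source
# shards (summed) against the unified TiDB tables. The inputs here are
# stub dicts; in practice they come from COUNT(*) or checksum queries.

def gate(source_counts, target_counts, tolerance=0):
    failures = []
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table, 0)
        dst = target_counts.get(table, 0)
        if abs(src - dst) > tolerance:
            failures.append((table, src, dst))
    return failures

# Summed across shards vs. the merged TiDB table (illustrative numbers).
source = {"user_orders": 1_000_000, "users": 250_000}
target = {"user_orders": 1_000_000, "users": 249_998}
failures = gate(source, target)
```

Set `tolerance=0` for the final gate after the write freeze; a nonzero tolerance is only defensible while replication is still catching up.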
For correctness validation (recommended), you can run sync-diff-inspector between the source MySQL database and the target TiDB database (row + checksum by chunk).
Operational Best Practices
This section turns your new architecture into stable day-two operations.
Observe & Alert with TiDB Dashboard
Treat observability as part of the platform contract:
- Hotspot visibility (regions / stores / skew)
- Top SQL by latency and contention
- Capacity trend monitoring so scale-out is proactive, not reactive
Use a consistent weekly cadence: “top offenders,” “new hotspots,” “growth trend,” “cost guardrails.”
Backup & Restore—Shards Optional
Sharded backups are easy to start and hard to trust:
- Different shard snapshots at different times.
- Restore complexity grows with shard count.
- Disaster recovery becomes a choreography problem.
Move toward a restore posture that is tested, repeatable, and operationally boring.
# Illustrative BR full backup
tiup br backup full \
--pd "${PD_IP}:2379" \
--storage "s3://bucket/prefix"
Elastic Scale & Cost Guardrails
Manual sharding hides cost in two places:
- Idle capacity in each shard (fragmented headroom).
- Engineering time spent maintaining the sharding ecosystem.
Set explicit guardrails:
- Scale triggers (CPU, storage, hotspot thresholds).
- Budget policies (when scale-out is allowed, when it is reviewed).
- Performance SLO checks before and after capacity changes.
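Guardrails like these can be encoded as an explicit policy check that runs before any scale-out is approved. An illustrative sketch (every threshold below is a placeholder, not a recommendation):

```python
# Illustrative scale-out guardrail: capacity changes are proposed by
# metrics but gated by explicit policy thresholds. All values below
# are placeholders to show the shape of the policy, not recommendations.

THRESHOLDS = {"cpu_pct": 70, "storage_pct": 75, "hotspot_regions": 50}
REQUIRES_REVIEW_ABOVE_NODES = 3  # adding more than this needs sign-off

def evaluate(metrics, proposed_nodes):
    triggers = [k for k, limit in THRESHOLDS.items() if metrics.get(k, 0) >= limit]
    if not triggers:
        return "no-op"         # nothing breached; don't scale
    if proposed_nodes > REQUIRES_REVIEW_ABOVE_NODES:
        return "needs-review"  # large change; budget policy kicks in
    return "auto-approve"      # breach confirmed, change is small

decision = evaluate({"cpu_pct": 82, "storage_pct": 40}, proposed_nodes=2)
```

The value is not the code; it is that the policy is written down and testable, instead of living in a scale-out decision someone makes at 2 a.m.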
Proof in Production
Real systems, real stakes. Here’s how two teams used TiDB to break out of MySQL’s scale and operational limits and keep performance predictable as workloads grew.
Flipkart: 1M+ QPS with Zero-Downtime Maintenance
Flipkart ran 900+ MySQL clusters and still hit classic scale pain: failover delays, async replication risk, a 3.5 TB storage ceiling that forced sharding, and limited cloud-native/Kubernetes support.
TiDB gave them a distributed SQL foundation they could operate on Kubernetes and maintain without user-facing downtime.
Outcomes
- Sustained 1M+ QPS with P99 read latency ~7.4 ms and 120K writes/sec (~13 ms) in internal benchmarks.
- Achieved zero downtime scheduled maintenance via automated node draining and rebalancing workflows.
- Reduced alert noise by restructuring and suppressing redundant signals, improving response efficiency.
Bolt: Modernizing MySQL to Scale Thousands of Microservices on AWS
Bolt’s MySQL stack became a bottleneck for a fast-growing microservices footprint. Schema changes were painfully slow (adding a column to a 1 TB table could take up to a week). Even worse, multi-master clustering (Galera) still left gaps in durability and operational confidence.
Bolt needed horizontal scaling, strong consistency, and simpler HA and DR without redesigning every service.
Outcomes
- Delivered up to 10× query performance improvement, with previously timed-out queries completing in seconds.
- Supported ~7K package information updates per second.
- Scaled production usage to seven TiDB clusters (including a downstream DR cluster) and processed ~35K QPS on TiDB for critical systems like ordering, commerce, and payments.
Next Steps
If you’re ready to stop debating sharding in theory and start validating an exit plan with real workload data, these next steps will get you to a clear go/no-go decision quickly.
Try TiDB Cloud
If you want to see the impact fast without committing to a migration plan, start here. Signing up for a free trial of TiDB Cloud turns “it should work” into proof, because you can run your real workload patterns against a single SQL endpoint.
Book a Data-Model Workshop
If you need alignment and a concrete plan before you run tests, start here. This working session maps your current shard reality into a phased migration sequence with clear validation gates and a clean decommission path.