
This article is based on a technical presentation by the Flipkart engineering team at TiDB User Day India 2025.
Flipkart is one of India’s largest e-commerce platforms, serving millions of users across a complex ecosystem. Its operations span the full customer journey, from product discovery to order placement, and finally to delivery. This includes three primary flows:
- Pre-Order Path: Product discovery on the website or app, where users browse, filter, and select products.
- Order Path: Checkout, cart operations, payments, and accounting systems.
- Post-Order Path: Fulfilment logistics, warehouse management, last-mile delivery, and supply chain operations.

Slide showing Flipkart’s scale in India
Supporting such a wide-ranging journey requires a highly available, scalable, and efficient data infrastructure, especially when deployed at Flipkart’s scale, which includes both cloud and on-premise environments.
Before 2021, the backbone of this infrastructure was predominantly MySQL, with some Vitess deployments. However, the limitations of this architecture became increasingly evident as Flipkart scaled. These limitations set the stage for the evaluation and eventual adoption of TiDB, a distributed SQL database system.
The Limitations of MySQL at Flipkart

Flipkart Senior Engineering Manager, Vaidyanathan S delivering his talk at TiDB User Day India 2025
Flipkart used a combination of 900 standalone MySQL clusters (often with multiple replicas) and sharded MySQL configurations. In some instances, they also relied on Vitess to handle sharding. But as demand grew, this architecture began to reveal significant flaws:
- Failover and Availability: On-prem infrastructure is more prone to failure by nature. Flipkart’s MySQL deployments used primary-secondary setups, and primary node failures caused short but unavoidable service outages. These outages were not acceptable given the company’s transaction volume and service guarantees.
- Replication and Data Loss Risk: Most MySQL replication was configured asynchronously. If a primary failed before replication completed, any in-flight data could be lost. This was a critical risk, especially for financial and logistics data.
- Vertical Scaling Limits: Virtual machines in Flipkart’s data centers could support up to approximately 3–3.5 TB of storage. This placed a hard ceiling on how much data a single MySQL instance could handle. To work around this, Flipkart had to shard databases, which increased operational complexity and impacted application design.
- Operational Overhead of Sharding: Sharding MySQL means distributing data across multiple instances. This complicates application logic, monitoring, backup, and recovery. As new applications were added or existing ones scaled, the overhead grew rapidly.
- Lack of Cloud-Native Capabilities: MySQL was not inherently designed to work with cloud-native or containerized environments. It did not support seamless operation on Kubernetes, and lacked automation-friendly features needed for flexible deployments.
Transitioning to TiDB
Flipkart explored TiDB primarily because it offered solutions to the major pain points posed by MySQL:
- It is a distributed SQL database capable of scaling horizontally.
- It is MySQL-compatible, which meant that most application-level SQL logic could be reused with minimal change (see the connection sketch after this list).
- TiDB has built-in high availability mechanisms via the Raft consensus algorithm, eliminating the need for manual failover strategies.
- It is cloud-native, designed to run on Kubernetes using an operator-based model.
- TiDB consumes more disk space, but its built-in compression offsets the increase to some extent.
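Because TiDB speaks the MySQL wire protocol, pointing an existing application at it is largely a matter of changing the connection endpoint. The minimal sketch below uses the standard go-sql-driver/mysql driver against a hypothetical TiDB endpoint (`tidb.example.internal`, default TiDB port 4000); credentials, database name, and host are placeholders, not Flipkart’s actual setup.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	// The standard MySQL driver works against TiDB because TiDB
	// implements the MySQL wire protocol.
	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Hypothetical DSN: only the host and port change when pointing an
	// existing MySQL application at a TiDB cluster (TiDB listens on
	// port 4000 by default).
	dsn := "app_user:app_password@tcp(tidb.example.internal:4000)/orders?parseTime=true"

	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatalf("open connection: %v", err)
	}
	defer db.Close()

	// Verify connectivity and print the server version TiDB reports.
	var version string
	if err := db.QueryRow("SELECT VERSION()").Scan(&version); err != nil {
		log.Fatalf("query version: %v", err)
	}
	fmt.Println("connected to:", version)
}
```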
Initial Adoption Challenges
While TiDB offered numerous advantages, internal adoption was not straightforward. There were misconceptions that TiDB would work identically to MySQL and that applications could simply switch database endpoints without any impact. In practice, some queries needed to be rewritten or optimised to perform well on TiDB. This required teams to learn new debugging and optimisation approaches.
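One habit that helps with that learning curve is inspecting TiDB’s actual execution plans rather than assuming MySQL behaviour. The sketch below, reusing the same hypothetical connection details and an assumed `orders` table, runs `EXPLAIN ANALYZE` on a query and prints the plan rows so operators, row estimates, and execution times can be compared before and after a rewrite.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Hypothetical connection details and table; substitute real ones.
	db, err := sql.Open("mysql", "app_user:app_password@tcp(tidb.example.internal:4000)/orders")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// EXPLAIN ANALYZE executes the statement and reports the actual
	// plan, so costly operators (e.g., full table scans) stand out.
	rows, err := db.Query("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(strings.Join(cols, " | "))

	// Scan each plan row generically, since the EXPLAIN output has
	// several descriptive columns.
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		parts := make([]string, len(vals))
		for i, v := range vals {
			parts[i] = string(v)
		}
		fmt.Println(strings.Join(parts, " | "))
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```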
To address these concerns, Flipkart conducted internal benchmark testing. These tests demonstrated that TiDB could sustain over 1 million QPS with 7.4 ms P99 latency and 120K writes per second at 13 ms. These benchmarks helped establish confidence in TiDB.
Deployment Strategy and Architecture
Flipkart initially deployed TiDB on virtual machines. However, managing the many components of TiDB (TiKV for row storage, PD as the metadata orchestrator, and the TiDB server nodes) on VMs proved difficult. Recovery processes were manual and fragile, and scaling was inefficient. As a result, the team transitioned to Kubernetes-based deployments, using the TiDB Operator to manage lifecycle events like creation, scaling, and recovery.
With Kubernetes:
- Each tenant or business unit could get its own isolated TiDB cluster, allowing for better fault isolation and workload control.
- A self-service platform was built internally so that developers and teams could spin up TiDB clusters as needed, without requiring direct infrastructure team involvement.
This shift significantly improved deployment agility and laid the foundation for robust scaling.
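The self-service platform itself is not described in detail in the talk, but conceptually it comes down to templating a per-tenant TidbCluster custom resource (the TiDB Operator’s CRD) and applying it to that tenant’s namespace. The sketch below renders such a manifest in Go; the top-level structure follows the TidbCluster API (`pingcap.com/v1alpha1`), while the tenant name, version, replica counts, and naming convention are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// renderTidbCluster builds a minimal TidbCluster manifest for one tenant.
// The apiVersion/kind and the pd/tikv/tidb replica fields follow the TiDB
// Operator's TidbCluster CRD; sizing and naming here are illustrative.
func renderTidbCluster(tenant string, pdReplicas, tikvReplicas, tidbReplicas int) ([]byte, error) {
	manifest := map[string]interface{}{
		"apiVersion": "pingcap.com/v1alpha1",
		"kind":       "TidbCluster",
		"metadata": map[string]interface{}{
			"name":      tenant + "-tidb", // hypothetical naming convention
			"namespace": tenant,           // one namespace per tenant for isolation
		},
		"spec": map[string]interface{}{
			"version": "v7.5.0", // assumed TiDB version
			"pd":      map[string]interface{}{"replicas": pdReplicas},
			"tikv":    map[string]interface{}{"replicas": tikvReplicas},
			"tidb":    map[string]interface{}{"replicas": tidbReplicas},
		},
	}
	return json.MarshalIndent(manifest, "", "  ")
}

func main() {
	// A self-service request for a hypothetical "payments" tenant.
	out, err := renderTidbCluster("payments", 3, 3, 2)
	if err != nil {
		panic(err)
	}
	// In a real platform this manifest would be applied through the
	// Kubernetes API; here we only print it.
	fmt.Println(string(out))
}
```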
Monitoring and Alerting Strategy
As Flipkart’s TiDB footprint expanded, so did the volume of alerts. Alerts were triggered at multiple levels:
- Kubernetes control loops (e.g., node or pod failures)
- TiDB Operator (e.g., replication or service issues)
- Management plane or monitoring tools
- Manual checks or escalations
This resulted in an overload of alerts, many redundant or unactionable. The team found it difficult to distinguish between critical and non-critical events.
Simplifying and Structuring Alerting
To handle this:
- Alerts that Kubernetes or the TiDB Operator was already handling were suppressed to avoid duplication.
- All alerts were categorised into two buckets:
  - Platform-Level Alerts: Infrastructure or deployment-related issues (e.g., node failure, disk issues)
  - Client-Level Alerts: Application-specific alerts or service impacts (e.g., query errors)
This classification made it easier for the right teams to take action and significantly reduced alert fatigue.
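The exact alert pipeline is internal to Flipkart, but the classification idea can be sketched as a small routing function: suppress alerts that an automated controller is already remediating, then bucket the rest as platform-level or client-level based on their labels. The label keys and values below are purely illustrative.

```go
package main

import "fmt"

// Alert is a simplified representation of an incoming alert; a real
// system would carry Prometheus-style labels and annotations.
type Alert struct {
	Name   string
	Labels map[string]string
}

type Bucket string

const (
	Platform   Bucket = "platform"   // infrastructure or deployment issues
	Client     Bucket = "client"     // application-visible impact
	Suppressed Bucket = "suppressed" // already handled by automation
)

// classify routes an alert into one of the two buckets, suppressing
// alerts that Kubernetes or the TiDB Operator is already handling.
// The label keys ("self_healing", "layer") are hypothetical.
func classify(a Alert) Bucket {
	if a.Labels["self_healing"] == "true" {
		return Suppressed
	}
	switch a.Labels["layer"] {
	case "node", "disk", "kubernetes", "operator":
		return Platform
	default:
		return Client
	}
}

func main() {
	alerts := []Alert{
		{Name: "NodeDiskPressure", Labels: map[string]string{"layer": "disk"}},
		{Name: "PodRestarted", Labels: map[string]string{"layer": "kubernetes", "self_healing": "true"}},
		{Name: "QueryErrorRateHigh", Labels: map[string]string{"layer": "application"}},
	}
	for _, a := range alerts {
		fmt.Printf("%-20s -> %s\n", a.Name, classify(a))
	}
}
```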
Debugging and Internal Enablement
Adoption of TiDB exposed a much larger set of metrics and diagnostics than MySQL. Many engineers were unfamiliar with TiDB dashboards and didn’t know how to interpret the vast amount of telemetry data the system provided.
To overcome this, Flipkart:
- Ran internal training workshops, in collaboration with PingCAP, to teach teams how to read and troubleshoot using TiDB’s monitoring tools.
- Developed internal AI-powered chatbots capable of answering common operational questions, reducing reliance on human escalations.
- Introduced teams to TiDB.AI, an AI-powered support tool developed by PingCAP that could answer technical questions and suggest debugging steps based on TiDB logs and metrics.
This multifaceted enablement strategy improved confidence and reduced the load on central platform teams.
Recovery and Maintenance at Scale
Operating in on-premise data centers meant that Flipkart was more exposed to hardware failures and disruptions such as:
- Disk corruption
- Node-level crashes
- Planned downtime due to upgrades or hardware replacement
These events required cluster rebalancing, which could degrade performance if not managed carefully. Initially, this was handled manually, and it often had to be scheduled late at night to avoid affecting production traffic.
The Flipkart Operator and Automated Maintenance
To streamline maintenance, Flipkart built a custom Kubernetes operator that worked alongside the TiDB Operator. This operator was integrated with data center infrastructure systems via APIs. When the infra team scheduled maintenance for a node (e.g., taking it offline for upgrades), they would notify the operator using a Custom Resource Definition (CRD).
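Flipkart’s internal CRD is not public, but such a maintenance request can be pictured as a small custom resource carrying the affected node, the reason, and the maintenance window. The Go types below are an illustrative guess at that shape; the group/version, kind, and field names are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// NodeMaintenanceSpec sketches what the infra team's maintenance request
// might carry. All field names here are hypothetical.
type NodeMaintenanceSpec struct {
	NodeName      string    `json:"nodeName"`      // node scheduled for downtime
	Reason        string    `json:"reason"`        // e.g. kernel upgrade, disk swap
	NotBefore     time.Time `json:"notBefore"`     // start of the low-traffic window
	NotAfter      time.Time `json:"notAfter"`      // end of the window
	TargetCluster string    `json:"targetCluster"` // TiDB cluster affected
}

// NodeMaintenance mirrors the usual Kubernetes custom-resource layout.
type NodeMaintenance struct {
	APIVersion string              `json:"apiVersion"`
	Kind       string              `json:"kind"`
	Metadata   map[string]string   `json:"metadata"`
	Spec       NodeMaintenanceSpec `json:"spec"`
}

func main() {
	req := NodeMaintenance{
		APIVersion: "ops.flipkart.example/v1alpha1", // hypothetical group/version
		Kind:       "NodeMaintenance",
		Metadata:   map[string]string{"name": "node-42-kernel-upgrade"},
		Spec: NodeMaintenanceSpec{
			NodeName:      "dc1-node-42",
			Reason:        "kernel upgrade",
			NotBefore:     time.Date(2025, 6, 1, 2, 0, 0, 0, time.UTC),
			NotAfter:      time.Date(2025, 6, 1, 3, 0, 0, 0, time.UTC),
			TargetCluster: "payments-tidb",
		},
	}
	out, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(out))
}
```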

Flipkart Senior Software Engineer, Reeshabh Raj, presenting the demo of the Flipkart Operator at TiDB User Day India 2025
The operator would then:
- Scale up the TiDB cluster by adding new row storage (TiKV) nodes to handle load redistribution.
- Rebalance regions and data away from the node scheduled for downtime.
- Drain and delete the affected node.
- Recreate the pod and persistent volume on a new node.
- Rebalance data again so that the new node resumes its role.
- Scale down the temporary capacity added during the maintenance.
This process was fully automated and scheduled to run during low-traffic periods, typically between 2 AM and 3 AM. The system ensured that no user-facing downtime occurred, even during major infrastructure maintenance.
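The workflow can be summarised as an ordered pipeline of steps. The sketch below encodes that sequence in Go, stubbing out the Kubernetes and TiDB Operator interactions (scaling, region rebalancing, draining) with log output, since those depend on Flipkart’s internal APIs; treat it as an outline of the automation rather than working operator code.

```go
package main

import (
	"fmt"
	"log"
)

// step is one stage of the automated maintenance workflow. A real
// implementation would call the Kubernetes API and the TiDB Operator;
// here each stage only reports what it would do.
type step struct {
	name string
	run  func(node string) error
}

func main() {
	node := "dc1-node-42" // hypothetical node scheduled for downtime

	// The sequence mirrors the steps described above.
	workflow := []step{
		{"scale up temporary TiKV capacity", func(n string) error {
			fmt.Println("adding extra row-storage (TiKV) nodes before touching", n)
			return nil
		}},
		{"rebalance regions off the node", func(n string) error {
			fmt.Println("moving region replicas and leaders away from", n)
			return nil
		}},
		{"drain and delete the node", func(n string) error {
			fmt.Println("cordoning, draining, and removing", n)
			return nil
		}},
		{"recreate pod and volume elsewhere", func(n string) error {
			fmt.Println("scheduling replacement pod and persistent volume for", n)
			return nil
		}},
		{"rebalance data onto the replacement", func(n string) error {
			fmt.Println("restoring the replacement node's share of regions")
			return nil
		}},
		{"scale down temporary capacity", func(n string) error {
			fmt.Println("removing the extra TiKV nodes added for the window")
			return nil
		}},
	}

	for _, s := range workflow {
		log.Printf("step: %s", s.name)
		if err := s.run(node); err != nil {
			// In a real operator a failed step would surface on the CRD
			// status and halt the workflow for human follow-up.
			log.Fatalf("step %q failed: %v", s.name, err)
		}
	}
	log.Println("maintenance workflow completed")
}
```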
Measured Outcomes
- Scalability: TiDB demonstrated the ability to process over 1 million queries per second, with write throughput reaching 120K writes per second.
- Low Latency: P99 latency was measured at 7.4 milliseconds for reads and 13 milliseconds for writes.
- Reduced Alert Noise: Classification and suppression strategies significantly reduced alert volume, improving the signal-to-noise ratio.
- Zero Downtime Maintenance: With the custom operator in place, scheduled maintenance could be performed without impacting services.
- Improved Internal Adoption: Benchmarking results, training, and internal automation tools led to broader acceptance of TiDB among Flipkart engineering teams.
Flipkart’s move from MySQL to TiDB was a significant architectural evolution driven by the need for scale, availability, and operational simplicity. While the transition required effort, especially in internal enablement, alert management, and maintenance automation, it ultimately empowered Flipkart to run a modern, scalable SQL platform tailored for both on-prem and cloud environments.
The combination of Kubernetes-native tooling, distributed architecture, and self-service capabilities has enabled Flipkart to meet its infrastructure demands at scale while improving developer experience and operational resilience.