Flipkart is India’s largest e-commerce company, with Walmart, the global retail giant, owning a majority controlling stake in the business. It has more than 400 million registered sellers and buyers, gets over 10 million daily page visits, sends over 8 million monthly shipments, and owns 22 state-of-the-art warehouses in India. During “The Big Billion Days 2022,” a large shopping festival equivalent to Black Friday in the US, Flipkart served billions of customer visits and secured billions of dollars of gross merchandise value (GMV).
A large MySQL fleet behind Flipkart
With over 400 applications, thousands of microservices, and countless varieties of end-to-end e-commerce operations running across over 700 MySQL clusters, Flipkart runs one of India’s largest MySQL fleets.
Massive numbers of applications and services generate petabytes of data in many types and formats. Flipkart’s tech stack (made up of MySQL, Redis, and Aerospike) grew ever more complex as it processed and stored this volume and variety of data. As the business kept growing, the previous database solutions started to hit their limits, and Flipkart urgently needed new alternatives to meet its growing business needs.
Major challenges with MySQL
Flipkart faced three major challenges with its MySQL solution: scalability, reliability, and efficiency.
As Flipkart’s business rapidly grew with continuously spiking data size and throughput, its traditional transactional data stores reached their limits. Scalability became a major roadblock. Flipkart had to consider alternative architectures to handle the increasing data volume and to support the continued success of its business.
Flipkart had stringent requirements for system reliability, especially during its shopping festivals, which typically generate billions of dollars in GMV. Any critical service outage could cause significant losses. Flipkart urgently needed a highly available database that could tolerate various types of unexpected failures, such as individual node failures, rack failures, and even regional-level failures, and fail over quickly and automatically.
Flipkart has a massive number of data stores, which results in complexity in operation, maintenance, and management; challenges in data accessibility and consistency; and skyrocketing costs. Every bit of efficiency matters. Flipkart needed a new database solution to simplify the whole storage system architecture, ensure easy operation and maintenance, and minimize overall costs.
How TiDB helps Flipkart succeed: Coin Manager as an example
After thorough evaluations and testing, TiDB beat out several other vendors to win the trust of Flipkart’s team. In early 2021, they adopted TiDB in their production environments. As of now, TiDB supports 15 applications and stores and processes more than 90 terabytes of data. The largest TiDB cluster has over 40 pods holding more than 30 terabytes of data.
Across the 15 use cases of TiDB inside Flipkart, MySQL was the major legacy system, so these use cases share similar pain points and characteristics. To explore in detail why Flipkart chose TiDB and how TiDB helps Flipkart succeed, we’ll take a deep dive into one example: Coin Manager.
Coin Manager: Flipkart’s rewards services platform
Coin Manager is Flipkart’s rewards services platform. It manages customers’ SuperCoins, which are points awarded to customers when they shop on Flipkart or complete given tasks. SuperCoins can be used as digital currency to shop on Flipkart or other partnered platforms. When customers make a payment, they can pay with SuperCoins or via their banking system. So, Coin Manager also acts as a payment processor for SuperCoins.
Previous setup with sharded MySQL
Previously, the Coin Manager team used sharded MySQL as its database solution. Coin Manager ran in two regions. Region-1 served both read and write workloads, while Region-2 served as a failover region and could also handle some read workloads.
Topology of the previous sharded MySQL solution
The team employed 5-way consistent hashing to manage the sharding. This resulted in a complex setup with five shards, each with a master, a hot standby, and a read replica in the active region, plus a read replica and a hot standby in the passive region.
Maintaining the two clusters was a constant challenge because the team had to separately monitor the health of 25 nodes and 20 replication channels. In addition, a coordinated failover of the five shards would be required in the event of a disaster. As the data size kept growing, the team needed to add more shards. Resharding the data was even more painful.
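For context, shard routing with N-way consistent hashing like the setup above can be sketched as follows. This is a generic illustration, not Flipkart’s actual routing code; the shard names and the virtual-node count are assumptions:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable 64-bit hash of a key (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps customer keys onto a fixed set of shards via a hash ring.

    Virtual nodes smooth the distribution so each shard owns many
    small arcs of the ring instead of one large one.
    """
    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (_hash(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise of the key's hash; wrap around at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(5)])
print(ring.shard_for("customer:12345"))  # deterministic: always the same shard
```

The appeal of consistent hashing is that adding a sixth shard only moves roughly one sixth of the keys, but as the team found, the operational burden of the surrounding replication topology remains.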
Requirements for the new database
To tackle all these pain points, the Coin Manager team needed a new database solution and had the following requirements.
A SQL data model with ACID properties
A SQL data model ensures the data is organized, easy to manage, and easy to query. It also enables the auditability and traceability of all SuperCoin-related transactions. ACID properties guarantee that all SuperCoin-related transactions are performed atomically, consistently, in isolation, and durably. This ensures that customers’ SuperCoin data is always accurate and up to date.
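As an illustration of the atomicity such a workload relies on, here is a minimal sketch of a SuperCoin debit inside a single transaction. It uses Python’s built-in sqlite3 as a stand-in for MySQL/TiDB, and the table and column names are hypothetical, not Flipkart’s actual schema:

```python
import sqlite3

# In-memory SQLite stands in for MySQL/TiDB here; coin_balances,
# customer_id, and balance are illustrative names only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coin_balances ("
    "  customer_id TEXT PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO coin_balances VALUES ('alice', 100)")

def spend_coins(conn, customer_id, amount):
    """Atomically deduct SuperCoins; roll back if the balance would go negative."""
    try:
        with conn:  # BEGIN ... COMMIT on success, ROLLBACK on exception
            conn.execute(
                "UPDATE coin_balances SET balance = balance - ? "
                "WHERE customer_id = ?",
                (amount, customer_id),
            )
    except sqlite3.IntegrityError:
        return False  # CHECK constraint fired: the transaction rolled back
    return True

print(spend_coins(conn, "alice", 30))   # True: balance is now 70
print(spend_coins(conn, "alice", 999))  # False: rolled back, balance stays 70
```

The point of the sketch is the failure path: because the debit runs inside a transaction, an overdraft attempt leaves the balance exactly as it was, which is what makes the data auditable.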
High availability
Millions of customers rely on SuperCoins for their purchases. The database must be highly available to keep the Coin Manager system always-on even in the face of disasters such as single-node and regional-level failures. Even a brief period of downtime could result in significant customer dissatisfaction.
Horizontal scalability
With millions of customers earning and spending SuperCoins, the amount of data generated has reached three terabytes, and it is growing at a rate of 300 GB per month. The new database must be horizontally scalable to accommodate this rapidly increasing data without disrupting services.
Withstand high throughputs with low latency
The new database must be able to handle large volumes of read and write requests while providing quick response times to ensure a seamless customer experience. It must withstand a throughput of at least 12,500 writes per second and 5,000 reads per second, with write latencies under 100 ms and read latencies under 20 ms.
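As a rough illustration of how such latency targets might be checked against measured samples, here is a minimal nearest-rank percentile sketch. The sample latencies below are made up, and evaluating at p99 is an assumption; the source only states the thresholds:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# SLO thresholds from the requirements above.
WRITE_SLO_MS, READ_SLO_MS = 100, 20

# Hypothetical measured samples (ms), not real Flipkart numbers.
write_ms = [8, 12, 15, 40, 95, 9, 11, 22, 30, 18]
read_ms = [2, 3, 5, 4, 19, 6, 2, 8, 3, 7]

print(percentile(write_ms, 99) < WRITE_SLO_MS)  # True: writes within SLO
print(percentile(read_ms, 99) < READ_SLO_MS)    # True: reads within SLO
```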
Current setup with TiDB
Topology with TiDB
In early 2021, the Coin Manager team switched from their previous sharded MySQL solution to TiDB, which greatly improved reliability and simplified operations. Coin Manager still runs in two regions, with the active side serving both reads and writes and the passive side serving only reads. The underlying database structure is now much simpler. Two TiDB clusters run in a Kubernetes cluster, each consisting of three Placement Driver (PD) nodes, 10 TiDB nodes, 33 TiKV nodes, and 11 TiFlash nodes. When data volume rises, TiDB can automatically add nodes to scale out and handle the growing workloads.
Compared to the previous sharded MySQL, the new setup with TiDB uses more nodes but is much simpler. Moreover, the system is more resilient with TiDB, because the combination of TiDB and Kubernetes enables stateful sets to be self-healing in the event of disasters. The use of TiDB Operator, an automatic operation system for TiDB clusters in Kubernetes, also makes operation and maintenance much easier.
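For readers curious what such a deployment looks like, a TidbCluster custom resource for TiDB Operator matching the node counts above might resemble the sketch below. The cluster name, version, images, and storage sizes are assumptions, not Flipkart’s actual manifest:

```yaml
# Illustrative only: a TidbCluster custom resource for TiDB Operator
# with the component counts described above.
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: coin-manager
spec:
  version: v6.5.0
  pd:
    baseImage: pingcap/pd
    replicas: 3
    requests:
      storage: "10Gi"
  tidb:
    baseImage: pingcap/tidb
    replicas: 10
    service:
      type: ClusterIP
  tikv:
    baseImage: pingcap/tikv
    replicas: 33
    requests:
      storage: "1Ti"
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 11
    storageClaims:
      - resources:
          requests:
            storage: "1Ti"
```

With a manifest like this, scaling out is a matter of editing the `replicas` fields and letting the operator reconcile, rather than hand-building new shards and replication channels.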
Key benefits TiDB brings to Flipkart
Simplified applications with SQL and ACID
With TiDB, Flipkart was able to simplify its applications by retaining their SQL data model while still guaranteeing ACID properties, which is rarely possible when switching from a SQL data model to most NoSQL data models. In addition, its applications no longer need to implement any sharding logic, and all the data can be stored in one database.
Easy operation and maintenance
With TiDB, Flipkart no longer needs to manage multiple shards, nodes, and replication channels. Instead, the team just manages two TiDB clusters and one replication channel across them. The combination of Kubernetes and TiDB Operator makes this management process even easier.
High availability with fast failover and no single point of failure
TiDB is highly available. In the two years since TiDB was deployed in Flipkart’s production environments, there have been no single points of failure and no system downtime. Even in incidents where some nodes died, TiDB achieved fast, automatic failover to keep the system running without impacting the business.
Fast, elastic, and infinite scaling
As an e-commerce platform, Flipkart has peak times and off-peak times. To accommodate such rapidly changing needs, TiDB can automatically scale its storage and computing nodes up, down, in, or out. More importantly, such scalability is virtually unlimited and can meet Flipkart’s rapidly growing business needs both now and in the future.
High MySQL compatibility
The majority of Flipkart’s products and services previously used MySQL as their storage solution. TiDB is highly MySQL compatible, so Flipkart could switch from its legacy databases to TiDB with minimal or even zero changes to its existing applications.
Future plans with TiDB
After using TiDB in production for almost two years, Flipkart has tackled many of its pain points and improved its overall service performance. Flipkart plans to apply TiDB to more of its applications and services and try more TiDB tools.
Apply TiDB to higher throughput user-facing applications
TiDB is performing well in supporting applications with a throughput of around 10,000-20,000 reads/writes per second. Flipkart plans to apply TiDB to more applications with hundreds of thousands of reads and writes per second, or even more.
Apply TiDB to multi-region active-active applications
Flipkart plans to run its applications in an active-active configuration across multiple regions, because it is outgrowing its existing regions and wants to ensure business continuity. Flipkart is in discussion with TiDB engineers on how to create the optimal topology for these needs.
Leverage TiDB’s HTAP capabilities
So far, Flipkart has used TiDB mainly as an Online Transactional Processing (OLTP) database to support its transactional workloads, and it hasn’t yet explored TiDB’s Hybrid Transactional/Analytical Processing (HTAP) capabilities. This is because Flipkart runs most of its analytics on an internal big data platform, a separate system that can hardly handle real-time analytical queries on fresh transactional data. In the near future, Flipkart plans to explore and leverage TiDB’s HTAP capabilities for both its transactional and real-time analytical needs.
This customer story is based on a talk given by Kaustav Chakravorty at the Virtual HTAP Summit.