When I first started building TiDB with my co-founders, we encountered countless challenges, pitfalls, and critical design choices that could have made or broken the project. To build an enterprise-grade distributed database like TiDB from scratch, we have to constantly make difficult decisions that balance the speed of development with long-term considerations for our customers and our team.
Three years and two big releases later, TiDB 2.0 is now deployed in-production in more than 200 companies. Along the way, our team answered many questions from our users as they evaluated TiDB and other distributed databases. Choosing what to use in your infrastructure stack is an important decision and not an easy one. So I’ve gathered our learning from over the years, and summarized them into the following “9 why’s” that every engineer should ask when looking at a distributed database. Hopefully, this list can make your decision-making a bit easier.
There’s no single technology that can be the elixir to all your problems. The database realm is no different. If your data can fit on a single MySQL instance without too much pressure on your server, or if your performance requirement for complex queries isn’t high, then a distributed database may not be a good choice. Choosing to use a distributed database typically means additional maintenance cost, which may not be worthwhile for small workloads. If you need High Availability (HA) for small workloads, MySQL’s master-slave replication model plus its GTID solution might just be enough to get the job done. If not, you can extend it with Group Replication. With a community as active and large as MySQL’s, you can Google pretty much any issue and find the solution. In short, if a single instance database is enough, stick with MySQL. When we first began building TiDB to be MySQL compatible, our goal was never to replace MySQL, but to solve problems that single instance databases inherently cannot solve.
So when is the right time to deploy a distributed system like TiDB? Like any answer to a hard question, it depends. I don’t want to give a one-size-fits-all answer, but there are signals to look for when deciding whether a distributed database is the right answer. For example, are you:
If you find yourself asking these types of questions on a regular basis, it’s time to consider whether a solution like TiDB can help you.
MySQL and TiDB are not mutually exclusive choices. In fact, we spent an enormous amount of work helping our users continue to use MySQL by building tools to make the migration from MySQL to TiDB seamless. This preserves the option of simultaneously using MySQL for single instance workloads where it shines. So if your data is still small and workloads light, keep on using MySQL—TiDB will be waiting for you in the future.
Easy Maintenance. We separated the TiDB platform’s SQL layer (stateless and also called TiDB) from its storage layer (persistent and called TiKV) in order to make deployment, operations, and maintenance more simple. It is one of the most important design choices we made. This may seem counter-intuitive (don’t more components make deployment more complex?). Well, being a DevOps isn’t just about conducting deployment, but also quickly isolating issues, system debugging, and overall maintenance–a modular design really shines in supporting these responsibilities. One example: if you found a bug in your SQL layer that needs an urgent update, a rolling update can be quite time-consuming, disruptive, even risky, if your entire system is fused together and not layered separately. However, if your SQL layer is stateless and separated, an update is easy and doesn’t disrupt other parts of your system.
Better Resource Usage. A modular design is also better for resource allocation and usage. Storage and SQL processing rely on different kinds of computing resources. Storage is heavily dependent on input-output (I/O) and affected by the kind of hardware you use, e.g. PCIe/NVMe/Optane SSDs. SQL processing relies more on CPU power and RAM size. Thus, putting SQL and storage in separate layers help make the entire system more efficient in using the right kind of resources for the right kind of work.
TiDB is designed to support an HTAP (hybrid transactional and analytical processing) architecture, where both OLTP and OLAP workloads can be handled with great performance together. Each type of workload is optimized differently and needs different kinds of physical resources to yield good performance. OLAP requests are typically long, complex queries that need a lot of RAM to run quickly. OLTP requests are short and fast, so must optimize for latency and throughput in terms of OPS (operations per second). Because TiDB’s SQL layer is stateless and separated from storage, it can intelligently determine which kind of workload should use which kind of physical resources by applying its SQL parser and Cost-Based Optimizer to analyze the query before execution.
Development Flexibility and Efficiency. Using a separate key-value abstraction to build a low-level storage layer effectively increases the entire system’s flexibility. It can easily provide horizontal scalability–auto-sharding along the keys is more straightforward to implement than a table with complex schema and structure. A separate storage layer also opens up new possibilities to take advantage of different computing modes in a distributed system. TiSpark, our OLAP engine that leverages Spark, is one such example, where it takes advantage of our layers design and sits on top of TiKV to read data directly from it without any dependency or interference from TiDB. From a development angle, separation allows multiple programming languages to be used. We chose Go, a highly efficient language to build TiDB and boost our productivity, and Rust, a highly performant systems language to develop TiKV, where speed and performance is critical. If the entire system isn’t modularized, a multi-programming language approach would be impossible. Our layered architecture allows our SQL team to work in parallel with our storage team, and we use RPC (more specifically gRPC in TiDB) for communications between the components.
Many people have asked me: can TiDB replace Redis? Unfortunately no, because TiDB is simply not a caching service. TiDB is a distributed database solution that first and foremost supports distributed transactions with strong consistency, high availability, and horizontal scalability, where your data is persisted and replicated across multiple machines (or multiple data centers). Simply put, TiDB’s goal is to be every system’s “source of truth.” But to make all these features happen in a distributed system, some tradeoffs with latency is unavoidable (and any argument to the contrary is defying physics). Thus, if your production scenario requires very low latency, say less than 1ms, you should use a caching solution like Redis! In fact, many of our customers use Redis on top of TiDB, so they can have low latency with a reliable “source of truth.” That way, when the caching layer goes down, there’s a database that’s always on with your data–consistent, available, and with unlimited capacity.
A more meaningful measuring stick of a distributed database is throughput, in the context of latency. If a system’s throughput increases linearly with the number of machines added to the system (more machines, more throughput), while latency is holding steady, that is the true sign of a robust distributed database. Some of our production users are already processing up to 1 million queries per second on a TiDB cluster; this level of throughput is nearly impossible to achieve on a single machine.
We have a saying inside PingCAP, “make it right before making it fast.” Correctness, consistency, and reliability always trump speed. Without them, what good does low latency do?
We get this question a lot: why did you make TiDB compatible with MySQL? First of all, MySQL has such a large community and strong grassroots foundation that many DBAs and developers are skilled in it; building TiDB to speak MySQL instead of presenting a new protocol drastically reduces adoption and migration cost. Second, MySQL is mature and accumulated a lot of tests and third-party frameworks over the years, which we leveraged when building TiDB to guarantee that it’s correct and dependable. Third, a large amount of production workloads, often mission-critical ones, are already being handled by MySQL in many industries and scalability is becoming a common headache, thus building a solution to address this pain point head-on makes perfect go-to-market sense.
Like I said, MySQL is great when applied appropriately; we don’t want to replace it. Instead, we believe being compatible with the MySQL world is the best way to stay close to our users’ real-life problems and help them. Additionally, ever since MySQL was acquired by Oracle, its performance and stability have both experienced steady improvements over the last few years, sometimes even out-performing Oracle in certain scenarios. We are bullish on MySQL’s future development and trajectory and want to grow with such a healthy community.
The answer to this question is where TiDB differs from many similar products in the market. Many projects choose to fork MySQL’s source code directly, like Aurora, and implement a storage layer underneath that fork to provide scalability. There are two obvious advantages to this approach: 1. less work and pressure on development; 2. guarantee 100% MySQL compatibility for users.
However, the disadvantages to this approach are also glaring. MySQL has been around for more than 20 years and was never designed for distributed scenarios, thus its SQL layer cannot take full advantage of the computing resources and parallel nature of a distributed system to generate the most optimal query plans. Swapping out an old storage engine for a new one may scale storage capacity, but it doesn’t scale computation capacity. That’s why we built a MySQL compatible layer from scratch (in TiDB) and added a coprocessor layer that connects with TiKV to leverage parallel computing possibilities in a distributed system to achieve top query performance.
You might ask: isn’t writing a MySQL layer from scratch really difficult. Yes and no. It does take a long time if you plan to rebuild every single feature of MySQL. But if you are more strategic and forgo certain functions, like stored procedures, that are not longer widely used, then your workload becomes much more manageable. (Here’s the full list of MySQL functions TiDB currently doesn’t support.) Plus, you can use a new programming language like Go to build (and maintain), which actually increases productivity and efficiency over time.
RocksDB. Many engineers like to build better wheels, and I’m no different. However, when you try to build an enterprise-grade infrastructure technology for industry users, like a distributed database, your consideration must be more practical. Currently, to build a new database, there are two types of storage structure to choose from: 1. B+ Tree; 2. LSM Tree. Instead of building our own storage engine from scratch, we chose LSM Tree with RocksDB.
For B+ Tree, write operations can be slow because each operation requires two “writes”: 1. in log; 2. “dirty” page refresh with the new data. Even if you change one byte, an entire page is affected (in MySQL’s default InnoDB engine, page size is 16K). LSM Tree has a similar issue, but it serializes all random writes where each page is returned in order, which B+ Tree cannot do. LSM Tree is also more friendly to compression with a more tightly packed storage format than B+ Tree.
We chose RocksDB’s LSM Tree-based storage engine because: 1. RocksDB has a massive and active community; 2. RocksDB has already been battle-tested within Facebook; 3. RocksDB’s connectors are nicely generalized and easy to use; and 4. it exposes many knobs that can be optimized directly (unlike LevelDB). As our team gain an increasingly deep understanding of RocksDB, we’ve become one of its biggest users and contributors. RocksDB has been gaining more adoption in both industry and academia; in fact, many of the latest papers related to LSM Tree improvement are developed based on RocksDB. As RocksDB becomes more “rock-steady,” so will TiDB.
etcd Raft. We chose to leverage the etcd Raft library for much of the same type of reasons that led us to RocksDB. Before deciding on etcd, we first decided to use Raft as our consensus algorithm after considering MultiPaxos, because Raft is very friendly to development of an industry-facing product (even the RPC structure was clearly described in the original Raft paper!). Choosing Raft led us to etcd, which at the time was already a mature open-source project that implements Raft with many adopters. Furthermore, we were impressed by (and thus leveraged) etcd’s friendly design towards conducting rigorous testing. Its stateful machine connectors are beautifully abstracted, so much so that you can completely separate them from the operating system’s API, which dramatically reduces the difficulty and cost of writing unit tests. etcd also has many hooks that allow developers to easily inject different types of faults and errors, clearly showing that its team places great emphasis on rigorous testing, which gave us a lot of confidence.
Our team is now one of etcd’s most active contributors, providing everything from bug fixes to major new feature contribution, like Learner. We didn’t completely fork etcd for our own use, because etcd was written in Go and we built TiKV in Rust, so we worked on realizing etcd’s Raft implementation in Rust, which also meant that we had to port etcd’s suite of unit and integration tests to make sure that what we replicated in TiKV has the same high level of correctness. TiKV also differs from etcd in that it has a Multi-Raft model, where one cluster can have a large number of Raft groups that are independent of each other, and the system must be able to safely and dynamically split and transfer different Raft groups within this complexity. For more info on how we made Multi-Raft happen, check out this blog.
TiDB’s in-production hardware requirements are quite high, but for good reasons. On the storage side, we highly recommend using SSDs (NVMe/PCIe/Optane) inside machines that also have decent CPU speed and RAM size, along with fast network bandwidth. We make these requirements a priority, because only good hardware can show the true strengths of good software.
Like I mentioned, we use RocksDB and the nature of a LSM Tree-based storage engine leads to high input-output disk requirement. RocksDB also has parallel compaction mechanism. All these high-end features need high-end storage unit, i.e. SSDs, to perform well. On the CPU side, because the TiDB platform uses the feature-rich gRPC to communicate between components and gRPC relies on HTTP2, which entails additional CPU cost, good CPU speed is necessary to achieve good performance. Also, TiKV’s implementation of RocksDB requires a large number of Prefix Scan, meaning many binary searches and string comparison, which is quite different from traditional offline data warehouse. This also leads to high requirement for CPU speed.
On the stateless SQL layer, TiDB is quite computation intensive, which leads to RAM size requirement. When TiDB executes certain types of join, say HashJoin, it may carve out a big chunk of RAM and execute the entire join operation in-memory, in order to return results faster. This could lead to huge RAM consumption if the queries are complex or if the tables are large. When we design this layer, our philosophy is to maximize RAM for performance, which is a different approach from the classic single instance database optimization that sends queries to be processed on disk if they cannot fit in RAM. In TiDB, we would rather refuse to execute a query and send users an error, if we run out of RAM, than pushing the query to disk and let it linger. Sometimes a query in limbo is much worse than a failed query.
Many of our adopters use TiDB to replace and upgrade their online transactional workloads, which typically have very high performance requirements. It’s also the use case TiDB is best suited to tackle. However, in our early days of deployment, we didn’t enforce any strict requirement on hardware, which led to performance issues and unmet expectations when services hit peak traffic. Even though on the software level, TiDB had many optimizations to handle peak workloads, that power didn’t yield good results because of sub-optimal hardware. Thus, we started to enforce certain hardware requirements when testing TiDB, so there’s less headache and less worries for our users.
We implemented range-based sharding inside TiKV instead of hash-based, because our goal from the beginning was to support a full-featured relational database, and a relational database must support various types of Scan operations, like Table Scan, Index Scan, etc. Even though range-based sharding is harder to build, hash-based sharding can’t maintain a given table’s data in sequence, thus even a small scan on a few rows in a hash-based setting could mean jumping around multiple shards between multiple nodes. Range-based sharding won’t have this issue.
Of course, that doesn’t mean range-based sharding doesn’t have its own tradeoffs. For example, if your query pattern is mostly sequential writes where write operations far exceed reads with no Update or Delete, like log, then hotspots could form. For this scenario, we plan to use Partition Table as a solution to alleviate the system’s pressure caused by this write pattern (this is on our roadmap for 2018). Another situation that is much harder to resolve is if your request pattern contains frequent reads/writes on a small table that happens to be in the same Raft region. Right now, each TiKV Raft region is split along ordered key-value pairs and defaulted to 96MB. If such a small table with frequent reads/writes is less than 96MB, a hotspot would no doubt form, and the only solution (so far) is to manually split the region into two and redistribute them on different nodes. This “fix” won’t solve the hotspot issue, but only somewhat alleviate it by making the hotspot smaller. This is a problem that a distributed database isn’t well-suited to solve, so if you have this type of request pattern, we suggest you use an in-memory caching layer, like Redis, or put this small table directly in your application layer.
When a company starts out small, any database and infrastructure would work, but when that company starts to experience explosive (often unexpected) growth, your choice of infrastructure technology matters a great deal. Making the right choice that can easily scale in either direction depending on your business needs could mean the life and death of your company. We all remember Twitter’s infamous “fail whale” when that service was constantly going down due to not just user growth, but also poor infrastructure. When your database becomes the bottleneck, the solution is not to manually shard your tables, sacrifice relational capabilities, make a bunch of copies of the same table, and rinse and repeat this vicious cycle. The correct (and most cost-effective) way is to use a distributed database that scales elastically, so all you have to do is add new machines to increase capacity. Adding more machines may seem like a growing cost, but it’s far cheaper comparing to what you save in human resources and the precious engineering time you need to respond and adapt in a competitive environment.
When we first started designing and building TiDB, we carried a lot of these lessons, from both our own experiences and other fellow DBAs and infrastructure engineers, with us. A truly useful and robust distributed database should instantly solve scalability bottlenecks and render all the “quick fixes” like manual sharding obsolete, so application developers can focus on doing what they do best–serve customers and grow the business, not managing database shards. Recently, we’ve seen two of our users, Mobike (dockless bikesharing) and Zhuan Zhuan (online marketplace), do exactly that. Both companies have been experiencing explosive growth, and because they deployed TiDB right as this growth was taking off, their database infrastructure was not the bottleneck that prevented them from growing the way they did. In fact, Zhuan Zhuan is all-in on TiDB, because it knows that a well-built distributed database is mission critical to its future.
Note: An abridged version of this article was published on The New Stack.