The Long Expedition toward Making a Real-Time HTAP Database

Recently, Hybrid Transactional and Analytical Processing (HTAP) has become a hot topic as large players like Google (AlloyDB), Snowflake (Unistore) and Oracle (HeatWave) have joined the game. But still, many people don’t know what makes an HTAP database.

TiDB is an open-source, distributed HTAP database. Thousands of companies in different industries have benefited from TiDB’s HTAP abilities over the years. However, building an HTAP database is quite a long expedition, and we are still on our way.

I am one of the designers and developers of TiDB’s HTAP architecture. In this post I’ll share some stories behind our HTAP design decisions and how we learned from our customers and built an HTAP database trusted and recognized both by customers, researchers, and developers.

Where the HTAP dream begins

Earlier TiDB versions, mainly those before TiDB 2.0, were designed to be strong Online Transactional Processing (OLTP) databases with horizontal scalability, high availability, strong consistency, and MySQL compatibility.

The figure below shows TiDB’s original architecture with three key components: the TiDB server, the Placement Driver (PD), and the TiKV server. The architecture did not include later features such as a columnar engine, massively parallel processing (MPP) architecture, and vectorization.

TiDB’s original architecture

PD was the brain of the whole database system, which scheduled the storage nodes and kept their workloads balanced. TiKV was the storage engine which stored data in a distributed way and used a row format with cross-node ACID transaction support. The TiDB server was the stateless SQL compute engine with a classic volcano execution design.

When we tried to get early adopters, we found most users were hesitant to use a brand new database in their mission critical transactional cases. Instead, they were inclined to use TiDB as their backup database for analytical workloads. Conversations like below happened many times:

“This is an edgy distributed design … I bet you don’t want to miss the amazing experience with no sharding.” We tried to persuade our potential customers to adopt TiDB.

“Hmm,” they replied. “Sounds interesting. Can we use it as a read-only replica for analytical workloads first? Let’s see how it works.“

We implemented the coprocessor framework to TiDB to speed up analytical queries. It allowed limited computation such as aggregations and filtering to be pushed down to TiKV nodes and executed in a distributed manner. This framework worked very well and gained us more TiDB adopters.

TiKV coprocessor

Thank you, Apache Spark!

It was all good in the beginning, until we encountered more complicated analytical scenarios. Our customers told us that TiDB worked very well in OLTP scenarios, but was “a bit slow” in Online Analytical Processing (OLAP) scenarios, especially when they used TiDB to analyze a large volume of data and perform big JOINS. Also, TiDB did not work well with their big data ecosystem.

In short, the problem was TiDB’s unmatched computation power with its scalable storage.

TiDB’s storage system could scale, but the TiDB server, the computational component, could not. In OLTP scenarios, this problem could be fixed by adding multiple TiDB servers on top of TiKV. But in OLAP scenarios, each query could be very large. Since TiDB did not have an MPP architecture, TiDB servers could not share a single query workload. Operations like large JOINs became unacceptably slow. We urgently needed a scalable computation layer that could shuffle data around and work with the scalable storage layer together to deal with large queries.

To fix this problem, we either needed to have our own MPP framework or we had to leverage an external engine. We only had a small team then, so in TiDB 3.0 we decided to leverage Apache Spark, a well-implemented and unified computation engine, and built a Spark plugin called TiSpark on top of TiKV. It included a TiKV client, a TiDB compatible type system, some coprocessor specific physical operators, and a plan rewriter. TiSpark partially converts the Spark SQL plan into the coprocessor plan, gathers results from TiKV, and finishes the computing in the native Spark engine. Thanks to Spark’s flexible extension framework, all this was achieved without changing a single line of Spark code.

TiSpark architecture

TiSpark empowered TiDB in large scale analytical scenarios; but for small- to medium-sized queries, it didn’t work very well. To fix this problem, we also improved TiDB’s native compute engine. We changed TiDB’s optimizer from rule-based to cost-based, and also improved the just-in-time (JIT) design.

One more thing: TiSpark bridged the gap between TiDB and big data ecosystems. In many scenarios, TiDB serves as the warm data platform between the OLTP layer and data lakes due to its seamless integration of Spark.

No columnar store, no HTAP

Let’s revisit what we had in TiDB 3.0. We had coprocessors on top of the storage layer, a cost-based optimizer with some smart operators, a single node vectorized compute engine, and a Spark accelerator. Despite all this, TiDB was still not a true HTAP database.

First, TiDB didn’t have a columnar storage engine, which was crucial for analytics. Second, TiDB could not support workload isolation, and workload interference happened often. In the case of ZTO Express, one of the largest logistics companies in the world, the customer had to reserve quite a lot of TiKV resources for TiSpark due to the inefficiency of the row format and bad workload isolation. Performance profiling showed that TiSpark on TiKV burned unnecessary I/O bandwidth and CPU on the unused columns. In addition, ZTO Express had to add quite a few extra TiKV nodes to lower the peak resource utilization and leave enough safety space for OLTP workloads. Otherwise, the highly concurrent queries for individual packages would be greatly impacted by reporting and cause unstable performance.

We tried things like thread pool and some “smart” scheduling techniques, but they were risky and not flexible enough to put two workloads in the same machine. After some failed experiments, we built a special component that alleviated both these headaches: TiFlash, a distributed columnar store that replicated data from TiKV in a columnar format.

TiFlash looked a bit different than it does today. The TiFlash prototype was on top of Ceph, an open-source, software-defined storage platform, with the change data capture (CDC) as the data replication channel. This is similar to Snowflake on top of the object storage. I will not call this a “wrong decision” since we still add object storage support, but it was too aggressive for most TiDB users. At that time, they favored on-premises solutions, and Ceph was too cumbersome if added to our product.

After some unsuccessful user trials, we turned to a Raft learner-based design: TiFlash replicating data from TiKV via the Raft protocol as a non-voting role. Users could dedicate different machines to run the analytical engine and make the asynchronous replication from the OLTP engine. With the help of the Raft protocol, TiFlash could also provide consistent snapshot reads by checking replication progress as well as multiversion concurrency control (MVCC). This led to complete workload isolation.

The diagram below shows the architecture of TiKV and TiFlash. The lower left shows the TiFlash storage layer, and the lower right shows the TiKV storage layer. Data was written into the TiDB server and then was synced from TiKV to TiFlash via a Raft learn protocol. The async replication did not impact the normal OLTP workloads in TiKV.

TiFalsh and TiKV architecture

TiFlash had an updatable column store. In general, a columnar store is not fit for online updates based on primary keys. Traditional data warehouses or databases only support batch updates each hour or day. To solve this problem, we introduced a delta tree, a new design that can be seen as a combination of a B+ tree and log-structured merge (LSM) tree. However, the delta tree has larger leaf nodes than a B+ tree, and it has double the layers of an LSM tree. It divides the column engine into a write-optimized area and a read-optimized area.

LSM tree vs delta tree

With the inception of TiFlash, TiDB 4.0 became a true HTAP database. If you are interested in knowing more about the HTAP design, I recommend that you read the paper, TiDB: A Raft-based HTAP Database^[1].

A smart MPP design and an even smarter optimizer

TiDB became a true HTAP database when it introduced TiFlash in TiDB 4.0, but we still had a technical issue to address: TiSpark was the only distributed query engine in the TiDB ecosystem. It was not suitable for small- to medium-sized interactive cases, and its MR-style shuffle model was quite heavy. We needed a native computation engine in the distributed framework with an MPP style. Moreover, we reached a sticking point: to add new features or optimize our code, we’d have to modify the Spark engine itself. But if we did that, it would be a heavy burden to keep in sync with the upstream.

After quite a few intense debates, we decided to go for the MPP architecture in TiDB 5.0. With this new architecture, TiFlash would be more than a storage node: it would be a fully-functioning analytical engine. The TiDB server would still be the single entrance to SQL, and the optimizer would choose the most efficient query execution plan based on cost, but it had one more option: the MPP engine.

In TiDB’s MPP mode, TiFlash complements the computing capabilities of the TiDB servers. When the TiDB server deals with OLAP workloads, it steps back to be a master node. The user sends a request to the TiDB server, and all TiDB servers perform table joins and submit the results to the optimizer for decision making. The optimizer assesses all the possible execution plans (row-based, column-based, indexes, single-server engine, and MPP engine) and chooses the optimal one.

TiDB’s MPP mode

The following diagram shows how the analytical engine breaks down and processes the execution plan in TiDB’s MPP mode. Each dotted box represents the physical border of a node.

A query execution plan in MPP mode

Our first version of the MPP framework already exceeded the TPC-H performance of some traditional analytical databases like Greenplum. It also greatly expanded TiDB’s use cases and, in 2021, it helped us acquire quite a few important HTAP customers. All the days and nights we spent at customer sites led to solid improvements to the MPP architecture in TiDB 6.0. The TiFlash engine finally entered was maturing fast.

Citius, Altius, Fortius

TiDB 5.0 delivered the first version of the TiFlash analytical engine with an MPP execution mode to serve a wider range of application scenarios. In TiDB 6.0, we improved TiFlash even more, making it support:

More operators and functions. The TiDB 6.0 analysis engine adds over 110 built-in functions as well as multiple JOIN operators. Moreover, MPP mode supports the window function framework and partition table. This release substantially improves TiDB analysis engine performance, which in turn benefits computing.

An optimized thread model. Earlier versions of TiDB placed little restraint on thread resource usage for the MPP mode. This could waste a large amount of resources when the system handled high-concurrency short queries. Also, when performing complex calculations, the MPP engine occupied a lot of threads, which led to performance and stability issues. To address this problem, TiDB 6.0 introduces a flexible thread pool and restructures the way operators hold threads. This optimizes resource usage in MPP mode and multiplies performance with the same computing resources in short queries and better reliability in high-pressure queries.

A more efficient column engine. By adjusting the storage engine’s file structure and I/O model, TiDB 6.0 not only optimizes the plan for accessing replicas and file blocks on different nodes, but it also improves write amplification and overall code efficiency. Test results from our customers indicate that concurrency capability has improved by over 50% to 100% in high read-write hybrid workloads with CPU, and memory resource usage has dramatically reduced.

Looking ahead

Building an HTAP database is a long journey, and our efforts paid off. More and more TiDB users are benefitting from TiDB HTAP abilities for faster decision making, better user experience, and quicker time to market.

Even though we’ve traveled far, there’s still a long road ahead. In fact, HTAP is never a pure technical term, but also represents users’ needs. It has been evolving over the years, from in-memory technologies in the very beginning, to various designs today. For example, SingleStore follows the “classic” in-memory architecture with a single engine; HeatWave is mainly in-memory with a separated engine; TiDB and AlloyDB use on disk storage and separate workloads into different resources. Although HTAP practitioners make different design decisions, one thing in common never changes: users’ need is the utterly most important. HTAP designs will continue to evolve to solve users’ problems smartly. There are still many problems from the user side, such as real-time data modeling and transforming, and better leveraging the cloud infra, needing to be fixed.

Nevertheless, I believe HTAP databases will eventually prevail in the database world. Before that day comes, we will continue our long expedition.

If you are interested in TiDB, you’re welcome to join our community on Slack and TiDB Internals to share your thoughts with us. You can also follow us on Twitter, LinkedIn, and GitHub for the latest information.

Keep reading:
Using Retool and TiDB Cloud to Build a Real-Time Kanban in 30 Minutes
Build a Better Github Insight Tool in a Week? A True Story
The Beauty of HTAP: TiDB and AlloyDB as Examples

References: