Hash joins are a common performance bottleneck in SQL workloads, and TiDB was no exception. In TiDB 8.5, we introduced a brand-new hash join executor that improves performance across the board. This new engine fully exploits modern hardware with multi-threaded build, vectorized probe, and a more efficient spill strategy. Internal benchmarks show it doubles the performance of common join workloads, reduces latency, and improves predictability under tight memory constraints.

This is the first part of a two-part blog series. In this post, we’ll focus on the high-level updates and what users need to know to benefit from the performance gains.

Why Joins Still Matter

Joins are at the heart of both analytical and transactional workloads. When they slow down, they do not just impact a single query. They can bottleneck entire applications.

Hash joins are often the go-to for performance, but under real-world conditions like skewed data or constrained memory, they can become unreliable and hard to debug. TiDB’s original hash join executor had several limitations: it built the hash table using a single thread, relied on Go’s general-purpose map structure, and had limited spill capabilities when memory ran out.

These issues led to high latency and inconsistent performance. If you have ever had a TiDB join stall, this was likely the cause.

Background

TiDB uses hash joins to execute a wide range of SQL queries, particularly when equality conditions are involved across large datasets. In a typical hash join, the executor selects one table as the build side and constructs an in-memory hash table from its rows. The other table is used to probe for matches.

For example, the following queries all trigger a hash join in TiDB:

-- Classic inner join
SELECT * FROM lineitem l JOIN orders o ON l.orderkey = o.orderkey;

-- Semi join with filter
SELECT * FROM lineitem l WHERE EXISTS (
  SELECT 1 FROM orders o WHERE l.orderkey = o.orderkey AND o.orderstatus = 'F'
);

-- Join with additional filter
SELECT * FROM lineitem l JOIN orders o
  ON l.orderkey = o.orderkey AND l.partkey > o.custkey;
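To make the build/probe split concrete, here is a simplified Go sketch of a hash join. It is illustrative only: the order and lineItem types and the data are invented for the example, and a real executor operates on columnar chunks rather than plain slices. Note that the sketch uses Go’s built-in map as the hash table, which is exactly the structure the legacy executor relied on, as discussed next.

package main

import "fmt"

// Hypothetical build-side and probe-side row types.
type order struct {
    OrderKey    int64
    OrderStatus string
}

type lineItem struct {
    OrderKey int64
    PartKey  int64
}

// hashJoin builds a hash table on the (usually smaller) orders input,
// then probes it with lineitem rows and emits one result per match.
func hashJoin(orders []order, items []lineItem) [][2]int64 {
    // Build phase: index the build side by join key.
    build := make(map[int64][]order, len(orders))
    for _, o := range orders {
        build[o.OrderKey] = append(build[o.OrderKey], o)
    }

    // Probe phase: look up each probe row and emit matching pairs.
    var out [][2]int64
    for _, l := range items {
        for _, o := range build[l.OrderKey] {
            out = append(out, [2]int64{l.PartKey, o.OrderKey})
        }
    }
    return out
}

func main() {
    orders := []order{{OrderKey: 1, OrderStatus: "F"}, {OrderKey: 2, OrderStatus: "O"}}
    items := []lineItem{{OrderKey: 1, PartKey: 10}, {OrderKey: 1, PartKey: 11}, {OrderKey: 3, PartKey: 12}}
    fmt.Println(hashJoin(orders, items)) // [[10 1] [11 1]]
}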

In earlier versions of TiDB, the build side was processed by a single thread using Go’s built-in map structure. This structure was not optimized for large-scale joins. It introduced pointer indirection, poor CPU cache locality, and runtime map resizes, which became performance bottlenecks as data volumes grew.

The probe phase had similar issues. Joins were executed one row at a time, which prevented any form of vectorization. For join types like semi joins, the executor did not short-circuit after finding a match. Instead, it continued scanning for duplicates. If the join involved string keys, collation logic was re-evaluated on each candidate match, further increasing cost.
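To see why the missing short-circuit mattered, compare the two probe functions below. This is a hedged sketch with hypothetical row and bucket types, not TiDB code: the first mirrors the exhaustive scan described above, while the second returns as soon as one qualifying match is found, which is all an EXISTS-style semi join requires.

package main

import "fmt"

// buildRow stands in for a build-side row; the bucket is the set of rows
// that hashed to the same slot. Both are hypothetical.
type buildRow struct {
    OrderKey    int64
    OrderStatus string
}

// legacySemiProbe mirrors the old behavior: it keeps scanning the whole
// bucket even after a match is found.
func legacySemiProbe(bucket []buildRow, key int64) bool {
    matched := false
    for _, r := range bucket {
        if r.OrderKey == key && r.OrderStatus == "F" {
            matched = true // no short-circuit; duplicates are still scanned
        }
    }
    return matched
}

// shortCircuitSemiProbe stops at the first match, which is all an
// EXISTS-style semi join needs to qualify the probe row.
func shortCircuitSemiProbe(bucket []buildRow, key int64) bool {
    for _, r := range bucket {
        if r.OrderKey == key && r.OrderStatus == "F" {
            return true
        }
    }
    return false
}

func main() {
    bucket := []buildRow{{7, "O"}, {7, "F"}, {7, "F"}}
    fmt.Println(legacySemiProbe(bucket, 7), shortCircuitSemiProbe(bucket, 7)) // true true
}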

When memory usage exceeded configured limits, TiDB attempted to spill intermediate data to disk. However, only the row data was spilled; the hash table remained in memory. This limited how much memory could be reclaimed. Under constrained memory, this often resulted in query failures or significantly degraded performance due to inefficient disk access patterns.

These limitations motivated a complete redesign of the hash join executor in TiDB 8.5, with a focus on parallelism, memory efficiency, and predictable performance.

The Legacy Bottlenecks

The original hash join executor in TiDB had several architectural limitations that created performance challenges, especially in large or complex queries like the examples shown in the previous section. These issues affected all three major phases of execution: build, probe, and spill.

Build phase

The build phase was limited by its single-threaded design. Even when TiDB was running on a system with 16 or 32 cores, only one core was used to construct the hash table. This created a hard limit on throughput, particularly for joins like the lineitem-orders query:

SELECT * FROM lineitem l JOIN orders o ON l.orderkey = o.orderkey;

Additional bottlenecks included:

  • Single-threaded execution. The hash table for the build side (e.g., orders) was constructed by a single worker.
  • Go map inefficiencies. The use of Go’s built-in map introduced pointer chasing and cache misses, which slowed down CPU-bound operations.
  • Dynamic resizing. As the map grew, it resized unpredictably during query execution, causing latency spikes.

Probe phase

Once the build was complete, the probe side processed rows one at a time. This was costly, especially in queries with join filters or non-equality conditions, such as:

SELECT * FROM lineitem l JOIN orders o
  ON l.orderkey = o.orderkey AND l.partkey > o.custkey;

Key inefficiencies during this phase:

  • No vectorization. Each input row triggered a full function call for key matching and predicate evaluation.
  • Unoptimized join semantics. For semi joins, TiDB scanned all matching build rows, even if a single match would have been sufficient.
  • Repeated comparisons. String keys with collation rules were re-evaluated on each candidate match, adding significant CPU overhead.

Spill phase 

When queries exceeded their memory quota, TiDB attempted to spill data to disk. However, this mechanism was incomplete and introduced further problems:

  • Incomplete spilling. Only row data was written to disk. The in-memory hash table remained resident, limiting how much memory could be recovered.
  • Inefficient I/O. Probing against spilled data required random disk access, which led to unpredictable and often poor performance.

These bottlenecks made join-heavy workloads hard to scale, especially when memory was constrained or the input data was large or skewed.

Design Principles for a Fresh Start

To address the limitations of the original executor, we redesigned the hash join engine from the ground up in TiDB 8.5. The new design is built around five principles that improve performance, predictability, and scalability.

  1. Exploit all available CPU cores
    • The build-side data is physically partitioned.
    • Each partition is processed in parallel by separate threads.
    • This eliminates the single-threaded bottleneck and shortens the build phase significantly.
  2. Store build-side data in row format
    • The build input is stored as rows rather than columns.
    • This improves cache locality during result construction.
    • It also simplifies join result materialization, especially for wide rows.
  3. Use a fixed-size, chain-based hash table
    • Memory for the hash table is allocated up front.
    • Hash collisions are handled using chaining rather than probing.
    • This avoids dynamic resizing and ensures consistent performance.
  4. Run probes using a vectorized, chunk-based path
    • Probing is performed in batches instead of row by row.
    • Join keys and filters are evaluated together using vectorized logic.
    • This improves throughput and reduces per-row overhead.
  5. Enable spill that includes hash buckets
    • When memory is low, entire partitions, including rows and hash table buckets, are spilled to disk.
    • Both spilling and restoring are sequential and partition-aware.
    • This allows spilling to scale while maintaining acceptable performance under constrained memory.

These changes allow the new engine to scale up across cores and scale down under tight memory, while maintaining predictable behavior and high performance.
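The sketch below illustrates principles 1 through 3 together: the build input is hash-partitioned, each partition is built on its own goroutine, and each partition gets a fixed-size, chain-based hash table whose bucket array is sized up front. All names (row, chainedTable, buildPartitioned) are invented for the example; TiDB’s real executor uses its own row format, hashing, and scheduling, so treat this as a sketch of the idea rather than the implementation.

package main

import (
    "encoding/binary"
    "fmt"
    "hash/maphash"
    "sync"
)

// row stands in for a build-side row stored in row format.
type row struct {
    Key     int64
    Payload string
}

// chainedTable is a fixed-size, chain-based hash table: heads[b] points
// at the first row in bucket b, and next[i] links rows that share a
// bucket. Nothing is ever resized after allocation.
type chainedTable struct {
    rows  []row
    heads []int
    next  []int
}

func hash64(seed maphash.Seed, key int64) uint64 {
    var buf [8]byte
    binary.LittleEndian.PutUint64(buf[:], uint64(key))
    return maphash.Bytes(seed, buf[:])
}

func newChainedTable(rows []row, seed maphash.Seed) *chainedTable {
    numBuckets := len(rows) // sized up front from the partition's row count
    if numBuckets == 0 {
        numBuckets = 1
    }
    t := &chainedTable{rows: rows, heads: make([]int, numBuckets), next: make([]int, len(rows))}
    for b := range t.heads {
        t.heads[b] = -1
    }
    for i, r := range rows {
        b := int(hash64(seed, r.Key) % uint64(numBuckets))
        t.next[i] = t.heads[b] // push the row onto its bucket's chain
        t.heads[b] = i
    }
    return t
}

// buildPartitioned hash-partitions the build input, then builds one
// fixed-size table per partition on its own goroutine. A real engine
// would use different hash bits for partitioning and bucketing.
func buildPartitioned(input []row, partitions int, seed maphash.Seed) []*chainedTable {
    parts := make([][]row, partitions)
    for _, r := range input {
        p := int(hash64(seed, r.Key) % uint64(partitions))
        parts[p] = append(parts[p], r)
    }

    tables := make([]*chainedTable, partitions)
    var wg sync.WaitGroup
    for p := range parts {
        wg.Add(1)
        go func(p int) {
            defer wg.Done()
            tables[p] = newChainedTable(parts[p], seed)
        }(p)
    }
    wg.Wait()
    return tables
}

func main() {
    seed := maphash.MakeSeed()
    input := []row{{1, "a"}, {2, "b"}, {1, "c"}, {3, "d"}}
    for i, t := range buildPartitioned(input, 2, seed) {
        fmt.Printf("partition %d holds %d rows\n", i, len(t.rows))
    }
}

Because every bucket array is allocated once from its partition’s row count, there are no mid-build rehashes, which is what removes the latency spikes the old map-based build suffered from.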

How the New Engine Works

The new hash join engine introduces parallelism, vectorization, and a smarter spill path to address each phase of execution. The following breakdown summarizes what changed at each stage and why it is faster.

  • Pre-build. What changed: Input chunks are hash-partitioned and simultaneously converted from column format to row format. Why it is faster: This eliminates the need for a second scan and improves row locality for result construction.
  • Build. What changed: Each partition is built in parallel using a thread-local, fixed-size hash table. Why it is faster: This fully utilizes CPU cores. In TPC-H 50 tests, hash table build time dropped from 73 seconds to 7.
  • Probe. What changed: Probing is done in batches. Tagged pointers quickly reject most false matches, and filters are evaluated using vectorized expressions. Why it is faster: This reduces unnecessary comparisons and improves cache efficiency. Probe time dropped from 208 seconds to 77 in TPC-H 50.
  • Spill. What changed: If memory limits are reached, entire partitions (including hash buckets) are written to disk sequentially and restored in the same form. Why it is faster: This minimizes random I/O. Performance remains stable even with a 1 GB memory quota.
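To make the Probe stage above concrete, here is a hedged sketch of a chunk-at-a-time probe that keeps a few hash bits as a tag on each entry and checks the tag before the full key comparison. The tagTable type, the 8-bit tag layout, and the use of hash/maphash are assumptions made for the example; TiDB packs its tags into pointer bits and evaluates join filters through its vectorized expression framework.

package main

import (
    "fmt"
    "hash/maphash"
)

// entry keeps a one-byte tag (a few bits of the key's hash) next to the
// build-row index, so most non-matches are rejected without comparing
// the full key. Names and layout are illustrative, not TiDB's.
type entry struct {
    tag uint8
    key string
    row int
}

type tagTable struct {
    buckets [][]entry
    seed    maphash.Seed
}

func newTagTable(keys []string, numBuckets int, seed maphash.Seed) *tagTable {
    t := &tagTable{buckets: make([][]entry, numBuckets), seed: seed}
    for i, k := range keys {
        h := maphash.String(seed, k)
        b := h % uint64(numBuckets)
        t.buckets[b] = append(t.buckets[b], entry{tag: uint8(h >> 56), key: k, row: i})
    }
    return t
}

// probeChunk processes a whole chunk of probe keys at once: hashes are
// computed in one tight loop, then candidates are filtered by tag before
// the expensive full-key comparison. Each match is (probe index, build row).
func (t *tagTable) probeChunk(probeKeys []string) [][2]int {
    hashes := make([]uint64, len(probeKeys))
    for i, k := range probeKeys { // batch pass over the chunk
        hashes[i] = maphash.String(t.seed, k)
    }
    var matches [][2]int
    for i, k := range probeKeys {
        h := hashes[i]
        tag := uint8(h >> 56)
        for _, e := range t.buckets[h%uint64(len(t.buckets))] {
            if e.tag != tag { // cheap rejection of most false candidates
                continue
            }
            if e.key == k { // full comparison only on a tag hit
                matches = append(matches, [2]int{i, e.row})
            }
        }
    }
    return matches
}

func main() {
    seed := maphash.MakeSeed()
    t := newTagTable([]string{"F", "O", "F"}, 8, seed)
    fmt.Println(t.probeChunk([]string{"F", "P"})) // [[0 0] [0 2]]
}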

This new execution model not only speeds up in-memory joins but also makes out-of-memory behavior much more predictable and efficient.
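The spill path can be pictured as follows: when a partition must be evicted, its rows and its hash buckets are streamed to disk in one sequential write and read back the same way when it is probed. This is a minimal sketch using encoding/gob and invented types; TiDB uses its own chunk-based serialization and spill files, so treat this only as an illustration of the sequential, partition-at-a-time idea.

package main

import (
    "encoding/gob"
    "fmt"
    "os"
)

// partition bundles the row data together with its hash-table buckets,
// so spilling it reclaims both (unlike the legacy path, which kept the
// hash table in memory). Types and layout are illustrative only.
type partition struct {
    Rows  []string // stand-in for serialized row data
    Heads []int    // bucket heads of the chain-based hash table
    Next  []int    // chain links
}

// spill writes the entire partition to disk in one sequential pass.
func spill(p *partition, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(p)
}

// restore reads the partition back, again sequentially, ready to probe.
func restore(path string) (*partition, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    var p partition
    if err := gob.NewDecoder(f).Decode(&p); err != nil {
        return nil, err
    }
    return &p, nil
}

func main() {
    p := &partition{Rows: []string{"r1", "r2"}, Heads: []int{0, -1}, Next: []int{1, -1}}
    path := "partition_0.spill" // hypothetical spill file name
    if err := spill(p, path); err != nil {
        panic(err)
    }
    defer os.Remove(path)
    back, err := restore(path)
    if err != nil {
        panic(err)
    }
    fmt.Println(len(back.Rows), back.Heads)
}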

Real-World Impact

To evaluate the new hash join engine, we tested it against the legacy implementation using representative queries and datasets. The improvements are significant across a variety of join types and scales.

Benchmark: Simple inner join

Joining two large tables with a straightforward equality condition.

SELECT * FROM lineitem l JOIN orders o ON l.orderkey = o.orderkey;
  • Legacy: 65 seconds
  • New engine: 29 seconds
  • Speedup: 2.2×

Benchmark: Join with non-equality predicate

Adds a post-join filter, increasing CPU and memory pressure.

SELECT * FROM lineitem l JOIN orders o
  ON l.orderkey = o.orderkey AND l.partkey > o.custkey;
  • Legacy: 143 seconds
  • New engine: 29 seconds
  • Speedup: 4.9×

Benchmark: Full TPC-H 50 GB 

End-to-end results for the TPC-H benchmark suite at scale factor 50.

  • Hash build time:
    • Legacy: 73 seconds
    • New: 7 seconds
  • Hash probe time:
    • Legacy: 208 seconds
    • New: 77 seconds
  • Total query runtime:
    • Legacy: 647 seconds
    • New: 561 seconds

At this scale, join performance is no longer the primary bottleneck. The remaining time is dominated by scan and aggregation operators.

Even under constrained resources, the new executor performs reliably. With just a 1 GB memory quota, the new engine completes queries successfully while the legacy path either fails or becomes 6× slower.

What This Means for You

The improvements in TiDB’s hash join executor are designed to deliver consistent performance, even as workloads scale or memory becomes constrained. Whether you’re running complex analytic queries or high-throughput transactional joins, the benefits are immediate and measurable.

Here’s what you can expect:

  • Consistent latency under pressure: The new engine maintains stable performance even with aggressive memory quotas. Spills are handled efficiently, and queries remain predictable.
  • Better resource utilization: Fixed-size memory allocation and parallel execution reduce CPU and memory waste. Joins complete faster without requiring oversized instances.
  • Scalability across workloads: Whether you are joining millions or billions of rows, the new engine adapts to available hardware and continues to perform efficiently.
  • Future-ready architecture: The vectorized and partitioned design opens the door for future improvements, such as SIMD acceleration, GPU integration, and advanced join optimizations.

This update is one of the most impactful performance improvements in TiDB 8.5 and lays a strong foundation for continued execution-layer enhancements in upcoming releases.

Next Steps

The new hash join executor is available starting in TiDB 8.5. It is disabled by default in this version but will become the default behavior in a future version. 

To try it out in 8.5, you can enable it at the session level:

SET SESSION tidb_hash_join_version = 'optimized';

Once enabled, the optimizer will choose the new executor for eligible queries automatically. No schema changes or application modifications are required.

If you are currently tuning queries or experiencing unpredictable performance on large joins, we encourage you to test this feature in your environment. You can report feedback or share results with us in GitHub Discussions or in the TiDB Community Slack.

This upgrade is part of our broader investment in execution-layer performance. We look forward to hearing how it performs in your workloads.

In the second part of this blog series, we’ll dive deeper into the engineering details for the open source community. Stay tuned for this next entry in the coming weeks.

