
Modern applications generate enormous amounts of event data with user actions, transactions, logs, and metrics all happening in real time. To handle this scale, many teams rely on Apache Kafka, a distributed messaging system that decouples applications from their data pipelines and ensures reliable, high-throughput data delivery.

On the storage side, TiDB provides a distributed SQL database that scales horizontally, handles both transactional and analytical queries, and maintains low-latency performance even under heavy load.

Together, Kafka and TiDB form a powerful foundation for real-time workloads where high write throughput and fast data processing are critical.

This two-part blog tutorial explores how to integrate Kafka with TiDB. Part 1 covers the basics of streaming data from Kafka to TiDB and explains why this architecture is becoming increasingly popular. Part 2 will examine how TiDB performs when Kafka processes millions of messages per second and how to monitor TiDB’s internal performance.

Why Stream Data through Kafka?

A recent customer project involved an application that sent messages directly to Kafka. From Kafka, data flowed into a persistent storage layer that included systems such as SQL Server and Cassandra.

This design choice is common for systems that handle large volumes of writes. Sending data directly to a database under heavy load can lead to latency issues, slowing down the entire application. Kafka helps mitigate this by acting as a buffer between the application and the database, ensuring that high-frequency writes are first collected and processed asynchronously before reaching the storage layer.

By decoupling ingestion from persistence, Kafka maintains consistent performance and reliability even during spikes in traffic.
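As a minimal illustration of this pattern (the topic name and payload here are placeholders, not taken from the project above), the application publishes an event to a Kafka topic instead of issuing a direct database write:

echo '{"event": "signup", "user_id": 42}' | kafka-console-producer --bootstrap-server localhost:9092 --topic user_events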

Fig. 1: How Kafka decouples application data streams

What is TiDB?

TiDB is an open-source, distributed SQL database designed for horizontal scalability, strong consistency, and high availability. It uses a decoupled compute and storage architecture, allowing each layer to scale independently — a key advantage for cost and performance optimization.

TiDB is MySQL-compatible, which means existing applications, drivers, and SQL syntax can often be reused with minimal modification. This compatibility also makes it easier to migrate from other databases such as MySQL, PostgreSQL, or MongoDB.

Fig. 2: How TiDB complements Kafka for unified workloads

Let’s start with an example from a test TiDB instance running on the cloud:

ankitkapoor@Ankits-MacBook-Air bin % ./mysql -uankit -hxxx -P 4000 -p


mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| PERFORMANCE_SCHEMA |
| ankit              |
| kafka              |
| mysql              |
| test               |
+--------------------+
6 rows in set (0.10 sec)

What About TiDB to Kafka?

TiDB can also stream data to Kafka using TiCDC, which is covered in detail in the official TiDB documentation.

TiCDC (TiDB Change Data Capture) is a component that captures real-time changes from a TiDB cluster and replicates them downstream. It reads Raft logs (the internal records that track every change in the cluster) and pushes those changes to external systems such as Kafka, another TiDB cluster, or cloud storage.

For reference, TiDB Raft log files typically look like this:

-rw-r--r--  1 ankitkapoor  cc    69B 11 Sep 02:29 0000000000000001.rewrite
-rw-r--r--  1 ankitkapoor  cc     0B 11 Sep 02:39 LOCK
-rw-r--r--  1 ankitkapoor  cc   869K 11 Sep 02:39 0000000000000001.raftlog  <— Raft log
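For completeness, a changefeed that pushes these changes into Kafka is created with the TiCDC CLI. A rough sketch is shown below; the exact flags and sink-URI options vary by TiCDC version, and the hosts and topic are placeholders:

cdc cli changefeed create \
  --server=http://<ticdc-host>:8300 \
  --changefeed-id="tidb-to-kafka" \
  --sink-uri="kafka://<broker-host>:9092/<topic>?protocol=canal-json"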

While TiCDC handles streaming from TiDB to Kafka, this article focuses on the reverse direction: Kafka to TiDB.

Kafka to TiDB: Overview

Streaming data from Kafka to TiDB is often achieved using Kafka Connect, an open-source framework for building scalable and reliable data pipelines. While other tools like PySpark can accomplish this, Kafka Connect provides a simpler and more performant approach, especially for production environments.

Since TiDB is MySQL-compatible, existing MySQL JDBC drivers can be used to set up the data stream between Kafka and TiDB.
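Because TiDB speaks the MySQL wire protocol (port 4000 by default), the JDBC connection string used by the sink connector follows the standard MySQL form; the host and database names below are placeholders:

jdbc:mysql://<tidb-host>:4000/<database>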

Requirements

To follow this guide, the following components are required:

  1. Kafka
  2. Zookeeper
  3. kafka-topics CLI
  4. kafka-console-producer
  5. kafka-console-consumer
  6. Kafka Connect JDBC sink connector
  7. MySQL client
  8. TiDB cluster

Test Environment

  • Local machine: macOS 15.6.1
  • MySQL client: 9.4.0 (any recent version will work)
  • Database: TiDB Cloud Serverless (publicly available)

What this Blog Won’t Cover

This blog assumes basic familiarity with Kafka fundamentals such as Zookeeper, Kafka topics, messages, and streaming concepts. Those topics are well-documented in the official Kafka resources and will not be repeated here.

Getting Started

Step 1: Install Kafka

brew install kafka

Expected message:

To start kafka now and restart at login:
  brew services start kafka
Or, if you don't want/need a background service you can just run:
  /opt/homebrew/opt/kafka/bin/kafka-server-start /opt/homebrew/etc/kafka/server.properties

For Linux, refer to the official setup guide.

Step 2: Install Zookeeper

brew install zookeeper

Expected message:

To start zookeeper now and restart at login:
  brew services start zookeeper
Or, if you don't want/need a background service you can just run:
  SERVER_JVMFLAGS="-Dapple.awt.UIElement=true" /opt/homebrew/opt/zookeeper/bin/zkServer start-foreground

Start Zookeeper: 

brew services start zookeeper
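With Zookeeper running, start the Kafka broker as well, using the Homebrew service shown in the install output above:

brew services start kafka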

Step 3: Download Dependencies

Download the Confluent JDBC Sink connector plugin and the MySQL JDBC driver (Connector/J) JAR.

Move the MySQL connector JAR into the connector plugin’s library directory (a sketch of this step follows below) and create two configuration files:

  1. connect-standalone.properties
  2. mysql-sink-connector.properties
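A rough sketch of moving the driver into place is shown below; both paths are placeholders, and the destination should sit under the directory that plugin.path points to in the next step:

cp mysql-connector-j-<version>.jar /pathto_sink_jdbc_connector/lib/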

Step 4: Configure Kafka Connect

connect-standalone.properties

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/pathto_sink_jdbc_connector/

mysql-sink-connector.properties

name=jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
# the topic we will create later in this guide; choose any name you like
topics=kafka_to_TiDB
connection.url=jdbc:mysql://hostname:4000/yourdatabase
connection.user=user_name
connection.password=password
auto.create=false
auto.evolve=false
insert.mode=insert
pk.mode=none
table.name.format=tb_kafka_to_TiDB
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=filter
transforms.filter.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.filter.include=id,user
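With insert.mode=insert and the ReplaceField transform keeping only the id and user fields, each consumed record becomes a plain INSERT against the target table. Conceptually, with illustrative values:

INSERT INTO `tb_kafka_to_TiDB` (`id`, `user`) VALUES (1, 'Ankit');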

Step 5: Create the Target TiDB Table

Because the sink connector sets auto.create=false, the target table must exist before Kafka Connect starts, with columns matching the fields kept by the transform (id and user):

CREATE TABLE `tb_kafka_to_TiDB` (
  `id` int DEFAULT NULL,
  `user` char(255) DEFAULT NULL
);

Step 6: Start Kafka Connect  

Run the following command in the same directory as the configuration files:

connect-standalone connect-standalone.properties mysql-sink-connector.properties

Note: Ensure this command is run from the folder where the two configuration files, connect-standalone.properties and mysql-sink-connector.properties, were created.

Successful startup logs will include:

kafka_to_TiDB-0 (org.apache.kafka.clients.consumer.internals.ConsumerRebalanceListenerInvoker:58)
[2025-08-18 19:58:35,603] INFO [jdbc-sink|task-0] [Consumer clientId=connector-consumer-jdbc-sink-0, groupId=connect-jdbc-sink] Found no committed offset for partition kafka_to_TiDB-0 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1508)
[2025-08-18 19:58:35,607] INFO [jdbc-sink|task-0] [Consumer clientId=connector-consumer-jdbc-sink-0, groupId=connect-jdbc-sink] Resetting offset for partition kafka_to_TiDB-0 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:9092 (id: 1 rack: null isFenced: false)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState:447)
[2025-08-18 19:58:46,968] INFO [jdbc-sink|task-0] JdbcDbWriter Connected (io.confluent.connect.jdbc.sink.JdbcDbWriter:57)

Step 7: Create and Test a Kafka Topic

Create the topic that the sink connector subscribes to, then start a producer against it.
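If automatic topic creation is disabled on the broker, create the topic explicitly first; a minimal sketch, with illustrative partition and replication settings:

kafka-topics --bootstrap-server localhost:9092 --create --topic kafka_to_TiDB --partitions 1 --replication-factor 1

Then start the console producer: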

kafka-console-producer --bootstrap-server localhost:9092 --topic kafka_to_TiDB --property parse.key=false --property "key.separator=:"

Then, in a separate terminal, start a consumer to verify that messages are flowing through the topic:

kafka-console-consumer --bootstrap-server localhost:9092 --topic kafka_to_TiDB --from-beginning

hello
kafka
whats goin on
man
"User signed up"
"User signed up"
"User signed up"
{"id": 123, "status": "active"}
{"temperature": 25.4}

Step 8: Send Messages to Kafka

kafka-console-producer --bootstrap-server localhost:9092 --topic kafka_to_TiDB --property parse.key=false --property "key.separator=:"

>{"schema":{"type":"struct","fields":[{"field":"id","type":"int32"},{"field":"user","type":"string"}],"optional":false,"name":"kafka_to_TiDB"},"payload":{"id":1,"user":"Ankit"}}

Because the sink connector sets key.converter.schemas.enable and value.converter.schemas.enable to true, each JSON message must carry this schema and payload envelope, as shown above. The Kafka Connect logs should then confirm the successful write:

[2025-08-18 19:58:48,424] INFO [jdbc-sink|task-0] Setting metadata for table "ankit"."kafka_to_TiDB" to Table{name='"ankit"."kafka_to_TiDB"', type=TABLE columns=[Column{'id', isPrimaryKey=false, allowsNull=true, sqlType=INT}, Column{'user', isPrimaryKey=false, allowsNull=true, sqlType=CHAR}]} (io.confluent.connect.jdbc.util.TableDefinitions:64)
[2025-08-18 19:58:48,725] INFO [jdbc-sink|task-0] Completed write operation for 1 records to the database (io.confluent.connect.jdbc.sink.JdbcDbWriter:100)
[2025-08-18 19:58:48,726] INFO [jdbc-sink|task-0] Successfully wrote 1 records. (io.confluent.connect.jdbc.sink.JdbcSinkTask:91)

Verify in TiDB

Finally, connect to TiDB using the MySQL client:


./mysql -u 'ankit' -hhostname -P 4000 -p

Then query the table:


mysql> select * from ankit.kafka_to_TiDB;
+------+-------+
| id   | user  |
+------+-------+
|    1 | Ankit |
+------+-------+

You’ll see that your data was successfully inserted, and Kafka is now streaming events into TiDB.

Conclusion

By streaming data from Kafka to TiDB, organizations can take advantage of Kafka’s ability to handle massive event throughput while leveraging TiDB’s distributed SQL capabilities for scalable, real-time data processing. This setup helps reduce latency, prevent write bottlenecks, and ensure application performance remains smooth even under demanding workloads.

In Part 2 of this blog tutorial, we’ll dive into performance testing and observability, exploring how this architecture behaves under millions of messages per second and how to effectively monitor TiDB’s performance.

Want to try this yourself? Experience TiDB in action with the TiDB Cloud Quick Start Lab. For a deeper dive into distributed SQL, check out the TiDB University Courses with self-paced modules that cover everything from TiDB fundamentals to advanced performance tuning and real-world streaming integrations.

