Transcreator: Fendy Feng; Editor: Tom Dewan
The data lake is one of the most popular data storage technologies to emerge in recent years. Unlike a traditional data warehouse, a data lake can hold a vast amount of raw data in its native format until the data is needed for analytical applications. A data lake can also ingest incremental data updates in real time. In addition, storing massive amounts of data in a data lake is much cheaper than storing it in a database. That’s why more and more enterprises integrate their databases with a data lake and use the lake to store massive amounts of historical data that is queried less often.
To move data from a database to a data lake, you need an export tool, and different databases have different tools. If you use TiDB, an open source distributed NewSQL database, you need two separate tools: Dumpling to export the full data and TiCDC to replicate incremental data. Because the two tools run as separate processes, you cannot clean and process the data in real time, through a real-time materialized view, on its way to the lake.
At TiDB Hackathon 2021, Team TiLaker provided a solution named after the team: TiLaker. This data export tool can export both historical data and incremental data changes from TiDB into data lakes at the same time. This project won three prizes: the Second Prize, the “Best Market Potential Prize” sponsored by China Growth Capital, and the “Best Choice Award.”
How can TiLaker help?
TiLaker is a data export tool that moves both historical and real-time incremental data from TiDB into data lakes. It can also be seen as a Flink CDC Connector customized for TiDB and built on TiCDC, TiDB’s data replication tool.
It guarantees data security. When you switch from reading historical data to reading real-time incremental data, TiLaker guarantees zero data loss and zero redundancy for every single piece of data.
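To make the "zero loss, zero redundancy" idea concrete, here is a minimal sketch of one common way to hand over from a snapshot read to an incremental change stream. It is an illustration only, not TiLaker's actual implementation: the `ChangeEvent` type, its field names, and the timestamp-based cutoff are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row change. All field names here are hypothetical."""
    commit_ts: int          # commit timestamp of the change
    key: int                # primary key of the affected row
    value: Optional[str]    # new row value, or None for a delete


def load_with_handover(
    snapshot: Dict[int, str],
    snapshot_ts: int,
    changes: Iterable[ChangeEvent],
) -> Dict[int, str]:
    """Start from a snapshot read at snapshot_ts, then apply later changes."""
    table = dict(snapshot)
    for event in sorted(changes, key=lambda e: e.commit_ts):
        if event.commit_ts <= snapshot_ts:
            # Already reflected in the snapshot: skipping it avoids duplicates.
            continue
        if event.value is None:
            table.pop(event.key, None)      # delete
        else:
            table[event.key] = event.value  # insert or update (upsert by key)
    return table


snapshot = {1: "alice", 2: "bob"}       # consistent read at ts=100
changes = [
    ChangeEvent(90, 2, "bob"),          # committed before the snapshot: skipped
    ChangeEvent(110, 3, "carol"),       # insert after the snapshot
    ChangeEvent(120, 1, "alice-v2"),    # update after the snapshot
    ChangeEvent(130, 2, None),          # delete after the snapshot
]
result = load_with_handover(snapshot, 100, changes)
print(result)
```

In this sketch, every change committed after the snapshot timestamp is applied exactly once, and every earlier change is ignored because the snapshot already contains it.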
It is easy to use. TiLaker provides both a DataStream API and a SQL API for developers. The SQL API lets developers use pure Flink SQL to capture both the full historical data and the real-time incremental data coming from TiDB. The DataStream API lets developers write Java code to build more flexible and powerful features.
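As a rough illustration of the SQL API, the sketch below assembles a Flink SQL table definition that uses the `tidb-cdc` connector. The option names follow the Flink CDC TiDB connector documentation as best we know it, so check them against the version you use. The helper function, the column list, the PD address, and the database and table names are hypothetical placeholders.

```python
def tidb_cdc_ddl(flink_table: str, pd_addresses: str, database: str, table: str) -> str:
    """Build a Flink SQL DDL string for a TiDB CDC source table.

    The column list below is a made-up example schema, not a real table.
    """
    return f"""
CREATE TABLE {flink_table} (
  order_id INT,
  customer STRING,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'tidb-cdc',
  'pd-addresses' = '{pd_addresses}',
  'database-name' = '{database}',
  'table-name' = '{table}'
)""".strip()


# Hypothetical values: a local PD endpoint and an "orders" table in "shop".
ddl = tidb_cdc_ddl("tidb_orders", "localhost:2379", "shop", "orders")
print(ddl)
```

A statement like this would typically be submitted through the Flink SQL client or a `TableEnvironment`, with the connector JAR available to Flink. After that, a plain `SELECT` on the table reads the existing rows first and then the ongoing changes.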
It provides a real-time materialized view. Thanks to its changelogs, Flink SQL can seamlessly connect with the change data of the database. The tidb-cdc table defined by Flink SQL is the real-time materialized view of the corresponding TiDB table. Every change in the database will be updated automatically in the tidb-cdc table.
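The following sketch shows, in simplified form, how a changelog can keep a materialized view up to date. It is a conceptual model rather than Flink's or TiLaker's code: the row kinds use the short names Flink gives its changelog rows (`+I`, `-U`, `+U`, `-D`), while the function name and the "orders per status" view are assumptions chosen for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Changelog row kinds, written with the short strings Flink uses for them.
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"


def orders_per_status(changelog: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Maintain a COUNT(*) GROUP BY status view from (kind, status) rows."""
    view: Counter = Counter()
    for kind, status in changelog:
        if kind in (INSERT, UPDATE_AFTER):
            view[status] += 1           # a row entered this group
        elif kind in (UPDATE_BEFORE, DELETE):
            view[status] -= 1           # a row left this group
            if view[status] == 0:
                del view[status]        # drop empty groups from the view
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return dict(view)


changelog = [
    (INSERT, "new"),           # order 1 created
    (INSERT, "new"),           # order 2 created
    (UPDATE_BEFORE, "new"),    # order 1: old row image retracted
    (UPDATE_AFTER, "paid"),    # order 1: new row image added
    (DELETE, "new"),           # order 2 deleted
]
view = orders_per_status(changelog)
print(view)
```

Each database change arrives as one or two changelog rows, and the view is adjusted incrementally instead of being recomputed from the full table.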
It allows TiDB users and users of most mainstream databases to integrate heterogeneous data sources. Apache Flink’s CDC Connector project already supports many databases such as MySQL, MariaDB, PostgreSQL, Oracle, Aurora, and MongoDB. TiLaker goes a step further and makes Flink CDC support TiDB as well. For example, with TiLaker, users who have tables in both MySQL and TiDB can still run real-time stream processing across them with relational operators such as JOIN and UNION.
It enables you to integrate with the Flink ecosystem. Apache Flink has an active developer community and integrates with many downstream products. Because TiLaker is a TiDB-oriented Flink CDC Connector, it establishes a fast, efficient, and streamlined channel between the TiDB ecosystem and the big data ecosystem. It also makes exporting data from TiDB to data lakes more efficient.
TiLaker has already joined Apache Flink’s CDC Connector project. On March 28, Apache Flink officially released TiLaker under the name TiDB CDC Connector. This connector supports various TiDB versions including TiDB v5.1-5.4 and TiDB v6.0.
In future posts, we will introduce more outstanding projects produced at the TiDB Hackathon 2021. Stay tuned.
If you are also interested in hacking and the TiDB Hackathon, you’re welcome to follow @PingCAP on Twitter, Facebook, and GitHub for the latest information. You can also join our Slack discussions and share your ideas with us.