Text2SQL Enhanced: Data Exploration with AI and TiDB Cloud

In early 2023, TiDB Cloud introduced Chat2Query, a new tool that leverages OpenAI’s capabilities to help explore and interact with databases through natural language. After a year of continuous improvements, we’re excited to share that Chat2Query has now become even more powerful.

In this blog, we dive deeper into three major improvements to Chat2Query that will make your data interactions easier and more efficient:

Improved Text2SQL for more accurate interpretation of natural language queries
Expanded capability to handle large datasets for businesses with extensive data
OpenAPI availability for seamless integration with existing systems

Note: For TiDB Dedicated users, make sure you have upgraded to TiDB version 6.5 or higher and obtained whitelist access to utilize the Chat2Query feature.

Enhanced Accuracy for Text2SQL

Text2SQL is the most fundamental feature of Chat2Query, reducing the barrier to data exploration. Over the past year, we’ve made significant strides in Text2SQL accuracy, leveraging prompt engineering and Retrieval-Augmented Generation (RAG) techniques.

We conducted tests using the BIRD benchmark, a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing.Our results ranked us among the top four. Furthermore, We also tested it against the Spider benchmark, a comprehensive, challenging dataset known for its extensive coverage of complex, cross-domain semantic parsing and text-to-SQL tasks. And achieved an impressive score of 86.30, marking a significant achievement in our ongoing pursuit of excellence in Text2SQL capabilities.

Elevated Text2SQL - Spider Score — *Figure 2. Score on Spider Benchmark*

We will showcase the enhancement by comparing Chat2Query 1.0 and 2.0 in how they handle the following scenarios:

Valid queries with available data
Invalid queries

Accurate SQL Generation with Chat2Query 2.0 for Valid Queries

For SQL queries that can be satisfied with the available data, Chat2Query 2.0 showcases superior accuracy. For example, take the inquiry, “How many bikes were available in 2015?” Addressing this requires adept schema linking—correctly associating user queries with the appropriate database table and column. Chat2Query 2.0 precisely identifies the relevant table and column, leading to an accurate SQL query:

SELECT SUM(`bikes_available`) AS `total_bikes_available` 
FROM `status` 
WHERE YEAR(`time`) = 2015;

Conversely, Chat2Query 1.0 struggles with schema linking, often generating incorrect SQL queries that fail to provide the correct answer:

SELECT COUNT(*) AS `bike_count`
FROM `status` s
JOIN `trip` t ON s.`station_id` = t.`start_station_id`
WHERE t.`start_date` LIKE '2015%'

Handling Invalid Queries with Intelligence

Chat2Query 2.0 can adeptly handle invalid queries—those unrelated to the database data. For instance, when presented with the question, “What is the weather?” Chat2Query 2.0 responsibly prompts the user to provide a different instruction, preventing random and irrelevant responses. By contrast, Chat2Query 1.0 generates a random answer, which can confuse or mislead users.

*Figure 3. Chat2Query 2.0: Prompt users to provide different instructions*

*Figure 4: Chat2Query 1.0: Answer randomly*

Optimized for Large Datasets

Large datasets are characterized by either a vast number of columns in a single table or an overwhelming number of tables in a database. For Chat2Query 1.0, large datasets typically pose a significant challenge due to the limitations of the context window in OpenAI’s Language Model (gpt-3.5-turbo-16k). When attempting to process a large dataset, feeding all the information as context to the LLM might surpass this limit, leading to failures in Text2SQL. In practice, errors occur if the database includes more than 30 tables and the average number of columns per table exceeds 20.

To counter this, we’ve implemented strategies like Embedding Similarity Search and MapReduce processing in Chat2Query 2.0, enabling it to handle large datasets often required by real-world business applications efficiently.

Version Comparison on Large Datasets

When it comes to processing a large database (e.g., one with 75 tables and 688 columns), Chat2Query 2.0 shows its robustness by successfully generating SQL, whereas Chat2Query 1.0 often fails due to exceeding the context window limit of the underlying language model.

*Figure 5. Chat2Query 2.0: SQL generated successfully*

*Figure 6. Chat2Query 1.0: Failed to generate SQL*

Accessible via OpenAPI

More and more users have expressed their desire to integrate our Text2SQL capabilities into their systems to improve user experience. However, they also have privacy and security concerns when exposing their sensitive business data to external systems. To address this, we’ve made the full capabilities of Chat2Query accessible via OpenAPI. This enables seamless integration into user systems without compromising data security.

For a step-by-step guide on using OpenAPI to integrate Chat2Query with your application, please refer to our API documentation.

Conclusion

The improved Chat2Query is highly accurate in converting text to SQL and capable of handling large datasets. This allows it to be a versatile tool for data exploration. Moreover, its accessibility via OpenAPI ensures that it can be seamlessly incorporated into existing systems, making it a powerful and user-friendly solution for a wide range of applications.

As we look to the future, Chat2Query is poised for further innovation, enhancing its capabilities and user experience. Here’s what to expect:

Chat mode interaction: Chat2Query will offer a more conversational approach to interacting with Chat2Query. This enables users to engage in a chat-like dialogue to generate SQL queries.
Incorporation of domain knowledge: This feature provides Chat2Query with adaptive learning capabilities. Chat2Query will become more intelligent with use, offering increasingly sophisticated and accurate responses tailored to specific domain requirements.

You can try out Chat2Query for free right now by signing up for TiDB Cloud. Additionally, try our demo for data analysis at TiInsight to see Chat2Query in action.