Managing Vectors in the Same Way as Operating on MySQL Data

As data management evolves, vector search is increasingly being built into traditional relational databases. TiDB Serverless now offers a vector search feature that lets users manage vectors much as they would operate on MySQL data. Integrating vector search into a relational database simplifies complex data operations and broadens the range of applications the database can serve.

Understanding Vector Search

Vector search enables semantic and similarity searches across various data types, such as text, images, videos, and audio. Instead of searching the data itself, vector search focuses on the meanings of the data. This is achieved by representing data as points in a multidimensional space, where the spatial relationships between points indicate semantic similarities.

In TiDB, vector embeddings are used to represent data. These embeddings can be stored alongside traditional data within the same database. This unique feature allows users to perform sophisticated searches and analyses while leveraging the robustness and scalability of a relational database.

Implementing Vector Search in TiDB

Setting Up TiDB for Vector Search

To get started with TiDB’s vector search feature, follow these steps:

Sign Up for TiDB Cloud: Create a TiDB Cloud account.
Select the Appropriate Region: At the time of writing, vector search is available only in the eu-central-1 region. Select this region when setting up your TiDB Serverless cluster.
Create a TiDB Serverless Cluster: Follow the tutorial on TiDB Cloud to create a cluster with vector search support enabled.

Basic Usage: Insert and Query Vectors

Once your cluster is set up, you can start by creating tables and inserting data that includes vector embeddings. Here is an example of how to create a table with a 3-dimensional vector field and insert records:

CREATE TABLE vector_table (
    id INT PRIMARY KEY, 
    doc TEXT, 
    embedding VECTOR(3)
);

INSERT INTO vector_table VALUES 
    (1, 'apple', '[1,1,1]'),
    (2, 'banana', '[1,1,2]'),
    (3, 'dog', '[2,2,2]');
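
In the INSERT statement above, each vector is passed as a string literal such as '[1,1,1]'. When the values come from application code, they need to be serialized into that form first. The following is a minimal Python sketch of that step; to_vector_literal is a hypothetical helper introduced here for illustration, not part of TiDB or any client library, and the length check only mirrors the VECTOR(3) column definition on the client side.

```python
from typing import List, Sequence, Tuple


def to_vector_literal(values: Sequence[float], dimensions: int) -> str:
    """Serialize numbers as a '[x,y,z]' string (hypothetical helper)."""
    if len(values) != dimensions:
        raise ValueError(f"expected {dimensions} values, got {len(values)}")
    return "[" + ",".join(str(v) for v in values) + "]"


# The same three records as in the INSERT statement above.
records = [
    (1, "apple", [1, 1, 1]),
    (2, "banana", [1, 1, 2]),
    (3, "dog", [2, 2, 2]),
]

# Parameter tuples ready to bind to a parameterized INSERT.
params: List[Tuple[int, str, str]] = [
    (row_id, doc, to_vector_literal(vec, 3)) for row_id, doc, vec in records
]

for p in params:
    print(p)
```

Checking the length before sending the statement surfaces a dimension mismatch early, instead of waiting for the database to reject the row.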

You can query the table to retrieve all records:

SELECT * FROM vector_table;

To find the nearest neighbors to a given vector based on cosine distance, you can execute the following query:

SELECT * FROM vector_table 
ORDER BY vec_cosine_distance(embedding, '[1,1,3]') 
LIMIT 3;

This query orders the rows by their cosine distance to the vector [1,1,3], smallest distance first, so the closest matches come back at the top.
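
To see what this ordering does, here is a small pure-Python sketch that ranks the same three rows against [1,1,3]. It assumes the usual definition of cosine distance (one minus the cosine of the angle between two vectors) and scans every row, so it illustrates the ranking only, not how TiDB executes the query.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


rows = [
    (1, "apple", [1, 1, 1]),
    (2, "banana", [1, 1, 2]),
    (3, "dog", [2, 2, 2]),
]
query = [1, 1, 3]

# Equivalent in spirit to ORDER BY vec_cosine_distance(...) LIMIT 3.
ranked = sorted(rows, key=lambda row: cosine_distance(row[2], query))[:3]

for row_id, doc, embedding in ranked:
    print(row_id, doc, round(cosine_distance(embedding, query), 4))
```

In this sketch 'banana' ranks first, with a distance of roughly 0.015. 'apple' and 'dog' end up at essentially the same distance because [1,1,1] and [2,2,2] point in the same direction: cosine distance compares direction and ignores magnitude.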

Advanced Vector Operations

TiDB also supports various vector distance functions, such as:

Vec_L1_Distance: Manhattan Distance
Vec_L2_Distance: Euclidean (L2) Distance
Vec_Cosine_Distance: Cosine Distance
Vec_Negative_Inner_Product: Negative Inner Product

You can use these functions in your SQL queries to perform different types of vector comparisons. For example, to compute the cosine distance between two vectors:

SELECT vec_cosine_distance('[1,1,1]', '[1,2,3]');
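
For reference, the four metrics can be sketched in plain Python using their textbook definitions. The function names below are illustrative and are not TiDB APIs. Note that l2_distance here is the plain (unsquared) Euclidean distance; check the TiDB documentation for the exact form Vec_L2_Distance returns.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Negated dot product, so a larger inner product sorts first."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


a, b = [1, 1, 1], [1, 2, 3]
print("L1:", l1_distance(a, b))
print("L2:", round(l2_distance(a, b), 4))
print("negative inner product:", negative_inner_product(a, b))
print("cosine distance:", round(cosine_distance(a, b), 4))
```

For these two vectors the sketch gives an L1 distance of 3, an L2 distance of about 2.2361, a negative inner product of -6, and a cosine distance of about 0.0742. All four are "smaller is closer" measures, which is why the inner product is negated.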

Indexing for Faster Vector Searches

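To speed up vector searches, you can create an HNSW (Hierarchical Navigable Small World) index on a vector column. HNSW is an approximate nearest-neighbor index, so it trades a small amount of recall for much faster queries on large tables. In this example the index is declared at table creation, through a column comment that names the distance metric: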

CREATE TABLE vector_table_with_index (
    id INT PRIMARY KEY, 
    doc TEXT, 
    embedding VECTOR(3) COMMENT 'hnsw(distance=cosine)'
);

Integration with AI Frameworks

TiDB vector search integrates seamlessly with popular AI frameworks, such as LangChain and LlamaIndex. These integrations enable advanced applications like semantic search using OpenAI embeddings. For example, you can set up a semantic search environment with the following steps:

Prepare Your Environment: Set up a virtual environment and install the necessary dependencies.
Define Your Data Model: Create a data model using the peewee ORM with vector support.
Generate and Insert Embeddings: Use OpenAI’s API to generate embeddings for your data and insert them into your TiDB table.
Query for Similar Documents: Execute queries to find documents semantically similar to a given query.

import os
from openai import OpenAI
from peewee import Model, MySQLDatabase, TextField, SQL
from tidb_vector.peewee import VectorField

# Set up OpenAI client and database connection
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
db = MySQLDatabase(
    'test',
    user=os.getenv('TIDB_USERNAME'),
    password=os.getenv('TIDB_PASSWORD'),
    host=os.getenv('TIDB_HOST'),
    port=4000,
    ssl_verify_cert=True,
    ssl_verify_identity=True,
)

# Define the data model
class DocModel(Model):
    text = TextField()
    embedding = VectorField(dimensions=1536)

    class Meta:
        database = db
        table_name = "doc_test"

# Connect to the database and create the table
db.connect()
db.create_tables([DocModel])

# Generate embeddings and insert into the database
documents = ["Example document 1", "Example document 2", "Example document 3"]
response = client.embeddings.create(input=documents, model="text-embedding-3-small")
embeddings = [r.embedding for r in response.data]
data_source = [{"text": doc, "embedding": emb} for doc, emb in zip(documents, embeddings)]
DocModel.insert_many(data_source).execute()

# Query for similar documents
question = "Example query"
question_embedding = client.embeddings.create(input=question, model="text-embedding-3-small").data[0].embedding
related_docs = (
    DocModel.select(
        DocModel.text,
        DocModel.embedding.cosine_distance(question_embedding).alias("distance"),
    )
    .order_by(SQL("distance"))
    .limit(3)
)

# Output the results
for doc in related_docs:
    print(doc.distance, doc.text)

db.close()

This approach demonstrates how TiDB’s vector search capability can be integrated into AI applications, leveraging the power of embeddings and similarity searches to deliver more intelligent and context-aware solutions.

Conclusion

TiDB’s vector search feature bridges the gap between traditional relational databases and modern AI-driven applications. By allowing users to manage vectors in the same way they operate on MySQL data, TiDB offers a flexible and powerful platform for developing innovative solutions that require both robust data management and advanced search capabilities. Whether you are building semantic search engines, recommendation systems, or any application that benefits from understanding the meaning behind data, TiDB’s vector search provides the tools you need to succeed.

For more information and to start using TiDB vector search, visit TiDB Cloud and join the waitlist to get access to this exciting feature.

More Demos

OpenAI Embedding: use the OpenAI embedding model to generate vectors for text data.
Image Search: use the OpenAI CLIP model to generate vectors for images and text.
LlamaIndex RAG with UI: use LlamaIndex to build a RAG (Retrieval-Augmented Generation) application.
Chat with URL: use LlamaIndex to build a RAG (Retrieval-Augmented Generation) application that can chat with a URL.
GraphRAG: 20 lines of code that use TiDB Serverless to build a Knowledge Graph-based RAG application.
GraphRAG Step by Step Tutorial: a step-by-step tutorial for building a Knowledge Graph-based RAG application with a Colab notebook. In this tutorial, you will learn how to extract knowledge from a text corpus, build a Knowledge Graph, store it in TiDB Serverless, and search it.

Last updated June 4, 2024
