{"id":27969,"date":"2025-06-25T10:37:40","date_gmt":"2025-06-25T17:37:40","guid":{"rendered":"https:\/\/www.pingcap.com\/?post_type=article&#038;p=27969"},"modified":"2025-07-09T20:58:08","modified_gmt":"2025-07-10T03:58:08","slug":"building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data","status":"publish","type":"article","link":"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/","title":{"rendered":"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data"},"content":{"rendered":"<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Getting_Started\"><\/span>Getting Started<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more accurate, grounded, and context-aware outputs. If you&#8217;re just getting started, this guide walks you through the essential foundations\u2014what RAG is, how it works, and how to prepare your data for use in a RAG pipeline.<\/p>\n\n\n\n<p>From selecting and preprocessing the right content to embedding it and storing it efficiently, we\u2019ll help you lay the groundwork for a powerful AI application.<\/p>\n\n\n\n<p><strong>Data Prepared?<\/strong>&nbsp;Jump to:&nbsp;<a href=\"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-retrieval-and-generation-components\/\">Retrieval and Generation Components<\/a> \u2192<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Basics_of_RAG\"><\/span>Basics of RAG<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What is Retrieval-Augmented Generation?<\/h3>\n\n\n\n<p>A RAG system fetches documents or other pieces of data relevant to a query and uses the retrieved information to produce more accurate and contextually appropriate 
responses. Grounding generation in real, external knowledge reduces the chance of hallucinations (instances where the model produces plausible but inaccurate or nonsensical output).<\/p>\n\n\n\n<p>Key concepts in RAG include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Knowledge Base<\/strong>: A structured repository of information from which the retrieval component fetches data.<\/li>\n\n\n\n<li><strong>Retrieval Component<\/strong>: Searches through a knowledge base to find relevant information that can be used to answer a query.<\/li>\n\n\n\n<li><strong>Generation Component<\/strong>: Using the retrieved information, the language model generates a response that is coherent and contextually relevant.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">RAG Model vs Traditional Models<\/h3>\n\n\n\n<p>Traditional language models rely solely on the data they were trained on, which can lead to outdated or incorrect responses, especially when dealing with dynamic information. 
In contrast, RAG models dynamically incorporate up-to-date information from external sources, making them more reliable and versatile.<\/p>\n\n\n\n<p>Key differences include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Contextual Accuracy<\/strong>: RAG models provide more accurate responses by retrieving real-time data, whereas traditional models might generate outdated or incorrect information.<\/li>\n\n\n\n<li><strong>Flexibility<\/strong>: RAG systems can adapt to new information quickly, while traditional models require retraining to incorporate new data.<\/li>\n\n\n\n<li><strong>Complex Query Handling<\/strong>: By leveraging external knowledge, RAG models can handle more complex and nuanced queries effectively.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Benefits and Use Cases<\/h3>\n\n\n\n<p>The adoption of RAG offers numerous benefits, particularly in scenarios where accuracy and contextual relevance are paramount. Some of the key advantages include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Enhanced Accuracy<\/strong>: By grounding responses in real-world data, RAG models can improve the accuracy of generated content.<\/li>\n\n\n\n<li><strong>Reduced Hallucinations<\/strong>: The integration of external knowledge helps mitigate the risk of generating incorrect or nonsensical responses.<\/li>\n\n\n\n<li><strong>Scalability<\/strong>: RAG systems can handle vast amounts of data, making them suitable for enterprise-level applications.<\/li>\n<\/ul>\n\n\n\n<p>With these benefits, RAG is already making significant strides in various industries, demonstrating its practical value and versatility. Some notable applications include:<\/p>\n\n\n\n<p><strong>Customer Support:<\/strong>&nbsp;RAG helps customer service representatives deliver more accurate and timely responses. 
For example, Algo Communications saw their CSRs gain confidence in handling complex queries thanks to RAG-enhanced answers grounded in internal knowledge bases.<\/p>\n\n\n\n<p><strong>Enterprise AI:<\/strong>&nbsp;Enterprises use RAG to improve internal search tools, enabling users to find more contextually relevant and semantically accurate results across large document sets.<\/p>\n\n\n\n<p><strong>Content Creation:<\/strong>&nbsp;Writers and marketing teams can generate high-quality, up-to-date content by combining generative models with real-time data retrieval, ensuring both originality and factual accuracy.<\/p>\n\n\n\n<p><strong>Healthcare:<\/strong>&nbsp;RAG supports medical professionals by retrieving the latest research and clinical guidelines, helping improve diagnosis, treatment recommendations, and patient outcomes.<\/p>\n\n\n\n<p><strong>Research &amp; Development:<\/strong>&nbsp;RAG applications in R&amp;D environments allow researchers to stay informed on cutting-edge studies by pulling data from academic sources and conference proceedings, streamlining literature reviews and hypothesis validation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Setting_up_Your_Development_Environment\"><\/span>Setting up Your Development Environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Before diving into the code, it\u2019s crucial to set up a robust development environment. This section will guide you through the necessary tools and libraries, as well as how to prepare your workspace for building a Retrieval-Augmented Generation (RAG) application.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Required Tools and Libraries<\/h3>\n\n\n\n<p>To build a RAG application, you\u2019ll need a set of essential tools and libraries. 
Here\u2019s an overview of the necessary software:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Python<\/strong>: The primary programming language for this project.<\/li>\n\n\n\n<li><strong>LangChain<\/strong>: A powerful framework that connects large language models (LLMs) to data sources, providing features such as evaluation libraries, document loaders, and query methods.<\/li>\n\n\n\n<li><strong>FastAPI<\/strong>: A modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints.<\/li>\n\n\n\n<li><strong>LangCorn<\/strong>: An API server that enables you to serve LangChain models and pipelines with ease, leveraging FastAPI for a robust and efficient experience.<\/li>\n\n\n\n<li><strong>Langserve<\/strong>: Integrated with FastAPI and the LangChain Expression Language interface, ensuring compatibility with LangChain and providing templates for easy deployment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Installation Guides for LangChain, FastAPI, etc.<\/h3>\n\n\n\n<p>To get started, you\u2019ll need to install these tools and libraries. Follow these steps:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Install Python<\/strong>: Ensure you have Python 3.7 or higher installed. 
You can download it from&nbsp;<a href=\"https:\/\/www.python.org\/\">python.org<\/a>.<\/li>\n\n\n\n<li><strong>Set Up a Virtual Environment<\/strong>: Run <kbd><code>python -m venv rag_env<\/code><\/kbd>, then activate it with <kbd><code>source rag_env\/bin\/activate<\/code><\/kbd> (on Windows, run <kbd><code>rag_env\\Scripts\\activate<\/code><\/kbd> instead).<\/li>\n\n\n\n<li><strong>Install FastAPI<\/strong>: <kbd><code>pip install fastapi<\/code><\/kbd><\/li>\n\n\n\n<li><strong>Install LangChain<\/strong>: <kbd><code>pip install langchain<\/code><\/kbd><\/li>\n\n\n\n<li><strong>Install LangCorn<\/strong>: <kbd><code>pip install langcorn<\/code><\/kbd><\/li>\n\n\n\n<li><strong>Install Langserve<\/strong>: <kbd><code>pip install langserve<\/code><\/kbd><\/li>\n<\/ol>\n\n\n\n<p>These installations will set the foundation for your RAG application, enabling you to leverage the power of LangChain and FastAPI seamlessly.<\/p>\n\n\n\n<p>With the necessary tools and libraries installed, the next step is to prepare your workspace. This involves setting up a virtual environment and organizing your project structure for optimal development efficiency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up a Virtual Environment<\/h3>\n\n\n\n<p>A virtual environment helps isolate your project\u2019s dependencies, ensuring that they don\u2019t interfere with other projects on your system. Here\u2019s how to set it up:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Create a Virtual Environment<\/strong>: <kbd><code>python -m venv rag_env<\/code><\/kbd><\/li>\n\n\n\n<li><strong>Activate the Virtual Environment<\/strong>: <kbd><code>source rag_env\/bin\/activate<\/code><\/kbd> (on Windows, run <kbd><code>rag_env\\Scripts\\activate<\/code><\/kbd> instead)<\/li>\n\n\n\n<li><strong>Install Dependencies<\/strong>: <kbd><code>pip install fastapi langchain langcorn langserve<\/code><\/kbd><\/li>\n<\/ol>\n\n\n\n<p>By using a virtual environment, you ensure that all dependencies are contained within your project, making it easier to manage and deploy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Organizing Your Project Structure<\/h3>\n\n\n\n<p>A well-organized project structure is key to maintaining clarity and efficiency as your project grows. 
Here\u2019s a recommended structure for your RAG application:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>rag_project\/\n\u2502\n\u251c\u2500\u2500 app\/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 main.py\n\u2502   \u251c\u2500\u2500 models\/\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u2514\u2500\u2500 langchain_model.py\n\u2502   \u251c\u2500\u2500 routers\/\n\u2502   \u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2502   \u2514\u2500\u2500 api.py\n\u2502   \u2514\u2500\u2500 utils\/\n\u2502       \u251c\u2500\u2500 __init__.py\n\u2502       \u2514\u2500\u2500 helpers.py\n\u2502\n\u251c\u2500\u2500 data\/\n\u2502   \u251c\u2500\u2500 raw\/\n\u2502   \u2514\u2500\u2500 processed\/\n\u2502\n\u251c\u2500\u2500 tests\/\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 test_main.py\n\u2502\n\u251c\u2500\u2500 .env\n\u251c\u2500\u2500 requirements.txt\n\u2514\u2500\u2500 README.md\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>app\/<\/strong>: Contains the main application code, including models, routers, and utility functions.<\/li>\n\n\n\n<li><strong>data\/<\/strong>: Stores raw and processed data used by your RAG application.<\/li>\n\n\n\n<li><strong>tests\/<\/strong>: Contains test cases to ensure your application works as expected.<\/li>\n\n\n\n<li><strong>.env<\/strong>: Stores environment variables.<\/li>\n\n\n\n<li><strong>requirements.txt<\/strong>: Lists all the dependencies required for your project.<\/li>\n\n\n\n<li><strong>README.md<\/strong>: Provides an overview and instructions for your project.<\/li>\n<\/ul>\n\n\n\n<p>By following this structure, you create a clear and maintainable codebase, making it easier to develop, test, and deploy your RAG application.<\/p>\n\n\n\n<p>With your development environment set up and your workspace organized, you\u2019re now ready to move on to the next phase: data preparation. 
This will involve collecting and preprocessing the data that will form the backbone of your RAG application.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Data_Preparation\"><\/span>Data Preparation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Data preparation is a pivotal phase in building a Retrieval-Augmented Generation (RAG) application. This stage sets the groundwork for effective data utilization in later stages, ensuring that the information fed into your system is clean, structured, and ready for retrieval. Let\u2019s delve into the steps involved in collecting, preprocessing, and structuring your data to create a robust knowledge base.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Collecting Data<\/h3>\n\n\n\n<p>The first step in data preparation is identifying reliable sources of data. Depending on your application\u2019s domain, these sources can vary widely. Here are some common types of data sources:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Public Datasets<\/strong>: Platforms like Kaggle, UCI Machine Learning Repository, and government databases offer a wealth of publicly available datasets.<\/li>\n\n\n\n<li><strong>Internal Databases<\/strong>: Your organization\u2019s internal databases can be a goldmine of relevant information.<\/li>\n\n\n\n<li><strong>APIs<\/strong>: Many services provide APIs to access real-time data, such as social media feeds, news sites, and academic journals.<\/li>\n\n\n\n<li><strong>Web Scraping<\/strong>: For more niche data requirements, web scraping can be an effective method to gather information from various websites.<\/li>\n<\/ul>\n\n\n\n<p>When selecting data sources, ensure they are reliable, up-to-date, and relevant to your application\u2019s needs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cleaning and Formatting Data<\/h3>\n\n\n\n<p>Once you\u2019ve collected your data, the next crucial step is cleaning and formatting it. 
Raw data often contains noise, inconsistencies, and irrelevant information that can hinder the performance of your RAG application. Here are some key steps in the data cleaning process:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Removing Duplicates<\/strong>: Ensure that your dataset does not contain duplicate entries, which can skew results and increase processing time.<\/li>\n\n\n\n<li><strong>Handling Missing Values<\/strong>: Decide how to handle missing values\u2014whether by removing incomplete records or imputing missing data using statistical methods.<\/li>\n\n\n\n<li><strong>Standardizing Formats<\/strong>: Ensure consistency in data formats, such as dates, numerical values, and text fields.<\/li>\n\n\n\n<li><strong>Filtering Irrelevant Information<\/strong>: Remove any data that is not pertinent to your application\u2019s objectives.<\/li>\n<\/ol>\n\n\n\n<p>For example, if you\u2019re building a customer service chatbot, you might filter out non-customer-related interactions from your dataset.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Creating a Knowledge Base<\/h3>\n\n\n\n<p>With your data cleaned and formatted, the next step is to structure it into a knowledge base that your RAG application can efficiently retrieve information from.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Structuring Data for Retrieval<\/h4>\n\n\n\n<p>A well-structured knowledge base is essential for efficient data retrieval. Here are some best practices for organizing your data:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Categorization<\/strong>: Group related data into categories or topics to facilitate quick retrieval. For instance, in a customer service application, you might categorize data by product type, issue type, or customer demographics.<\/li>\n\n\n\n<li><strong>Metadata Tagging<\/strong>: Enhance your data with metadata tags that provide additional context and improve search accuracy. 
Tags can include keywords, timestamps, authorship, and more.<\/li>\n\n\n\n<li><strong>Normalization<\/strong>: Ensure that your data follows a consistent structure and format, making it easier to index and search.<\/li>\n<\/ul>\n\n\n\n<p>By structuring your data effectively, you create a solid foundation for the retrieval component of your RAG application.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Indexing Techniques<\/h4>\n\n\n\n<p>Indexing is a critical step that enables fast and efficient data retrieval. Here are some common indexing techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Inverted Index<\/strong>: This technique maps each term to the documents in which it appears, so a keyword lookup does not have to scan the whole dataset. It\u2019s particularly useful for text-based data.<\/li>\n\n\n\n<li><strong>Vector Indexing<\/strong>: For applications involving semantic search, vector indexing can be highly effective. This involves converting data into high-dimensional vectors and using k-nearest neighbors (k-NN) search, often in an approximate form, to find similar items.<\/li>\n\n\n\n<li><strong>Hybrid Indexing<\/strong>: Combining keyword and vector indexes lets a query match on both exact terms and semantic similarity, balancing retrieval speed with relevance.<\/li>\n<\/ul>\n\n\n\n<p>For instance, TiDB\u2019s vector indexing features can improve the performance of your RAG application, especially when dealing with large-scale data.<\/p>\n\n\n\n<p>By meticulously preparing your data and creating a well-structured knowledge base, you set the stage for building a powerful and efficient Retrieval-Augmented Generation application. 
The next step will involve implementing the retrieval component, where you\u2019ll put your prepared data to work.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>At this point, you should have a solid understanding of how to prepare your data for a RAG application: collecting and cleaning it, structuring it into a knowledge base, and choosing an indexing technique.<\/p>\n\n\n\n<p>In the next part of this series, we\u2019ll take you through querying your vector store, integrating a language model, and deploying your RAG pipeline in a real-world scenario.<\/p>\n\n\n\n<p><strong>Continue reading:<\/strong>&nbsp;Part 2:&nbsp;<a href=\"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-retrieval-and-generation-components\/\">Retrieval and Generation Components<\/a> \u2192<\/p>","protected":false},"excerpt":{"rendered":"<p>Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more accurate, grounded, and context-aware outputs. If you&#8217;re just getting started, this guide walks you through the essential foundations\u2014what RAG is, how it works, and how to prepare your data for use in a RAG pipeline. 
From [&hellip;]<\/p>\n","protected":false},"author":305,"featured_media":0,"template":"","class_list":["post-27969","article","type-article","status-publish","hentry"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB<\/title>\n<meta name=\"description\" content=\"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. Ideal for developers building scalable LLM-powered systems with TiDB.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB\" \/>\n<meta property=\"og:description\" content=\"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. 
Ideal for developers building scalable LLM-powered systems with TiDB.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\" \/>\n<meta property=\"og:site_name\" content=\"TiDB\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/facebook.com\/pingcap2015\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-10T03:58:08+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/static.pingcap.com\/files\/2024\/09\/11005522\/Homepage-Ad.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1440\" \/>\n\t<meta property=\"og:image:height\" content=\"714\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@PingCAP\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"9\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\",\"url\":\"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\",\"name\":\"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB\",\"isPartOf\":{\"@id\":\"https:\/\/www.pingcap.com\/#website\"},\"datePublished\":\"2025-06-25T17:37:40+00:00\",\"dateModified\":\"2025-07-10T03:58:08+00:00\",\"description\":\"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. 
Ideal for developers building scalable LLM-powered systems with TiDB.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.pingcap.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Articles\",\"item\":\"https:\/\/www.pingcap.com\/article\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.pingcap.com\/#website\",\"url\":\"https:\/\/www.pingcap.com\/\",\"name\":\"TiDB\",\"description\":\"TiDB | SQL at 
Scale\",\"publisher\":{\"@id\":\"https:\/\/www.pingcap.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.pingcap.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.pingcap.com\/#organization\",\"name\":\"PingCAP\",\"url\":\"https:\/\/www.pingcap.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"width\":811,\"height\":232,\"caption\":\"PingCAP\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/facebook.com\/pingcap2015\",\"https:\/\/x.com\/PingCAP\",\"https:\/\/linkedin.com\/company\/pingcap\",\"https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB","description":"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. 
Ideal for developers building scalable LLM-powered systems with TiDB.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/","og_locale":"ko_KR","og_type":"article","og_title":"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB","og_description":"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. Ideal for developers building scalable LLM-powered systems with TiDB.","og_url":"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/","og_site_name":"TiDB","article_publisher":"https:\/\/facebook.com\/pingcap2015","article_modified_time":"2025-07-10T03:58:08+00:00","og_image":[{"width":1440,"height":714,"url":"https:\/\/static.pingcap.com\/files\/2024\/09\/11005522\/Homepage-Ad.png","type":"image\/png"}],"twitter_card":"summary_large_image","twitter_site":"@PingCAP","twitter_misc":{"Est. 
reading time":"9\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/","url":"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/","name":"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data | TiDB","isPartOf":{"@id":"https:\/\/www.pingcap.com\/#website"},"datePublished":"2025-06-25T17:37:40+00:00","dateModified":"2025-07-10T03:58:08+00:00","description":"Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more Learn the fundamentals of Retrieval-Augmented Generation (RAG) and how to prepare unstructured data for RAG-based AI applications. Ideal for developers building scalable LLM-powered systems with TiDB.","breadcrumb":{"@id":"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.pingcap.com\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pingcap.com\/"},{"@type":"ListItem","position":2,"name":"Articles","item":"https:\/\/www.pingcap.com\/article\/"},{"@type":"ListItem","position":3,"name":"Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data"}]},{"@type":"WebSite","@id":"https:\/\/www.pingcap.com\/#website","url":"https:\/\/www.pingcap.com\/","name":"\ud2f0DB","description":"TiDB | SQL at 
Scale","publisher":{"@id":"https:\/\/www.pingcap.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pingcap.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/www.pingcap.com\/#organization","name":"PingCAP","url":"https:\/\/www.pingcap.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/","url":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","contentUrl":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","width":811,"height":232,"caption":"PingCAP"},"image":{"@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/facebook.com\/pingcap2015","https:\/\/x.com\/PingCAP","https:\/\/linkedin.com\/company\/pingcap","https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA"]}]}},"card_markup":"        <a class=\"card-article\" href=\"https:\/\/www.pingcap.com\/ko\/article\/building-a-rag-application-from-scratch-understanding-rag-and-preparing-your-data\/\">            <h3>Building a RAG Application from Scratch: Understanding RAG and Preparing Your Data<\/h3>            <p>Getting Started Retrieval-Augmented Generation (RAG) combines the strengths of large language models with external knowledge sources to produce more accurate, grounded, and context-aware outputs. If you&#8217;re just getting started, this guide walks you through the essential foundations\u2014what RAG is, how it works, and how to prepare your data for use in a RAG pipeline. 
From [&hellip;]<\/p>        <\/a>","_links":{"self":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/article\/27969","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/article"}],"about":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/types\/article"}],"author":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/users\/305"}],"wp:attachment":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/media?parent=27969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}