{"id":1445,"date":"2021-04-19T00:00:00","date_gmt":"2021-04-19T00:00:00","guid":{"rendered":"https:\/\/en.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/"},"modified":"2024-08-20T07:06:03","modified_gmt":"2024-08-20T14:06:03","slug":"use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times","status":"publish","type":"post","link":"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/","title":{"rendered":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times"},"content":{"rendered":"<p><a href=\"https:\/\/docs.pingcap.com\/tidb\/stable\/\">TiDB<\/a> is a Hybrid Transaction\/Analytical Processing (HTAP) database that can efficiently process analytical queries. However, when large amounts of data are involved, the CPU becomes the bottleneck for processing queries that include <code>JOIN<\/code> statements and\/or aggregation functions.<\/p>\n<p>At the same time, the GPU is rapidly gaining popularity in areas of scientific computing, AI, data processing, and so on. It outperforms the CPU by orders of magnitude in such areas. GPU-accelerated databases are also emerging and getting lots of attention in the database market.<\/p>\n<p>We thought that it might be possible to empower TiDB using GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing. 
This idea later became our TiDB Hackathon 2020 project, in which we achieved 10x~150x performance improvement and won two medals (3rd prize and Yunqi partner&#8217;s Best Market Potential).<\/p>\n<p>In this post, we&#8217;ll share some technical details about our project and analyze our benchmark results.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Rationale\"><\/span>Rationale<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Our goal is to GPU-accelerate the most CPU-intensive part of query processing, namely, the relational operators that process in-memory data, such as join and aggregation. Therefore, we&#8217;ll need to reimplement those operators using GPU programming techniques. Although the idea is straightforward, one major challenge is that TiDB is implemented in Golang, while the GPU is most commonly programmed in C\/C++-like languages. This challenge directly leads to the following design decisions:<\/p>\n<ul>\n<li>We will use Compute Unified Device Architecture (CUDA) as our GPU programming language. It&#8217;s programmer friendly and has extensive documentation.<\/li>\n<li>The GPU operators will be implemented in C++\/CUDA rather than Golang.<\/li>\n<li>TiDB (Golang) and the GPU operators (C++\/CUDA) will use intra-process communication. This way, we won&#8217;t introduce remote procedure calls (RPCs), serialization\/deserialization, and network transmission overhead. Thus, Cgo is our only language option.<\/li>\n<li>We need a lightweight protocol to simplify crossing the programming boundary between TiDB and the GPU operators. This will ease the programming burden introduced by Cgo.<\/li>\n<\/ul>\n<p>These design decisions are the driving force behind our architecture, which is described below.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Architecture\"><\/span>Architecture<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The following figure shows our architecture. 
As you can see, the TiDB process can be divided into components written in Golang (the blue box) and those written in C++\/CUDA (the green box).<\/p>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-18964\" src=\"https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb.jpg\" alt=\"architecture-of-gpu-accelerated-tidb\" width=\"1582\" height=\"824\" srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb.jpg 1582w, https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb-300x156.jpg 300w, https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb-1024x533.jpg 1024w, https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb-768x400.jpg 768w, https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb-1536x800.jpg 1536w, https:\/\/static.pingcap.com\/files\/2021\/04\/20070220\/architecture-of-gpu-accelerated-tidb-1440x750.jpg 1440w\" sizes=\"auto, (max-width: 1582px) 100vw, 1582px\" \/>\n<div class=\"caption-center\">Architecture of GPU-accelerated TiDB<\/div>\n<p>All TiDB legacy structures, including the parser, planner, optimizer, and executor are in the Golang world. We also added two other components: the plan translator and the execution adapter. The first component translates a TiDB query plan into a CUDA Relational Algebra or &#8220;CURA&#8221; plan, and the other adapts the TiDB execution model to the execution of the CURA plan with the actual data properly fed. CURA is a library implemented in C++\/CUDA, containing our GPU operators, as well as some auxiliary components to organize and execute the GPU operators.<\/p>\n<p>The architecture also includes a protocol that defines the interaction between CURA and the execution adapter. It is exposed as several CURA C APIs. 
The execution adapter uses Cgo to call this protocol. (See the gray arrow.)<\/p>\n<p>You can find our source code here:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/zanmato1984\/cura\">CURA<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/windtalker\/tidb\/tree\/tidb_cura\">Customized TiDB<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/windtalker\/tikv\/tree\/tikv_cura\">Customized TiKV<\/a><\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Query_processing\"><\/span>Query processing<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The query processing steps of parsing, planning, and optimization are basically the same as before. Our work focuses on the orange boxes in the architectural figure.<\/p>\n<h3>Plan translation<\/h3>\n<p>As the following figure shows, given an optimized TiDB query plan\u2014one that has optimal join orders and columns properly pruned\u2014the plan translator cherry-picks all the operators other than <code>Table Scan<\/code>s (the gray circles) from the operator tree and packs them into a single CURA plan (the green circle).<\/p>\n<img loading=\"lazy\" decoding=\"async\" width=\"1520\" height=\"540\" class=\"wp-image-1446\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2021\/04\/query-plan-translation.png\" alt=\"Plan translation\" srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/query-plan-translation.png 1520w, https:\/\/static.pingcap.com\/files\/2021\/04\/query-plan-translation-300x107.png 300w, https:\/\/static.pingcap.com\/files\/2021\/04\/query-plan-translation-1024x364.png 1024w, https:\/\/static.pingcap.com\/files\/2021\/04\/query-plan-translation-768x273.png 768w, https:\/\/static.pingcap.com\/files\/2021\/04\/query-plan-translation-1440x512.png 1440w\" sizes=\"auto, (max-width: 1520px) 100vw, 1520px\" \/>\n<div class=\"caption-center\">Query plan translation<\/div>\n<p>The original operator tree is subsequently transformed into a new one, with a single CURA plan node as root, and all the original 
<code>Table Scan<\/code>s as leaves. <code>Table Scan<\/code>s are mainly I\/O-bound, and thus remain unchanged. The CURA plan indicates which portion of the query the GPU executes. It is important to pack the CURA plan with multiple nodes from the original operator tree. That way, we can avoid unnecessary memory copies back and forth between the host and device.<\/p>\n<p>The figure above shows the plan translation for the following query:<\/p>\n<pre><code>SELECT * FROM t0 JOIN t1 ON t0.id = t1.id WHERE t0.value = 42\n<\/code><\/pre>\n<p>The translated CURA plan is a JSON-formatted representation of the subtree enclosed by the left gray dashed circle, with one placeholder, <code>InputSource<\/code>, for each <code>Table Scan<\/code> node. Each <code>InputSource<\/code> is later anchored by the actual <code>Table Scan<\/code> during CURA execution, which is described in the next section. The following snippet is the actual JSON text of this CURA plan. If you are not interested in what a CURA plan looks like, you can skip to the next section.<\/p>\n<pre><code>{\"rels\":[\n  {\"rel_op\":\"InputSource\",\"source_id\":0,\"schema\":[{\"type\":\"INT32\",\"nullable\":false},{\"type\":\"INT32\",\"nullable\":false}]},\n  {\"rel_op\":\"Filter\",\"condition\":{\"binary_op\":\"EQUAL\",\"operands\":[{\"col_ref\":1},{\"type\":\"INT32\",\"literal\":42}],\"type\":{\"type\":\"BOOL8\",\"nullable\":false}}},\n  {\"rel_op\":\"InputSource\",\"source_id\":1,\"schema\":[{\"type\":\"INT32\",\"nullable\":false}]},\n  {\"rel_op\":\"HashJoin\",\"type\":\"INNER\",\"build_side\":\"LEFT\",\"condition\":{\"binary_op\":\"EQUAL\",\"operands\":[{\"col_ref\":0},{\"col_ref\":2}],\"type\":{\"type\":\"BOOL8\",\"nullable\":false}}}\n]}\n<\/code><\/pre>\n<h3>CURA execution<\/h3>\n<img loading=\"lazy\" decoding=\"async\" width=\"1520\" height=\"681\" class=\"wp-image-1447\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2021\/04\/cura-execution.png\" alt=\"CURA execution\" 
srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/cura-execution.png 1520w, https:\/\/static.pingcap.com\/files\/2021\/04\/cura-execution-300x134.png 300w, https:\/\/static.pingcap.com\/files\/2021\/04\/cura-execution-1024x459.png 1024w, https:\/\/static.pingcap.com\/files\/2021\/04\/cura-execution-768x344.png 768w, https:\/\/static.pingcap.com\/files\/2021\/04\/cura-execution-1440x645.png 1440w\" sizes=\"auto, (max-width: 1520px) 100vw, 1520px\" \/>\n<div class=\"caption-center\">CURA execution<\/div>\n<p>During execution, the execution adapter first passes the CURA plan to CURA to compile it into an internal physical plan. The figure above shows the physical plan and the dataflow of the CURA plan discussed earlier. We introduce &#8220;pipelines,&#8221; each one containing a subset of the GPU operators reconstructed from the CURA plan, to organize the GPU operators and manage CPU parallelism (multiple CPU threads). Data are formed as &#8220;fragments&#8221; (shown in the figure as window icons), and each is a horizontal split of one table.<\/p>\n<p>For example, due to the intrinsic property of hash join, <code>Probe<\/code> can only be started after <code>Build<\/code> finishes; thus, the whole execution breaks into two pipelines. <code>Filter<\/code> doesn&#8217;t have this limitation, so it can be &#8220;pipelined&#8221; with <code>Build<\/code> within the same pipeline. The fragments from <code>Table Scan t0<\/code> (the light gray window icon) flow through <code>Filter<\/code> and go into <code>Build<\/code> right after. Meanwhile, in pipeline 1 (the final pipeline), fragments from <code>Table Scan t1<\/code> (the middle gray window icon) flow through <code>Probe<\/code> and leave CURA as the final result (the dark gray window icon) emitted to the client. Each pipeline accepts fragments in arbitrary CPU parallelism. 
The need for synchronization is rare and trivial, and the GPU operators take care of it when necessary.<\/p>\n<p>The execution adapter interacts with CURA via the protocol (the bidirectional arrow). Through the protocol, the adapter performs the following steps:<\/p>\n<ul>\n<li>Anchor each <code>Table Scan<\/code> to the corresponding pipeline.<\/li>\n<li>Iterate over all pipelines in the specified order.<\/li>\n<li>For each pipeline, drain fragments from the anchored <code>Table Scan<\/code>s and feed them to the pipeline, with configurable CPU parallelism.<\/li>\n<li>Emit the data output by the final pipeline to the client.<\/li>\n<\/ul>\n<p>This protocol also brings us another benefit: CURA becomes a general GPU query runtime that you can adapt to any other database by following the protocol.<\/p>\n<h3>GPU operators<\/h3>\n<p>We build our own GPU operators on top of <a href=\"https:\/\/github.com\/rapidsai\/cudf\">cuDF<\/a>, the core component of the <a href=\"https:\/\/rapids.ai\/\">RAPIDS<\/a> project, which serves as a solid primitive. cuDF supplies mature dataframe APIs running on the GPU. It requires data to be in columnar format and use the <a href=\"https:\/\/github.com\/apache\/arrow\">Arrow<\/a> layout. Luckily, our fragments match both, so other than host-to-device memory copies for input and device-to-host memory copies for output (final pipeline only), we don&#8217;t need to reformat any CPU-side data. We won&#8217;t go into much detail about cuDF in this post. If you&#8217;d like more information on cuDF, see its <a href=\"https:\/\/github.com\/rapidsai\/cudf\">GitHub repo<\/a>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Benchmarking\"><\/span>Benchmarking<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We chose TPC-H 50G as our benchmark data set. 
Rather than use all 22 queries, we picked the four most CPU-intensive ones: <a href=\"https:\/\/github.com\/pingcap\/tidb-bench\/blob\/master\/tpch\/queries\/5.sql\">Q5<\/a>, <a href=\"https:\/\/github.com\/pingcap\/tidb-bench\/blob\/master\/tpch\/queries\/9.sql\">Q9<\/a>, <a href=\"https:\/\/github.com\/pingcap\/tidb-bench\/blob\/master\/tpch\/queries\/17.sql\">Q17<\/a>, and <a href=\"https:\/\/github.com\/pingcap\/tidb-bench\/blob\/master\/tpch\/queries\/18.sql\">Q18<\/a>. We ran these queries on a powerful server running legacy TiDB (CPU only) and on a commodity PC running our GPU-accelerated TiDB. (We don&#8217;t have a GPU server.) The hardware specs are shown in the figure below:<\/p>\n<img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"540\" class=\"wp-image-1448\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2021\/04\/hardware-comparison-cpu-vs-gpu-tidb.jpg\" alt=\"Hardware comparison between CPU TiDB and GPU TiDB\" srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/hardware-comparison-cpu-vs-gpu-tidb.jpg 960w, https:\/\/static.pingcap.com\/files\/2021\/04\/hardware-comparison-cpu-vs-gpu-tidb-300x169.jpg 300w, https:\/\/static.pingcap.com\/files\/2021\/04\/hardware-comparison-cpu-vs-gpu-tidb-768x432.jpg 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>\n<div class=\"caption-center\">Hardware comparison between CPU TiDB and GPU TiDB<\/div>\n<p>To keep I\/O time from masking the overall performance improvement, we used the <a href=\"https:\/\/docs.pingcap.com\/tidb\/stable\/coprocessor-cache\">Coprocessor Cache<\/a> to buffer all the intermediate results of the <code>Table Scan<\/code>s. This effectively makes TiDB a &#8220;hypothetical&#8221; in-memory database.<\/p>\n<p>The final results of these TPC-H queries are listed below. 
We achieved 12x, 32x, 27x, and 28x performance improvements for Q5, Q9, Q17, and Q18, respectively.<\/p>\n<img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"371\" class=\"wp-image-1449\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2021\/04\/tpc-h-benchmark-gpu-vs-cpu.png\" alt=\"TPC-H benchmark for GPU vs. CPU\" srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/tpc-h-benchmark-gpu-vs-cpu.png 600w, https:\/\/static.pingcap.com\/files\/2021\/04\/tpc-h-benchmark-gpu-vs-cpu-300x186.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/>\n<div class=\"caption-center\">TPC-H benchmark<\/div>\n<p>TPC-H queries are complex, so we also ran several simple hand-written queries as a microbenchmark. The first query is a typical &#8220;distinct count.&#8221; We count the distinct values of column <code>l_returnflag<\/code> in table <code>lineitem<\/code>:<\/p>\n<pre><code>SELECT COUNT(DISTINCT l_returnflag) FROM lineitem\n<\/code><\/pre>\n<p>The second query is a typical &#8220;big tables join.&#8221; We join the two largest tables, <code>lineitem<\/code> and <code>orders<\/code>, on their order key columns:<\/p>\n<pre><code>SELECT COUNT(1) FROM lineitem JOIN orders ON l_orderkey = o_orderkey\n<\/code><\/pre>\n<p>The final results of these two queries are displayed below. We achieved 67x and 148x performance improvements for &#8220;distinct count&#8221; and &#8220;big tables join,&#8221; respectively.<\/p>\n<img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"371\" class=\"wp-image-1450\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2021\/04\/simple-queries-benchmark-cpu-vs-gpu.png\" alt=\"Simple queries benchmark for CPU vs. 
GPU\" srcset=\"https:\/\/static.pingcap.com\/files\/2021\/04\/simple-queries-benchmark-cpu-vs-gpu.png 600w, https:\/\/static.pingcap.com\/files\/2021\/04\/simple-queries-benchmark-cpu-vs-gpu-300x186.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/>\n<div class=\"caption-center\">Simple queries benchmark<\/div>\n<p>The following sections discuss the insights we&#8217;ve gained from these performance improvements.<\/p>\n<h3>Hash table on the GPU<\/h3>\n<p>In many database operations, such as hash join and hash aggregation, the hash table is the most critical data structure. It dominates the execution of most CPU-intensive queries. The concurrent hash table in cuDF performs well on the GPU: it couples open addressing with the GPU&#8217;s highly parallel nature. It outperforms the CPU implementation of the hash table in legacy TiDB by a factor of several hundred. In fact, most of the performance improvement across our benchmarks comes from the GPU accelerating hash table operations. The GPU does not accelerate other database operations as much as it does the hash table, which is why the overall speedup is 10x to 150x rather than several hundred times.<\/p>\n<h3>Kernel\/memcpy efficiency<\/h3>\n<p>Kernel\/memcpy efficiency is the ratio of the time the GPU spends executing kernels (compute) to the time it spends copying memory. It is a major indicator of how well suited a workload is to GPU acceleration. On average, we see a 1:1 ratio for our queries, which is not very high, and the ratio goes down as the query emits more final results. For example, the ratio goes down when you join two big tables without an aggregation on top of the join, because that requires many more device-to-host memory copies.<\/p>\n<h3>GPU memory allocator<\/h3>\n<p><code>cudaMalloc<\/code> and <code>cudaFree<\/code> are the standard CUDA APIs to manage GPU memory. 
However, they have the side effect of forcing the GPU to synchronize the whole device by draining all the pending jobs; that is, the asynchronous memory copies and kernels that have been submitted to the GPU. This easily becomes a bottleneck in our pipeline model, because:<\/p>\n<ul>\n<li>The CPU cannot submit any subsequent GPU jobs until the GPU finishes the device synchronization required by <code>cudaMalloc<\/code> and <code>cudaFree<\/code>. Our pipeline model tends to have a long sequence of GPU jobs, so this synchronization reduces the potential parallelism between the CPU and the GPU.<\/li>\n<li>The GPU could otherwise overlap multiple jobs, but frequent synchronization eliminates those opportunities.<\/li>\n<li>As mentioned before, the pipeline accepts arbitrary CPU parallelism, so the number of synchronizations is multiplied by the number of CPU threads.<\/li>\n<\/ul>\n<p>Fortunately, cuDF&#8217;s memory manager, RMM, supplies an <a href=\"https:\/\/github.com\/rapidsai\/rmm\/blob\/main\/include\/rmm\/mr\/device\/arena_memory_resource.hpp\">arena allocator<\/a>, which reserves a huge amount of GPU memory in advance and manages the subsequent allocation\/deallocation internally. It is concurrency friendly as well. The allocator effectively addresses our problems and doubles the overall performance.<\/p>\n<h3>CUDA stream<\/h3>\n<p><a href=\"https:\/\/developer.nvidia.com\/blog\/gpu-pro-tip-cuda-7-streams-simplify-concurrency\/\">CUDA stream<\/a> is a way to achieve higher GPU utilization by overlapping the GPU jobs submitted by multiple independent CPU threads. We didn&#8217;t realize this was an issue until we used <a href=\"https:\/\/developer.nvidia.com\/nvidia-visual-profiler\">nvvp<\/a> (the NVIDIA Visual Profiler) to profile our queries and saw no overlap among different GPU jobs. 
All GPU jobs seemed to go into the default stream.<\/p>\n<p>After we enabled <a href=\"https:\/\/github.com\/rapidsai\/cudf\/blob\/850548d9b705b999c7010ded1e2d7573820228e6\/cpp\/CMakeLists.txt#L48\">per-thread default stream<\/a> at compile time, each CPU thread was assigned its own stream. This improves GPU utilization by 10%~20% in terms of both kernel concurrency and memcpy\/kernel overlap.<\/p>\n<h3>GPU utilization<\/h3>\n<p>After we cleared the memory management and CUDA stream obstacles, we were finally able to face GPU utilization directly. Honestly, it&#8217;s not a happy ending. After a lot of effort, we achieved:<\/p>\n<ul>\n<li>Less than 3 GB\/s host-to-device memory copy throughput, which is far less than the roughly 16 GB\/s bandwidth of a PCI Express (PCIe) 3.0 x16 link.<\/li>\n<li>Less than 10% of kernel concurrency and memcpy\/kernel overlap.<\/li>\n<\/ul>\n<p>The first issue is most likely because our data fragments are not big enough (they&#8217;re typically several tens of megabytes) to fully occupy the PCIe bandwidth.<\/p>\n<p>The reason for the second issue is still unclear. Given that we conquered the &#8220;obstacles&#8221; mentioned earlier, we assumed that our pipeline model could improve GPU utilization by increasing CPU parallelism: the more CPU threads submit GPU jobs concurrently, the more opportunities the GPU has to overlap them. However, performance peaks at two to three CPU threads. Neither kernel concurrency nor memcpy\/kernel overlap increases at higher CPU parallelism, and the overall performance even goes down.<\/p>\n<p>Is it because the hardware units are exhausted? Or are some other unidentified CUDA APIs causing unintended implicit overhead, the way <code>cudaMalloc<\/code> and <code>cudaFree<\/code> do? Or did we hit some kind of Cgo threading problem? To answer these questions, we need to do more fine-grained profiling. 
We hope there is still some room for further improvement.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Takeaways\"><\/span>Takeaways<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>By running on new generation hardware like a GPU, TiDB can process CPU-intensive analytical queries 10x~150x faster.<\/li>\n<li>We built a general GPU query runtime <a href=\"https:\/\/github.com\/zanmato1984\/cura\">CURA<\/a> on top of <a href=\"https:\/\/github.com\/rapidsai\/cudf\">cuDF<\/a>. It is adapted to <a href=\"https:\/\/github.com\/windtalker\/tidb\/tree\/tidb_cura\">TiDB<\/a> in this project, and it can be adapted to any other database (especially an in-memory database) to accelerate its query engine using the GPU.<\/li>\n<li>We could potentially improve GPU utilization even further. But that will have to wait until we have more time, and my friend returns my RTX 2080 after he finishes the game &#8220;Cyberpunk 2077.&#8221;<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>At TiDB Hackathon 2020, a team used GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing and achieved 10x~150x performance improvement.<\/p>","protected":false},"author":142,"featured_media":1452,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ub_ctt_via":"","footnotes":""},"categories":[6],"tags":[44,7],"class_list":["post-1445","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-engineering","tag-hackathon","tag-query-execution"],"acf":[],"featured_image_src":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","author_info":{"display_name":"Ruoxi Sun","author_link":"https:\/\/www.pingcap.com\/ko\/blog\/author\/ruoxi-sun\/"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Using a GPU Boosts TiDB Analytics 
Performance by 10 to 150 Times | TiDB<\/title>\n<meta name=\"description\" content=\"In this post, we&#039;ll share some technical details about using GPU boosts TiDB analytics and analyze our benchmark results.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times | TiDB\" \/>\n<meta property=\"og:description\" content=\"We used GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing and achieved 10-150x performance improvement.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\" \/>\n<meta property=\"og:site_name\" content=\"TiDB\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/facebook.com\/pingcap2015\" \/>\n<meta property=\"article:published_time\" content=\"2021-04-19T00:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-20T14:06:03+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"854\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Ruoxi Sun\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:description\" content=\"We used GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing and achieved 10-150x performance improvement.\" \/>\n<meta 
name=\"twitter:creator\" content=\"@PingCAP\" \/>\n<meta name=\"twitter:site\" content=\"@PingCAP\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ruoxi Sun\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\"},\"author\":{\"name\":\"Ruoxi Sun\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/1ad3366088a52749f9d60a68c862be13\"},\"headline\":\"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times\",\"datePublished\":\"2021-04-19T00:00:00+00:00\",\"dateModified\":\"2024-08-20T14:06:03+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\"},\"wordCount\":2187,\"publisher\":{\"@id\":\"https:\/\/www.pingcap.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\",\"keywords\":[\"Hackathon\",\"Query execution\"],\"articleSection\":[\"Engineering\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\",\"url\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\",\"name\":\"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times | 
TiDB\",\"isPartOf\":{\"@id\":\"https:\/\/www.pingcap.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\",\"datePublished\":\"2021-04-19T00:00:00+00:00\",\"dateModified\":\"2024-08-20T14:06:03+00:00\",\"description\":\"In this post, we'll share some technical details about using GPU boosts TiDB analytics and analyze our benchmark results.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage\",\"url\":\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\",\"width\":2560,\"height\":854},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.pingcap.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.pingcap.com\/#website\",\"url\":\"https:\/\/www.pingcap.com\/\",\"name\":\"TiDB\",\"description\":\"TiDB | SQL at 
Scale\",\"publisher\":{\"@id\":\"https:\/\/www.pingcap.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.pingcap.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.pingcap.com\/#organization\",\"name\":\"PingCAP\",\"url\":\"https:\/\/www.pingcap.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"width\":811,\"height\":232,\"caption\":\"PingCAP\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/facebook.com\/pingcap2015\",\"https:\/\/x.com\/PingCAP\",\"https:\/\/linkedin.com\/company\/pingcap\",\"https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/1ad3366088a52749f9d60a68c862be13\",\"name\":\"Ruoxi Sun\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg\",\"caption\":\"Ruoxi Sun\"},\"url\":\"https:\/\/www.pingcap.com\/ko\/blog\/author\/ruoxi-sun\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times | TiDB","description":"In this post, we'll share some technical details about using GPU boosts TiDB analytics and analyze our benchmark results.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/","og_locale":"ko_KR","og_type":"article","og_title":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times | TiDB","og_description":"We used GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing and achieved 10-150x performance improvement.","og_url":"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/","og_site_name":"TiDB","article_publisher":"https:\/\/facebook.com\/pingcap2015","article_published_time":"2021-04-19T00:00:00+00:00","article_modified_time":"2024-08-20T14:06:03+00:00","og_image":[{"width":2560,"height":854,"url":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","type":"image\/jpeg"}],"author":"Ruoxi Sun","twitter_card":"summary_large_image","twitter_description":"We used GPU acceleration techniques to improve the performance of CPU-intensive analytical query processing and achieved 10-150x performance improvement.","twitter_creator":"@PingCAP","twitter_site":"@PingCAP","twitter_misc":{"Written by":"Ruoxi Sun","Est. 
reading time":"12\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#article","isPartOf":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/"},"author":{"name":"Ruoxi Sun","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/1ad3366088a52749f9d60a68c862be13"},"headline":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times","datePublished":"2021-04-19T00:00:00+00:00","dateModified":"2024-08-20T14:06:03+00:00","mainEntityOfPage":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/"},"wordCount":2187,"publisher":{"@id":"https:\/\/www.pingcap.com\/#organization"},"image":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage"},"thumbnailUrl":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","keywords":["Hackathon","Query execution"],"articleSection":["Engineering"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/","url":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/","name":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times | TiDB","isPartOf":{"@id":"https:\/\/www.pingcap.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage"},"image":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage"},"thumbnailUrl":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","datePublished":"2021-04-19T00:00:00+00:00","dateModified":"2024-08-20T14:06:03+00:00","description":"In this post, we'll share 
some technical details about using GPU boosts TiDB analytics and analyze our benchmark results.","breadcrumb":{"@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#primaryimage","url":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","contentUrl":"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg","width":2560,"height":854},{"@type":"BreadcrumbList","@id":"https:\/\/www.pingcap.com\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pingcap.com\/"},{"@type":"ListItem","position":2,"name":"Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times"}]},{"@type":"WebSite","@id":"https:\/\/www.pingcap.com\/#website","url":"https:\/\/www.pingcap.com\/","name":"\ud2f0DB","description":"TiDB | SQL at 
Scale","publisher":{"@id":"https:\/\/www.pingcap.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pingcap.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/www.pingcap.com\/#organization","name":"PingCAP","url":"https:\/\/www.pingcap.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/","url":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","contentUrl":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","width":811,"height":232,"caption":"PingCAP"},"image":{"@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/facebook.com\/pingcap2015","https:\/\/x.com\/PingCAP","https:\/\/linkedin.com\/company\/pingcap","https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA"]},{"@type":"Person","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/1ad3366088a52749f9d60a68c862be13","name":"Ruoxi Sun","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/image\/","url":"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg","contentUrl":"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg","caption":"Ruoxi Sun"},"url":"https:\/\/www.pingcap.com\/ko\/blog\/author\/ruoxi-sun\/"}]}},"grav_blocks":false,"card_markup":"<a class=\"card-resource bg-white\" href=\"https:\/\/www.pingcap.com\/ko\/blog\/use-gpu-boosts-tidb-analytics-performance-by-10-to-150-times\/\"><div class=\"card-resource__image-container\"><img class=\"card-resource__image\" alt=\"gpu-vs-cpu-performance.jpg\" src=\"https:\/\/static.pingcap.com\/files\/2021\/04\/gpu-vs-cpu-performance-scaled.jpg\" loading=\"lazy\" width=2560 height=854 \/><\/div><div 
class=\"card-resource__content-container\"><div class=\"card-resource__content-head\"><div class=\"card-resource__category\">Engineering<\/div><\/div><h5 class=\"card-resource__title\">Using a GPU Boosts TiDB Analytics Performance by 10 to 150 Times<\/h5><\/div><\/a>","_links":{"self":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/1445","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/users\/142"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/comments?post=1445"}],"version-history":[{"count":6,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/1445\/revisions"}],"predecessor-version":[{"id":18965,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/1445\/revisions\/18965"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/media\/1452"}],"wp:attachment":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/media?parent=1445"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/categories?post=1445"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/tags?post=1445"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}