{"id":318,"date":"2020-07-01T00:00:00","date_gmt":"2020-07-01T00:00:00","guid":{"rendered":"https:\/\/en.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/"},"modified":"2024-08-21T05:25:01","modified_gmt":"2024-08-21T12:25:01","slug":"cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries","status":"publish","type":"post","link":"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/","title":{"rendered":"Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries"},"content":{"rendered":"<p><a href=\"https:\/\/www.pingcap.com\/ko\/tidb\/\">\ud2f0DB<\/a> is an open-source, distributed SQL database that supports <a href=\"https:\/\/en.wikipedia.org\/wiki\/HTAP\">Hybrid Transactional\/Analytical Processing<\/a> (HTAP) workloads. Ideally, a TiDB cluster should always be efficient and problem free. It should be stable, load-balanced, and have a reliable rate of queries per second (QPS). There shouldn&#8217;t be any jitters (either in the cluster or on disk), and no hotspots, slow queries, or network fluctuations.<\/p>\n\n\n\n<p>However, the reality is often unsatisfactory. For external reasons, application traffic may surge and increase the pressure on the cluster. Through a chain reaction of events, the CPU load maxes out, out of memory errors occur, network latency increases, and disk writes and reads slow down.<\/p>\n\n\n\n<p>Before TiDB 4.0, when these problems occurred in the cluster, there was no uniform way to locate them. We had to use various external tools to find problems in the cluster. It was tedious and time-consuming.<\/p>\n\n\n\n<p>Now, TiDB 4.0 introduces a new feature, cluster diagnostics, a built-in widget in TiDB Dashboard, which lets you diagnose cluster problems within a specified time range and summarize the diagnostic results and cluster-related load monitoring information in a diagnostic report.<\/p>\n\n\n\n<p>Our <a href=\"https:\/\/www.pingcap.com\/ko\/blog\/easier-troubleshooting-for-distributed-databases\/#cluster-diagnostics\">previous post<\/a> gave you a quick peek at this feature. In this post, we&#8217;ll elaborate on cluster diagnostics&#8217;s diagnostic reports, and show you examples of how cluster diagnostics can help you quickly find system problems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"How_cluster_diagnostics_makes_your_job_easier\"><\/span>How cluster diagnostics makes your job easier<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>It&#8217;s important to provide a consistent user experience and reduce the learning curve. There is a lot of information that is relevant to troubleshooting, including cluster static information and cluster runtime information. We reorganize this information across the entire cluster to make sure that you can access it using SQL queries, without the need for external tools. At the same time, you can extract common SQL statements as scripts and write different troubleshooting scripts for different application scenarios.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Querying_logs_with_cluster_diagnostics\"><\/span>Querying logs with cluster diagnostics<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>In a TiDB cluster, a single transaction might involve multiple instances of TiDB&#8217;s storage engine, <a href=\"https:\/\/docs.pingcap.com\/tidb\/stable\/tidb-architecture#tikv-server\">TiKV<\/a>. Before TiDB 4.0, if you wanted to view logs related to a specific transaction ID (<code>txn_id<\/code>), you might need to log in to all nodes and view the logs using the grep command. But in TiDB 4.0 and later, we offer cluster log tables. You can view all relevant logs with only one SQL statement. For example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT * FROM information_schema.cluster_log where message like \"%{txn_id}%\" and time &gt; '2020-03-27 15:39:00' and time &lt; '2020-03-27 15:50:00'\n<\/code><\/pre>\n\n\n\n<p>Similarly, splitting and merging the basic unit of TiKV storage (the <a href=\"https:\/\/docs.pingcap.com\/tidb\/v4.0\/glossary#regionpeerraft-group\">Region<\/a>) and the Region <a href=\"https:\/\/docs.pingcap.com\/tidb\/v4.0\/glossary#leaderfollowerlearner\">leader<\/a> switch usually involve multiple TiKV nodes. You can quickly view all activities in a life cycle of a Region through the log table.<\/p>\n\n\n\n<p>When you query logs, the SQL statement&#8217;s predicates are pushed down to each log node for filtering, and no program collects all the logs. So the overhead is controllable and is lower than using the grep command. You can do what a distributed grep command can do but with less overhead.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Cluster_diagnostic_reports\"><\/span>Cluster diagnostic reports<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>In TiDB 4.0, if you want to diagnose or inspect the cluster within a time range, or check the load of the cluster, you can generate a diagnostic report for a period of time in TiDB Dashboard. The diagnostic report contains the diagnostic results during this period and monitoring and configuration information for each component in the system.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The instance CPU usage report<\/h3>\n\n\n\n<p>The instance CPU usage report lets you view the average (AVG), maximum (MAX), and minimum (MIN) CPU usage for TiDB, Placement Driver (PD), and TiKV instances. You can use this report to quickly judge whether the cluster&#8217;s load is balanced or if it has hotspots.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"311\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/instance-cpu-usage.jpg\" alt=\"The instance CPU usage report\" class=\"wp-image-319\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage-300x47.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage-1024x159.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage-768x119.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage-1536x239.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/instance-cpu-usage-1440x224.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">The instance CPU usage report<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">The monitoring execution time report<\/h3>\n\n\n\n<p>The monitoring execution time report presents the monitoring time for each component in the cluster and what percentage it is of the total execution time for all queries. You can use this report to quickly determine whether a component&#8217;s execution time is too long and whether there is a bottleneck.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"765\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/monitoring-execution-time.jpg\" alt=\"The monitoring execution time report\" class=\"wp-image-320\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time-300x115.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time-1024x392.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time-768x294.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time-1536x588.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/monitoring-execution-time-1440x551.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">The monitoring execution time report<\/div>\n\n\n\n<p>The description for each field is as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>METRIC_NAME: The name for the monitoring item.<\/li>\n\n\n\n<li>LABEL: The monitoring label information. To get more detailed monitoring information for each label in this monitoring, click <strong>expand<\/strong>.<\/li>\n\n\n\n<li>TIME_RATIO: The ratio of the total execution time for this monitoring item to the total execution time for the monitoring item whose TIME_RATIO is 1. For example, the ratio of the total execution time for <code>tidb_kv_request<\/code> to the total execution time for <code>tidb_query<\/code> is: 151709.3\/174648.19=0.87.<\/li>\n\n\n\n<li>TOTAL_TIME: The total execution time, in seconds, for the monitoring item.<\/li>\n\n\n\n<li>TOTAL_COUNT: The number of times the monitoring item ran.<\/li>\n\n\n\n<li>P999: The execution time, in seconds, that 99.9% of the monitoring samples fall below.<\/li>\n\n\n\n<li>P99: The execution time, in seconds, that 99% of monitoring samples fall below.<\/li>\n\n\n\n<li>P90: The execution time, in seconds, that 90% of monitoring samples fall below.<\/li>\n\n\n\n<li>P80: The execution time, in seconds, that 80% of monitoring samples fall below.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"How_to_generate_diagnostic_reports\"><\/span>How to generate diagnostic reports<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>To generate a diagnostic report:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Choose a time range and click <strong>Start<\/strong>:<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"666\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/choose-time-range.jpg\" alt=\"Choose a time range\" class=\"wp-image-321\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range-300x100.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range-1024x341.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range-768x256.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range-1536x511.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/choose-time-range-1440x480.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">Choose a time range<\/div>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>After a report is generated, click <strong>View Full Report<\/strong>.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"574\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/view-full-report.jpg\" alt=\"View full report\" class=\"wp-image-322\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report-300x86.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report-1024x294.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report-768x220.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report-1536x441.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/view-full-report-1440x413.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">View the full report<\/div>\n\n\n\n<p>If the system has any abnormal behavior, such as QPS jitters or higher latency, you can generate a comparison report for that time range and for a time range with normal activities. You can easily compare the differences between system metrics of the two time periods, so you can quickly find the problem:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"612\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/generate-comparison-report.jpg\" alt=\"Generate a comparison report\" class=\"wp-image-323\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report-300x92.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report-1024x313.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report-768x235.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report-1536x470.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/generate-comparison-report-1440x441.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">Generate a comparison report<\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Cases_for_automatic_diagnostics_and_time_range_comparison\"><\/span>Cases for automatic diagnostics and time range comparison<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Here are two cases that show how cluster diagnostics helped us quickly find system problems.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Automatic diagnostics and system inspection<\/h3>\n\n\n\n<p>Cluster diagnostics automatically diagnoses system faults and potential problems in the current cluster. This feature lets you inspect the cluster as a whole and analyze the system for bottlenecks. You don&#8217;t have to check the monitoring information for nodes one by one. Diagnostic results are output to the <code>information_schema.inspection_result<\/code> system table. When you query this table, you trigger diagnostics. When you encounter a problem, you can first query this table to find the cause.<\/p>\n\n\n\n<p>The automatic diagnostics is based on a series of built-in diagnostic rules, and it gives diagnostic results by querying the cluster information.<\/p>\n\n\n\n<p>Currently, diagnostic rules are as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>config<\/code>: Checks configuration consistency. It checks whether the same type of components have the same configurations and whether configuration values are set properly.<\/li>\n\n\n\n<li><code>version<\/code>: Checks version consistency. It checks whether versions of the same type of components are consistent.<\/li>\n\n\n\n<li><code>node-load<\/code>: Checks server node load. It checks whether CPU, memory, and disk usage are too high.<\/li>\n\n\n\n<li><code>critical-error<\/code>: Checks critical errors in the system, such as <code>server is busy<\/code>, component restarting, and failures to write the binlog.<\/li>\n\n\n\n<li><code>threshold-check<\/code>: Checks whether some monitoring metrics exceed threshold values, for example:\n<ul class=\"wp-block-list\">\n<li>Whether the CPU usage of a thread in TiKV&#8217;s components exceeds threshold values<\/li>\n\n\n\n<li>Whether <a href=\"https:\/\/docs.pingcap.com\/tidb\/v4.0\/glossary#leaderfollowerlearner\">leaders<\/a> \uadf8\ub9ac\uace0 <a href=\"https:\/\/docs.pingcap.com\/tidb\/v4.0\/glossary#regionpeerraft-group\">Regions<\/a> (the basic data storage unit in TiKV) are balanced among TiKV instances<\/li>\n\n\n\n<li>Whether a TiKV instance has too many Regions and whether a single instance has more than 20,000 Regions<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>We&#8217;re still improving the rules cluster diagnostics uses. If you&#8217;re interested in adding rules, you can file a pull request <a href=\"https:\/\/github.com\/pingcap\/tidb\/\">on GitHub<\/a>.<\/p>\n\n\n\n<p>Let&#8217;s look at an example. In the following monitoring interface, the query&#8217;s 999th percentile latency and queries per second (QPS) suddenly jittered and then immediately returned to normal. What caused this phenomenon?<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"599\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/qps-jitter.jpg\" alt=\"P999 latency and QPS jittered\" class=\"wp-image-324\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter-300x90.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter-1024x307.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter-768x230.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter-1536x460.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/qps-jitter-1440x431.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">The 999th percentile latency and QPS suddenly jittered<\/div>\n\n\n\n<p>To diagnose the cluster problem in a specified time range, we used the <code>time_range<\/code> SQL hint to specify it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mysql&gt;select \/*+ time_range(\"2020-03-30 23:45:00\", \"2020-03-30 23:50:00\") *\/ * from inspection_result;\n+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+\n| RULE           | ITEM        | TYPE | INSTANCE          | VALUE | REFERENCE | SEVERITY | DETAILS                                                            |\n+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+\n| critical-error | server-down | tikv | 172.16.5.40:23151 |       |           | critical | tikv 172.16.5.40:22151 restarted at time '2020\/03\/30 23:48:38.763' |\n+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+\n<\/code><\/pre>\n\n\n\n<p>The above result showed that the <code>172.16.5.40:23151<\/code> TiKV instance restarted at 23:48:38.763 on March 30, 2020. This time roughly matched when the QPS jittered. Based on this finding, we could investigate the issue further.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparing two time ranges to quickly find system problems<\/h3>\n\n\n\n<p>Automatic diagnostics does not always pinpoint the problem. More often than not, you need to find the problem based on monitoring and other information. However, a TiDB cluster has many monitoring metrics. To quickly identify abnormal monitoring items, you can compare two time ranges.<\/p>\n\n\n\n<p>Let&#8217;s see another example. In the monitoring interface below, the query&#8217;s 999th percentile latency and QPS suddenly dropped at 17:18:00 on March 3, 2020 and, after a few minutes, returned to normal status. Why?<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"571\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/latency-and-qps-dropped.jpg\" alt=\"P999 latency and QPS dropped\" class=\"wp-image-325\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped-300x86.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped-1024x292.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped-768x219.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped-1536x439.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/latency-and-qps-dropped-1440x411.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">The query&#8217;s 999th percentile latency and QPS suddenly dropped<\/div>\n\n\n\n<p>In TiDB 4.0, the <code>metrics_summary<\/code> monitoring system table contains TiDB, PD, and TiKV instances and some monitoring items of the server node. We can use this table to compare all system monitorings when the system is normal against when the system has a problem to quickly find the monitoring item with the biggest difference during the two time ranges:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The time range t1 with a normal cluster status: (&#8220;2020-03-03 17:08:00&#8221;, &#8220;2020-03-03 17:11:00&#8221;)<\/li>\n\n\n\n<li>The time range t2 with an abnormal cluster status: (&#8220;2020-03-03 17:18:00&#8221;, &#8220;2020-03-03 17:21:00&#8221;)<\/li>\n<\/ul>\n\n\n\n<p>We used the following SQL statement to join the monitorings of the two time periods based on the monitoring metric name and sorted them according to the gap value. Note that <code>TIME_RANGE<\/code> is a hint that specifies the query time:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT GREATEST(t1.avg_value,t2.avg_value)\/LEAST(t1.avg_value,\n         t2.avg_value) AS ratio,\n         t1.metrics_name,\n         t1.avg_value as t1_avg_value,\n         t2.avg_value as t2_avg_value,\n         t2.comment\nFROM\n    (SELECT \/*+ time_range(\"2020-03-03 17:08:00\", \"2020-03-03 17:11:00\")*\/ *\n    FROM information_schema.metrics_summary ) t1\nJOIN\n    (SELECT \/*+ time_range(\"2020-03-03 17:18:00\", \"2020-03-03 17:21:00\")*\/ *\n    FROM information_schema.metrics_summary ) t2\n    ON t1.metrics_name = t2.metrics_name\nORDER BY  ratio DESC limit 10;\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"2000\" height=\"295\" src=\"https:\/\/en.pingcap.com\/wp-content\/uploads\/2020\/07\/query-results.jpg\" alt=\"Query results\" class=\"wp-image-326\" srcset=\"https:\/\/static.pingcap.com\/files\/2020\/07\/query-results.jpg 2000w, https:\/\/static.pingcap.com\/files\/2020\/07\/query-results-300x44.jpg 300w, https:\/\/static.pingcap.com\/files\/2020\/07\/query-results-1024x151.jpg 1024w, https:\/\/static.pingcap.com\/files\/2020\/07\/query-results-768x113.jpg 768w, https:\/\/static.pingcap.com\/files\/2020\/07\/query-results-1536x227.jpg 1536w, https:\/\/static.pingcap.com\/files\/2020\/07\/query-results-1440x212.jpg 1440w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><\/figure>\n\n\n\n<div class=\"caption-center\">Query results<\/div>\n\n\n\n<p>The query results above showed that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <code>tidb_slow_query_cop_process_total_time<\/code> (<code>cop process<\/code>s&#8217; execution time in TiDB slow queries) in t2 was about 5,866 times that in t1.<\/li>\n\n\n\n<li>The <code>tidb_distsql_partial_scan_key_total_num<\/code> (the number of keys scanned by TiDB&#8217;s <code>distsql<\/code> requests) in t2 was about 3,648 times that in t1.<\/li>\n\n\n\n<li>The <code>tidb_slow_query_cop_wait_total_time<\/code> (<code>cop<\/code> requests&#8217; waiting time in TiDB slow queries) in t2 was about 267 times that in t1.<\/li>\n\n\n\n<li>The <code>tikv_cop_total_response_size<\/code> (the size of the result returned by TiKV&#8217;s <code>cop<\/code> requests) in t2 was about 192 times that in t1.<\/li>\n\n\n\n<li>The <code>tikv_cop_scan_details<\/code> (the number of scans of TiKV&#8217;s <code>cop<\/code> requests) in t2 was about 105 times that in t1.<\/li>\n<\/ul>\n\n\n\n<p>According to the statistics above, we knew that t2 had more <code>cop<\/code> requests than t1. As a result, TiKV Coprocessor was overloaded and <code>cop task<\/code> waited. The reason might be that there were a lot of large queries in t2. We could use <code>CLUSTER_SLOW_QUERY<\/code> to query whether there were unexpected queries during the t2 time period.<\/p>\n\n\n\n<p>At last, we found the root of the problem. During t1 and t2, we were running the <code>go-ycsb<\/code> stress test, and 20 <code>tpch<\/code> queries were running during t2. Therefore, <code>tpch<\/code> large queries led to too many <code>cop<\/code> requests.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Summary\"><\/span>Summary<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>The cluster diagnostics feature simplifies and streamlines TiDB cluster monitoring. Say goodbye to multiple tools and multiple learning curves. Further, you can monitor and analyze the entire cluster at the same time. No more checking each node individually.<\/p>\n\n\n\n<p>Cluster diagnostics is a work in progress. We&#8217;re still adding some diagnostic rules, and we look forward to feedback and advice from the open-source community. If you&#8217;d like to help us, join our <a href=\"https:\/\/slack.tidb.io\/invite?team=tidb-community&amp;channel=everyone&amp;ref=pingcap-blog\">community on Slack<\/a> and share your idea with us.<\/p>","protected":false},"excerpt":{"rendered":"<p>TiDB 4.0 introduces cluster diagnostics, a built-in widget in TiDB Dashboard, which lets you diagnose cluster problems within a specified time range. You can summarize the diagnostic results and cluster-related load monitoring information in a diagnostic report.<\/p>","protected":false},"author":39,"featured_media":328,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ub_ctt_via":"","footnotes":""},"categories":[13],"tags":[49],"class_list":["post-318","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-product","tag-troubleshooting"],"acf":[],"featured_image_src":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","author_info":{"display_name":"Heng Long","author_link":"https:\/\/www.pingcap.com\/ko\/blog\/author\/heng-long\/"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Cluster Diagnostics: Troubleshoot Using Only SQL Queries<\/title>\n<meta name=\"description\" content=\"In this post, we&#039;ll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Cluster Diagnostics: Troubleshoot Using Only SQL Queries\" \/>\n<meta property=\"og:description\" content=\"In this post, we&#039;ll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\" \/>\n<meta property=\"og:site_name\" content=\"TiDB\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/facebook.com\/pingcap2015\" \/>\n<meta property=\"article:published_time\" content=\"2020-07-01T00:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-21T12:25:01+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2000\" \/>\n\t<meta property=\"og:image:height\" content=\"667\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Heng Long\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@PingCAP\" \/>\n<meta name=\"twitter:site\" content=\"@PingCAP\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Heng Long\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\"},\"author\":{\"name\":\"Heng Long\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/ac272bb3ba37f7fabea5fd2fa142db1b\"},\"headline\":\"Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries\",\"datePublished\":\"2020-07-01T00:00:00+00:00\",\"dateModified\":\"2024-08-21T12:25:01+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\"},\"wordCount\":1748,\"publisher\":{\"@id\":\"https:\/\/www.pingcap.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\",\"keywords\":[\"Troubleshooting\"],\"articleSection\":[\"Product\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\",\"url\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\",\"name\":\"Cluster Diagnostics: Troubleshoot Using Only SQL Queries\",\"isPartOf\":{\"@id\":\"https:\/\/www.pingcap.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\",\"datePublished\":\"2020-07-01T00:00:00+00:00\",\"dateModified\":\"2024-08-21T12:25:01+00:00\",\"description\":\"In this post, we'll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage\",\"url\":\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\",\"width\":2000,\"height\":667},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.pingcap.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.pingcap.com\/#website\",\"url\":\"https:\/\/www.pingcap.com\/\",\"name\":\"TiDB\",\"description\":\"TiDB | SQL at Scale\",\"publisher\":{\"@id\":\"https:\/\/www.pingcap.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.pingcap.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.pingcap.com\/#organization\",\"name\":\"PingCAP\",\"url\":\"https:\/\/www.pingcap.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png\",\"width\":811,\"height\":232,\"caption\":\"PingCAP\"},\"image\":{\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/facebook.com\/pingcap2015\",\"https:\/\/x.com\/PingCAP\",\"https:\/\/linkedin.com\/company\/pingcap\",\"https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/ac272bb3ba37f7fabea5fd2fa142db1b\",\"name\":\"Heng Long\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/www.pingcap.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg\",\"contentUrl\":\"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg\",\"caption\":\"Heng Long\"},\"description\":\"Software Engineer\",\"url\":\"https:\/\/www.pingcap.com\/ko\/blog\/author\/heng-long\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Cluster Diagnostics: Troubleshoot Using Only SQL Queries","description":"In this post, we'll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/","og_locale":"ko_KR","og_type":"article","og_title":"Cluster Diagnostics: Troubleshoot Using Only SQL Queries","og_description":"In this post, we'll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.","og_url":"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/","og_site_name":"TiDB","article_publisher":"https:\/\/facebook.com\/pingcap2015","article_published_time":"2020-07-01T00:00:00+00:00","article_modified_time":"2024-08-21T12:25:01+00:00","og_image":[{"width":2000,"height":667,"url":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","type":"image\/jpeg"}],"author":"Heng Long","twitter_card":"summary_large_image","twitter_creator":"@PingCAP","twitter_site":"@PingCAP","twitter_misc":{"Written by":"Heng Long","Est. reading time":"10\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#article","isPartOf":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/"},"author":{"name":"Heng Long","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/ac272bb3ba37f7fabea5fd2fa142db1b"},"headline":"Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries","datePublished":"2020-07-01T00:00:00+00:00","dateModified":"2024-08-21T12:25:01+00:00","mainEntityOfPage":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/"},"wordCount":1748,"publisher":{"@id":"https:\/\/www.pingcap.com\/#organization"},"image":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage"},"thumbnailUrl":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","keywords":["Troubleshooting"],"articleSection":["Product"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/","url":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/","name":"Cluster Diagnostics: Troubleshoot Using Only SQL Queries","isPartOf":{"@id":"https:\/\/www.pingcap.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage"},"image":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage"},"thumbnailUrl":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","datePublished":"2020-07-01T00:00:00+00:00","dateModified":"2024-08-21T12:25:01+00:00","description":"In this post, we'll elaborate on cluster diagnostic reports, and show you how these reports can help you quickly find system problems.","breadcrumb":{"@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#primaryimage","url":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","contentUrl":"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg","width":2000,"height":667},{"@type":"BreadcrumbList","@id":"https:\/\/www.pingcap.com\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pingcap.com\/"},{"@type":"ListItem","position":2,"name":"Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries"}]},{"@type":"WebSite","@id":"https:\/\/www.pingcap.com\/#website","url":"https:\/\/www.pingcap.com\/","name":"\ud2f0DB","description":"TiDB | SQL at Scale","publisher":{"@id":"https:\/\/www.pingcap.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pingcap.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/www.pingcap.com\/#organization","name":"PingCAP","url":"https:\/\/www.pingcap.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/","url":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","contentUrl":"https:\/\/static.pingcap.com\/files\/2021\/11\/pingcap-logo.png","width":811,"height":232,"caption":"PingCAP"},"image":{"@id":"https:\/\/www.pingcap.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/facebook.com\/pingcap2015","https:\/\/x.com\/PingCAP","https:\/\/linkedin.com\/company\/pingcap","https:\/\/youtube.com\/channel\/UCuq4puT32DzHKT5rU1IZpIA"]},{"@type":"Person","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/ac272bb3ba37f7fabea5fd2fa142db1b","name":"Heng Long","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/www.pingcap.com\/#\/schema\/person\/image\/","url":"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg","contentUrl":"https:\/\/static.pingcap.com\/files\/2022\/10\/17234942\/avatar.jpg","caption":"Heng Long"},"description":"Software Engineer","url":"https:\/\/www.pingcap.com\/ko\/blog\/author\/heng-long\/"}]}},"grav_blocks":false,"card_markup":"<a class=\"card-resource bg-white\" href=\"https:\/\/www.pingcap.com\/ko\/blog\/cluster-diagnostics-troubleshoot-cluster-issues-using-only-sql-queries\/\"><div class=\"card-resource__image-container\"><img class=\"card-resource__image\" alt=\"troubleshoot-distributed-database-issues.jpg\" src=\"https:\/\/static.pingcap.com\/files\/2020\/07\/troubleshoot-distributed-database-issues.jpg\" loading=\"lazy\" width=2000 height=667 \/><\/div><div class=\"card-resource__content-container\"><div class=\"card-resource__content-head\"><div class=\"card-resource__category\">Product<\/div><\/div><h5 class=\"card-resource__title\">Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries<\/h5><\/div><\/a>","_links":{"self":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/318","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/comments?post=318"}],"version-history":[{"count":5,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/318\/revisions"}],"predecessor-version":[{"id":16915,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/posts\/318\/revisions\/16915"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/media\/328"}],"wp:attachment":[{"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/media?parent=318"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/categories?post=318"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pingcap.com\/ko\/wp-json\/wp\/v2\/tags?post=318"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}