Troubleshoot Increased Read and Write Latency
This document describes the possible causes of increased read and write latency and latency jitter, and how to troubleshoot these issues.
Common causes
Incorrect TiDB execution plan
The execution plan of a query is unstable and might select an incorrect index, which causes higher latency.
Phenomenon
- If the query execution plan is output in the slow log, you can view the plan directly. Execute the `select tidb_decode_plan('xxx...')` statement to parse the detailed execution plan.
- The number of scanned keys in the monitor increases abnormally; in the slow log, the number of `Scan Keys` is large.
- The SQL execution duration in TiDB is greatly different from that in other databases such as MySQL. You can compare the execution plan with that of other databases (for example, whether the `Join Order` is different).
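As a hedged convenience, you can also pull the plan and scanned-key information from the slow query system table instead of the log file. The following is a minimal sketch that assumes the documented `INFORMATION_SCHEMA.SLOW_QUERY` columns; the filter and limit are placeholders:

```sql
-- A sketch: list the slowest user queries recorded in the slow log, together
-- with their scanned-key counts and execution plans.
SELECT query_time, total_keys, process_keys, plan, query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE is_internal = 0
ORDER BY query_time DESC
LIMIT 10;
```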
Possible reason
The statistics are inaccurate.
Troubleshooting methods
- Update the statistical information:
    - Execute `analyze table` manually, and run `analyze` periodically with the `crontab` command to keep the statistics accurate.
    - Use `auto analyze` to collect statistics automatically: lower the threshold value of `analyze ratio`, increase the frequency of collection, and set the start and end time of the execution. See the following examples:
        - `set global tidb_auto_analyze_ratio=0.2;`
        - `set global tidb_auto_analyze_start_time='00:00 +0800';`
        - `set global tidb_auto_analyze_end_time='06:00 +0800';`
- Bind the execution plan (see the sketch after this list):
    - Modify the application SQL statements and add `use index` to consistently use the index on the column.
    - In v3.0, you do not need to modify the application SQL statements. Instead, use `create global binding` to create a `force index` binding for the statement.
    - In v4.0, SQL Plan Management is supported, which avoids the performance decrease caused by unstable execution plans.
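The following is a minimal sketch that combines both approaches; the table `t`, column `a`, and index `idx_a` are hypothetical placeholders:

```sql
-- Refresh statistics manually for a hypothetical table `t`.
ANALYZE TABLE t;

-- Tune auto analyze as described above.
SET GLOBAL tidb_auto_analyze_ratio = 0.2;
SET GLOBAL tidb_auto_analyze_start_time = '00:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time = '06:00 +0800';

-- Bind a stable plan (v3.0 and later) so that queries on `t` keep using the
-- hypothetical index `idx_a` without changing the application SQL.
CREATE GLOBAL BINDING FOR
    SELECT * FROM t WHERE a = 1
USING
    SELECT * FROM t USE INDEX (idx_a) WHERE a = 1;
```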
PD anomalies
Phenomenon
There is an abnormal increase in the `wait duration` metric of the PD TSO. This metric represents the duration of waiting for PD to return requests.
Possible reasons
- Disk issues. The disk where the PD node is located has full I/O load. Investigate whether PD is deployed with other components that have high I/O demand, and check the health of the disk. You can verify the cause by viewing the monitor metrics in Grafana -> disk performance -> latency/load. You can also use the FIO tool to run a check on the disk if necessary.
- Network issues between PD peers. The PD log shows `lost the TCP streaming connection`. You need to check whether there is a problem with the network between PD nodes and verify the cause by viewing `round trip` in the monitor Grafana -> PD -> etcd.
- High server load. The log shows `server is likely overloaded`.
- PD cannot elect a Leader. The PD log shows `lease is not expired`. This issue has been fixed in v3.0.x and v2.1.19.
- The leader election is slow. The Region loading duration is long. You can check this issue by running `grep "regions cost"` in the PD log. If the result is in seconds, such as `load 460927 regions cost 11.77099s`, the Region loading is slow. You can enable the `region storage` feature in v3.0 by setting `use-region-storage` to `true`, which significantly reduces the Region loading duration.
- Network issues between TiDB and PD. Check whether the network from TiDB to the PD Leader is running normally by accessing the monitor Grafana -> blackbox_exporter -> ping latency.
- PD reports a `FATAL` error, and the log shows `range failed to find revision pair`. This issue has been fixed in v3.0.8 (#2040).
- When the `/api/v1/regions` interface is used, too many Regions might cause PD OOM. This issue has been fixed in v3.0.8 (#1986).
- PD OOM during the rolling upgrade. The size of gRPC messages is not limited, and the monitor shows that `TCP InSegs` is relatively large. This issue has been fixed in v3.0.6 (#1952).
- PD panics. Report a bug.
- Other causes. Get the goroutine stack by running `curl http://127.0.0.1:2379/debug/pprof/goroutine?debug=2` and report a bug.
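As a hedged convenience, you can also search the PD logs from SQL through `INFORMATION_SCHEMA.CLUSTER_LOG` instead of logging in to the PD node. The sketch below assumes the log search feature is available; the time range and the message pattern are placeholders:

```sql
-- A sketch: search PD logs for slow Region loading. CLUSTER_LOG requires a
-- time range predicate, and the values here are placeholders.
SELECT time, instance, level, message
FROM INFORMATION_SCHEMA.CLUSTER_LOG
WHERE type = 'pd'
  AND time BETWEEN '2022-08-01 00:00:00' AND '2022-08-01 01:00:00'
  AND message LIKE '%regions cost%';
```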
TiKV anomalies
Phenomenon
The `KV Cmd Duration` metric in the monitor increases abnormally. This metric represents the duration between the time that TiDB sends a request to TiKV and the time that TiDB receives the response.
Possible reasons
- Check the `gRPC duration` metric. This metric represents the total duration of a gRPC request in TiKV. You can find out a potential network issue by comparing the `gRPC duration` of TiKV with the `KV duration` of TiDB. For example, if the gRPC duration is short but the KV duration of TiDB is long, the network latency between TiDB and TiKV might be high, or the NIC bandwidth between TiDB and TiKV might be fully occupied.
- Re-election because TiKV is restarted:
    - After TiKV panics, it is pulled up by `systemd` and runs normally. You can check whether a panic has occurred by viewing the TiKV log. Because this issue is unexpected, report a bug if it happens.
    - TiKV is stopped or killed by a third party and then pulled up by `systemd`. Check the cause by viewing `dmesg` and the TiKV log.
    - TiKV is OOM, which causes a restart.
    - TiKV hangs because of dynamically adjusting `THP` (Transparent Hugepage).
- Check the monitor: TiKV RocksDB encounters a write stall, which results in re-election. You can check whether the monitor Grafana -> TiKV-details -> errors shows `server is busy`.
- Re-election because of network isolation.
- If the `block-cache` configuration is too large, it might cause TiKV OOM. To verify the cause, check the `block cache size` of RocksDB by selecting the corresponding instance in the monitor Grafana -> TiKV-details. Meanwhile, check whether the `[storage.block-cache] capacity = # "1GB"` parameter is set properly. By default, TiKV's `block-cache` is set to `45%` of the total memory of the machine. You need to explicitly specify this parameter when you deploy TiKV in a container, because TiKV obtains the memory of the physical machine, which might exceed the memory limit of the container (a way to check the effective value is sketched after this list).
- The Coprocessor receives many large queries and returns a large volume of data. gRPC fails to send data as quickly as the Coprocessor returns it, which results in OOM. To verify the cause, check whether the `response size` exceeds the `network outbound` traffic by viewing the monitor Grafana -> TiKV-details -> coprocessor overview.
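For the `block-cache` check above, a minimal sketch of verifying the effective capacity from SQL; the `LIKE` pattern assumes the `storage.block-cache.capacity` key name:

```sql
-- A sketch: list the block cache capacity configured on each TiKV instance.
SHOW CONFIG WHERE type = 'tikv' AND name LIKE '%block-cache.capacity%';
```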
Bottleneck of a single TiKV thread
There are some single threads in TiKV that might become the bottleneck.
- Too many Regions in a TiKV instance cause a single gRPC thread to become the bottleneck (check the Grafana -> TiKV-details -> Thread CPU/gRPC CPU Per Thread metric). In v3.x or later versions, you can enable `Hibernate Region` to resolve the issue (a way to check the Region distribution per instance is sketched after this list).
- For versions earlier than v3.0, when the raftstore thread or the apply thread becomes the bottleneck (the Grafana -> TiKV-details -> Thread CPU/raft store CPU and Async apply CPU metrics exceed `80%`), you can scale out TiKV (v2.x) instances or upgrade to v3.x with multi-threading.
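A hedged way to check the Region distribution mentioned in the first item, assuming the documented `INFORMATION_SCHEMA.TIKV_STORE_STATUS` columns:

```sql
-- A sketch: list the Region and leader counts per TiKV store to spot
-- instances that carry disproportionately many Regions.
SELECT store_id, address, leader_count, region_count
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS
ORDER BY region_count DESC;
```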
CPU load increases
Phenomenon
The usage of CPU resources becomes the bottleneck.
Possible reasons
- Hotspot issue
- High overall load. Check the slow queries and expensive queries of TiDB. Optimize the running queries by adding indexes or executing queries in batches. Another solution is to scale out the cluster. One way to locate the most expensive statements is sketched below.
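A hedged sketch of locating such statements through the statement summary tables, assuming they are enabled (the default in recent versions) and the documented column names:

```sql
-- A sketch: find the statements with the highest cumulative latency across
-- the cluster.
SELECT digest_text, exec_count, sum_latency, avg_latency
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY
ORDER BY sum_latency DESC
LIMIT 10;
```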
Other causes
Cluster maintenance
Most online clusters have three or five nodes. If the machine to be maintained hosts the PD component, you need to determine whether the node is the leader or a follower. Disabling a follower has no impact on cluster operation. Before disabling the leader, you need to transfer the leadership. During the leadership transfer, a performance jitter of about 3 seconds occurs.
Minority of replicas are offline
By default, each TiDB cluster has three replicas, so each Region has three replicas in the cluster. These Regions elect leaders and replicate data through the Raft protocol. The Raft protocol ensures that TiDB can still provide services without data loss even when a minority of nodes (fewer than half of the replicas) fail or are isolated. For a cluster with three replicas, the failure of one node might cause performance jitter, but in theory, availability and correctness are not affected.
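If you suspect that some replicas are offline, a hedged check from SQL is to look for peers that are not in the normal state, assuming the documented `INFORMATION_SCHEMA.TIKV_REGION_PEERS` columns and status values:

```sql
-- A sketch: count Region peers reported as DOWN or PENDING, grouped by store.
SELECT store_id, status, COUNT(*) AS peer_count
FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS
WHERE status != 'NORMAL'
GROUP BY store_id, status;
```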
New indexes
Creating indexes consumes a huge amount of resources when TiDB scans tables and backfills indexes. Index creation might even conflict with frequently updated columns, which affects the application. Creating indexes on a large table often takes a long time, so you need to balance the index creation time against the cluster performance (for example, by creating indexes at off-peak time).
Parameter adjustment:
Currently, you can use `tidb_ddl_reorg_worker_cnt` and `tidb_ddl_reorg_batch_size` to dynamically adjust the speed of index creation. Usually, the smaller their values, the smaller the impact on the system, but the longer the execution time.
In general cases, you can first keep their default values (`4` and `256`), observe the resource usage and response speed of the cluster, and then increase the value of `tidb_ddl_reorg_worker_cnt` to increase the concurrency. If no obvious jitter is observed in the monitor, increase the value of `tidb_ddl_reorg_batch_size`. Note that if the columns involved in the index creation are frequently updated, the many resulting conflicts cause the index creation to fail and be retried.
In addition, you can set the value of `tidb_ddl_reorg_priority` to `PRIORITY_HIGH` to prioritize the index creation and speed up the process. However, in a general OLTP system, it is recommended to keep its default value.
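A minimal sketch of the adjustments described above; the specific values are illustrative, and `tidb_ddl_reorg_priority` is assumed to be a session-scope variable:

```sql
-- Raise the reorg concurrency and batch size gradually while watching the monitor.
SET GLOBAL tidb_ddl_reorg_worker_cnt = 8;
SET GLOBAL tidb_ddl_reorg_batch_size = 512;

-- Optionally prioritize the index creation in the current session; keep the
-- default in general OLTP systems.
SET tidb_ddl_reorg_priority = 'PRIORITY_HIGH';

-- Check the progress of the running DDL jobs.
ADMIN SHOW DDL JOBS;
```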
High GC pressure
TiDB transactions adopt the Multi-Version Concurrency Control (MVCC) mechanism. When newly written data overwrites old data, the old data is not replaced; both versions are kept and marked with timestamps. The task of GC is to clear the obsolete data.
- In the Resolve Locks phase, a large number of `scan_lock` requests are created in TiKV, which can be observed in the gRPC-related metrics. These `scan_lock` requests call all Regions.
- In the Delete Ranges phase, a few (or no) `unsafe_destroy_range` requests are sent to TiKV, which can be observed in the gRPC-related metrics and the GC tasks panel.
- In the Do GC phase, each TiKV by default scans the leader Regions on the machine and performs GC on each leader, which can be observed in the GC tasks panel.