- Docs Home
- About TiDB
- Quick Start
- Develop
- Overview
- Quick Start
- Build a TiDB Cluster in TiDB Cloud (Developer Tier)
- CRUD SQL in TiDB
- Build a Simple CRUD App with TiDB
- Example Applications
- Connect to TiDB
- Design Database Schema
- Write Data
- Read Data
- Transaction
- Optimize
- Troubleshoot
- Reference
- Cloud Native Development Environment
- Third-party Support
- Deploy
- Software and Hardware Requirements
- Environment Configuration Checklist
- Plan Cluster Topology
- Install and Start
- Verify Cluster Status
- Test Cluster Performance
- Migrate
- Overview
- Migration Tools
- Migration Scenarios
- Migrate from Aurora
- Migrate MySQL of Small Datasets
- Migrate MySQL of Large Datasets
- Migrate and Merge MySQL Shards of Small Datasets
- Migrate and Merge MySQL Shards of Large Datasets
- Migrate from CSV Files
- Migrate from SQL Files
- Migrate from One TiDB Cluster to Another TiDB Cluster
- Migrate from TiDB to MySQL-compatible Databases
- Advanced Migration
- Integrate
- Overview
- Integration Scenarios
- Maintain
- Monitor and Alert
- Troubleshoot
- TiDB Troubleshooting Map
- Identify Slow Queries
- Analyze Slow Queries
- SQL Diagnostics
- Identify Expensive Queries Using Top SQL
- Identify Expensive Queries Using Logs
- Statement Summary Tables
- Troubleshoot Hotspot Issues
- Troubleshoot Increased Read and Write Latency
- Save and Restore the On-Site Information of a Cluster
- Troubleshoot Cluster Setup
- Troubleshoot High Disk I/O Usage
- Troubleshoot Lock Conflicts
- Troubleshoot TiFlash
- Troubleshoot Write Conflicts in Optimistic Transactions
- Troubleshoot Inconsistency Between Data and Indexes
- Performance Tuning
- Tuning Guide
- Configuration Tuning
- System Tuning
- Software Tuning
- SQL Tuning
- Overview
- Understanding the Query Execution Plan
- SQL Optimization Process
- Overview
- Logic Optimization
- Physical Optimization
- Prepare Execution Plan Cache
- Control Execution Plans
- Tutorials
- TiDB Tools
- Overview
- Use Cases
- Download
- TiUP
- Documentation Map
- Overview
- Terminology and Concepts
- Manage TiUP Components
- FAQ
- Troubleshooting Guide
- Command Reference
- Overview
- TiUP Commands
- TiUP Cluster Commands
- Overview
- tiup cluster audit
- tiup cluster check
- tiup cluster clean
- tiup cluster deploy
- tiup cluster destroy
- tiup cluster disable
- tiup cluster display
- tiup cluster edit-config
- tiup cluster enable
- tiup cluster help
- tiup cluster import
- tiup cluster list
- tiup cluster patch
- tiup cluster prune
- tiup cluster reload
- tiup cluster rename
- tiup cluster replay
- tiup cluster restart
- tiup cluster scale-in
- tiup cluster scale-out
- tiup cluster start
- tiup cluster stop
- tiup cluster template
- tiup cluster upgrade
- TiUP DM Commands
- Overview
- tiup dm audit
- tiup dm deploy
- tiup dm destroy
- tiup dm disable
- tiup dm display
- tiup dm edit-config
- tiup dm enable
- tiup dm help
- tiup dm import
- tiup dm list
- tiup dm patch
- tiup dm prune
- tiup dm reload
- tiup dm replay
- tiup dm restart
- tiup dm scale-in
- tiup dm scale-out
- tiup dm start
- tiup dm stop
- tiup dm template
- tiup dm upgrade
- TiDB Cluster Topology Reference
- DM Cluster Topology Reference
- Mirror Reference Guide
- TiUP Components
- PingCAP Clinic Diagnostic Service
- TiDB Operator
- Dumpling
- TiDB Lightning
- TiDB Data Migration
- About TiDB Data Migration
- Architecture
- Quick Start
- Deploy a DM cluster
- Tutorials
- Advanced Tutorials
- Maintain
- Cluster Upgrade
- Tools
- Performance Tuning
- Manage Data Sources
- Manage Tasks
- Export and Import Data Sources and Task Configurations of Clusters
- Handle Alerts
- Daily Check
- Reference
- Architecture
- Command Line
- Configuration Files
- OpenAPI
- Compatibility Catalog
- Secure
- Monitoring and Alerts
- Error Codes
- Glossary
- Example
- Troubleshoot
- Release Notes
- Backup & Restore (BR)
- Point-in-Time Recovery
- TiDB Binlog
- TiCDC
- Dumpling
- sync-diff-inspector
- TiSpark
- Reference
- Cluster Architecture
- Key Monitoring Metrics
- Secure
- Privileges
- SQL
- SQL Language Structure and Syntax
- SQL Statements
ADD COLUMN
ADD INDEX
ADMIN
ADMIN CANCEL DDL
ADMIN CHECKSUM TABLE
ADMIN CHECK [TABLE|INDEX]
ADMIN SHOW DDL [JOBS|QUERIES]
ADMIN SHOW TELEMETRY
ALTER DATABASE
ALTER INDEX
ALTER INSTANCE
ALTER PLACEMENT POLICY
ALTER TABLE
ALTER TABLE COMPACT
ALTER TABLE SET TIFLASH MODE
ALTER USER
ANALYZE TABLE
BACKUP
BATCH
BEGIN
CHANGE COLUMN
COMMIT
CHANGE DRAINER
CHANGE PUMP
CREATE [GLOBAL|SESSION] BINDING
CREATE DATABASE
CREATE INDEX
CREATE PLACEMENT POLICY
CREATE ROLE
CREATE SEQUENCE
CREATE TABLE LIKE
CREATE TABLE
CREATE USER
CREATE VIEW
DEALLOCATE
DELETE
DESC
DESCRIBE
DO
DROP [GLOBAL|SESSION] BINDING
DROP COLUMN
DROP DATABASE
DROP INDEX
DROP PLACEMENT POLICY
DROP ROLE
DROP SEQUENCE
DROP STATS
DROP TABLE
DROP USER
DROP VIEW
EXECUTE
EXPLAIN ANALYZE
EXPLAIN
FLASHBACK TABLE
FLUSH PRIVILEGES
FLUSH STATUS
FLUSH TABLES
GRANT <privileges>
GRANT <role>
INSERT
KILL [TIDB]
LOAD DATA
LOAD STATS
MODIFY COLUMN
PREPARE
RECOVER TABLE
RENAME INDEX
RENAME TABLE
REPLACE
RESTORE
REVOKE <privileges>
REVOKE <role>
ROLLBACK
SAVEPOINT
SELECT
SET DEFAULT ROLE
SET [NAMES|CHARACTER SET]
SET PASSWORD
SET ROLE
SET TRANSACTION
SET [GLOBAL|SESSION] <variable>
SHOW ANALYZE STATUS
SHOW [BACKUPS|RESTORES]
SHOW [GLOBAL|SESSION] BINDINGS
SHOW BUILTINS
SHOW CHARACTER SET
SHOW COLLATION
SHOW [FULL] COLUMNS FROM
SHOW CONFIG
SHOW CREATE PLACEMENT POLICY
SHOW CREATE SEQUENCE
SHOW CREATE TABLE
SHOW CREATE USER
SHOW DATABASES
SHOW DRAINER STATUS
SHOW ENGINES
SHOW ERRORS
SHOW [FULL] FIELDS FROM
SHOW GRANTS
SHOW INDEX [FROM|IN]
SHOW INDEXES [FROM|IN]
SHOW KEYS [FROM|IN]
SHOW MASTER STATUS
SHOW PLACEMENT
SHOW PLACEMENT FOR
SHOW PLACEMENT LABELS
SHOW PLUGINS
SHOW PRIVILEGES
SHOW [FULL] PROCESSSLIST
SHOW PROFILES
SHOW PUMP STATUS
SHOW SCHEMAS
SHOW STATS_HEALTHY
SHOW STATS_HISTOGRAMS
SHOW STATS_META
SHOW STATUS
SHOW TABLE NEXT_ROW_ID
SHOW TABLE REGIONS
SHOW TABLE STATUS
SHOW [FULL] TABLES
SHOW [GLOBAL|SESSION] VARIABLES
SHOW WARNINGS
SHUTDOWN
SPLIT REGION
START TRANSACTION
TABLE
TRACE
TRUNCATE
UPDATE
USE
WITH
- Data Types
- Functions and Operators
- Overview
- Type Conversion in Expression Evaluation
- Operators
- Control Flow Functions
- String Functions
- Numeric Functions and Operators
- Date and Time Functions
- Bit Functions and Operators
- Cast Functions and Operators
- Encryption and Compression Functions
- Locking Functions
- Information Functions
- JSON Functions
- Aggregate (GROUP BY) Functions
- Window Functions
- Miscellaneous Functions
- Precision Math
- Set Operations
- List of Expressions for Pushdown
- TiDB Specific Functions
- Clustered Indexes
- Constraints
- Generated Columns
- SQL Mode
- Table Attributes
- Transactions
- Garbage Collection (GC)
- Views
- Partitioning
- Temporary Tables
- Cached Tables
- Character Set and Collation
- Placement Rules in SQL
- System Tables
mysql
- INFORMATION_SCHEMA
- Overview
ANALYZE_STATUS
CLIENT_ERRORS_SUMMARY_BY_HOST
CLIENT_ERRORS_SUMMARY_BY_USER
CLIENT_ERRORS_SUMMARY_GLOBAL
CHARACTER_SETS
CLUSTER_CONFIG
CLUSTER_HARDWARE
CLUSTER_INFO
CLUSTER_LOAD
CLUSTER_LOG
CLUSTER_SYSTEMINFO
COLLATIONS
COLLATION_CHARACTER_SET_APPLICABILITY
COLUMNS
DATA_LOCK_WAITS
DDL_JOBS
DEADLOCKS
ENGINES
INSPECTION_RESULT
INSPECTION_RULES
INSPECTION_SUMMARY
KEY_COLUMN_USAGE
METRICS_SUMMARY
METRICS_TABLES
PARTITIONS
PLACEMENT_POLICIES
PROCESSLIST
REFERENTIAL_CONSTRAINTS
SCHEMATA
SEQUENCES
SESSION_VARIABLES
SLOW_QUERY
STATISTICS
TABLES
TABLE_CONSTRAINTS
TABLE_STORAGE_STATS
TIDB_HOT_REGIONS
TIDB_HOT_REGIONS_HISTORY
TIDB_INDEXES
TIDB_SERVERS_INFO
TIDB_TRX
TIFLASH_REPLICA
TIKV_REGION_PEERS
TIKV_REGION_STATUS
TIKV_STORE_STATUS
USER_PRIVILEGES
VARIABLES_INFO
VIEWS
METRICS_SCHEMA
- UI
- TiDB Dashboard
- Overview
- Maintain
- Access
- Overview Page
- Cluster Info Page
- Top SQL Page
- Key Visualizer Page
- Metrics Relation Graph
- SQL Statements Analysis
- Slow Queries Page
- Cluster Diagnostics
- Monitoring Page
- Search Logs Page
- Instance Profiling
- Session Management and Configuration
- FAQ
- CLI
- Command Line Flags
- Configuration File Parameters
- System Variables
- Storage Engines
- Telemetry
- Errors Codes
- Table Filter
- Schedule Replicas by Topology Labels
- FAQs
- Release Notes
- All Releases
- Release Timeline
- TiDB Versioning
- TiDB Installation Packages
- v6.2
- v6.1
- v6.0
- v5.4
- v5.3
- v5.2
- v5.1
- v5.0
- v4.0
- v3.1
- v3.0
- v2.1
- v2.0
- v1.0
- Glossary
PD Scheduling Best Practices
This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts:
- leader/follower/learner
- operator
- operator step
- pending/down
- region/peer/Raft group
- region split
- scheduler
- store
This document initially targets TiDB 3.0. Although some features are not supported in earlier versions (2.x), the underlying mechanisms are similar and this document can still be used as a reference.
PD scheduling policies
This section introduces the principles and processes involved in the scheduling system.
Scheduling process
The scheduling process generally has three steps:
Collect information
Each TiKV node periodically reports two types of heartbeats to PD:
StoreHeartbeat
: Contains the overall information of stores, including disk capacity, available storage, and read/write trafficRegionHeartbeat
: Contains the overall information of regions, including the range of each region, peer distribution, peer status, data volume, and read/write traffic
PD collects and restores this information for scheduling decisions.
Generate operators
Different schedulers generate the operators based on their own logic and requirements, with the following considerations:
- Do not add peers to a store in abnormal states (disconnected, down, busy, low space)
- Do not balance regions in abnormal states
- Do not transfer a leader to a pending peer
- Do not remove a leader directly
- Do not break the physical isolation of various region peers
- Do not violate constraints such as label property
Execute operators
To execute the operators, the general procedure is:
The generated operator first joins a queue managed by
OperatorController
.OperatorController
takes the operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step is to assign each operator step to the corresponding region leader.The operator is marked as "finish" or "timeout" and removed from the queue.
Load balancing
Region primarily relies on balance-leader
and balance-region
schedulers to achieve load balance. Both schedulers target distributing regions evenly across all stores in the cluster but with separate focuses: balance-leader
deals with region leader to balance incoming client requests, whereas balance-region
concerns itself with each region peer to redistribute the pressure of storage and avoid exceptions like out of storage space.
balance-leader
and balance-region
share a similar scheduling process:
- Rate stores according to their resource availability.
balance-leader
orbalance-region
constantly transfer leaders or peers from stores with high scores to those with low scores.
However, their rating methods are different. balance-leader
uses the sum of all region sizes corresponding to leaders in a store, whereas the way of balance-region
is relatively complicated. Depending on the specific storage capacity of each node, the rating method of balance-region
might:
- based on the amount of data when there is sufficient storage (to balance data distribution among nodes).
- based on the available storage when there is insufficient storage (to balance the storage availability on different nodes).
- based on the weighted sum of the two factors above when neither of the situations applies.
Because different nodes might differ in performance, you can also set the weight of load balancing for different stores. leader-weight
and region-weight
are used to control the leader weight and region weight respectively ("1" by default for both). For example, when the leader-weight
of a store is set to "2", the number of leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the leader-weight
of a store is set to "0.5", the number of leaders on the node is about half as many as that of other nodes.
Hot regions scheduling
For hot regions scheduling, use hot-region-scheduler
. Since TiDB v3.0, the process is performed as follows:
Count hot regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by stores.
Redistribute these regions in a similar way to load balancing.
For hot write regions, hot-region-scheduler
attempts to redistribute both region peers and leaders; for hot read regions, hot-region-scheduler
only redistributes region leaders.
Cluster topology awareness
Cluster topology awareness enables PD to distribute replicas of a region as much as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all regions in the background. When PD finds that the distribution of regions is not optimal, it generates an operator to replace peers and redistribute regions.
The component to check region distribution is replicaChecker
, which is similar to a scheduler except that it cannot be disabled. replicaChecker
schedules based on the the configuration of location-labels
. For example, [zone,rack,host]
defines a three-tier topology for a cluster. PD attempts to schedule region peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on.
Scale-down and failure recovery
Scale-down refers to the process when you take a store offline and mark it as "offline" using a command. PD replicates the regions on the offline node to other nodes by scheduling. Failure recovery applies when stores failed and cannot be recovered. In this case, regions with peers distributed on the corresponding store might lose replicas, which requires PD to replenish on other nodes.
The processes of scale-down and failure recovery are basically the same. replicaChecker
finds a region peer in abnormal states, and then generates an operator to replace the abnormal peer with a new one on a healthy store.
Region merge
Region merge refers to the process of merging adjacent small regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty regions after data deletion. Region merge is performed by mergeChecker
, which processes in a similar way to replicaChecker
: PD continuously scans all regions in the background, and generates an operator when contiguous small regions are found.
Specifically, when a newly split Region exists for more than the value of split-merge-interval
(1h
by default), if any of the following conditions occurs, this Region triggers the Region merge scheduling:
The size of this Region is smaller than the value of the
max-merge-region-size
(20 MiB by default)The number of keys in this Region is smaller than the value of
max-merge-region-keys
(200,000 by default).
Query scheduling status
You can check the status of scheduling system through metrics, pd-ctl and logs. This section briefly introduces the methods of metrics and pd-ctl. Refer to PD Monitoring Metrics and PD Control for details.
Operator status
The Grafana PD/Operator page shows the metrics about operators, among which:
- Schedule operator create: Operator creating information
- Operator finish duration: Execution time consumed by each operator
- Operator step duration: Execution time consumed by the operator step
You can query operators using pd-ctl with the following commands:
operator show
: Queries all operators generated in the current scheduling taskoperator show [admin | leader | region]
: Queries operators by type
Balance status
The Grafana PD/Statistics - Balance page shows the metrics about load balancing, among which:
- Store leader/region score: Score of each store
- Store leader/region count: The number of leaders/regions in each store
- Store available: Available storage on each store
You can use store commands of pd-ctl to query balance status of each store.
Hot Region status
The Grafana PD/Statistics - hotspot page shows the metrics about hot regions, among which:
- Hot write region's leader/peer distribution: the leader/peer distribution in hot write regions
- Hot read region's leader distribution: the leader distribution in hot read regions
You can also query the status of hot regions using pd-ctl with the following commands:
hot read
: Queries hot read regionshot write
: Queries hot write regionshot store
: Queries the distribution of hot regions by storeregion topread [limit]
: Queries the region with top read trafficregion topwrite [limit]
: Queries the region with top write traffic
Region health
The Grafana PD/Cluster/Region health panel shows the metrics about regions in abnormal states.
You can query the list of regions in abnormal states using pd-ctl with region check commands:
region check miss-peer
: Queries regions without enough peersregion check extra-peer
: Queries regions with extra peersregion check down-peer
: Queries regions with down peersregion check pending-peer
: Queries regions with pending peers
Control scheduling strategy
You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to PD Control for more details.
Add/delete scheduler manually
PD supports dynamically adding and removing schedulers directly through pd-ctl. For example:
scheduler show
: Shows currently running schedulers in the systemscheduler remove balance-leader-scheduler
: Removes (disable) balance-leader-schedulerscheduler add evict-leader-scheduler 1
: Adds a scheduler to remove all leaders in Store 1
Add/delete Operators manually
PD also supports adding or removing operators directly through pd-ctl. For example:
operator add add-peer 2 5
: Adds peers to Region 2 in Store 5operator add transfer-leader 2 5
: Migrates the leader of Region 2 to Store 5operator add split-region 2
: Splits Region 2 into two regions evenly in sizeoperator remove 2
: Removes currently pending operator in Region 2
Adjust scheduling parameter
You can check the scheduling configuration using the config show
command in pd-ctl, and adjust the values using config set {key} {value}
. Common adjustments include:
leader-schedule-limit
: Controls the concurrency of transferring leader schedulingregion-schedule-limit
: Controls the concurrency of adding/deleting peer schedulingenable-replace-offline-replica
: Determines whether to enable the scheduling to take nodes offlineenable-location-replacement
: Determines whether to enable the scheduling that handles the isolation level of regionsmax-snapshot-count
: Controls the maximum concurrency of sending/receiving snapshots for each store
PD scheduling in common scenarios
This section illustrates the best practices of PD scheduling strategies through several typical scenarios.
Leaders/regions are not evenly distributed
The rating mechanism of PD determines that leader count and region count of different stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is load imbalancing from the actual load of TiKV or storage usage.
Once you have confirmed that leaders/region are not evenly distributed, you need to check the rating of different stores.
If the scores of different stores are close, it means PD mistakenly believes that leaders/regions are evenly distributed. Possible reasons are:
- There are hot regions that cause load imbalancing. In this case, you need to analyze further based on hot regions scheduling.
- There are a large number of empty regions or small regions, which leads to a great difference in the number of leaders in different stores and high pressure on Raft store. This is the time for a region merge scheduling.
- Hardware and software environment varies among stores. You can adjust the values of
leader-weight
andregion-weight
accordingly to control the distribution of leader/region. - Other unknown reasons. Still you can adjust the values of
leader-weight
andregion-weight
to control the distribution of leader/region.
If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations:
When operators are generated normally but the scheduling process is slow, it is possible that:
- The scheduling speed is limited by default for load balancing purpose. You can adjust
leader-schedule-limit
orregion-schedule-limit
to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified bymax-pending-peer-count
andmax-snapshot-count
. - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of
region-schedule-limit
. In this case, you can limit the speed of scheduler to remove nodes, or simply setenable-replace-offline-replica = false
to temporarily disable it. - The scheduling process is too slow. You can check the Operator step duration metric to confirm the cause. Generally, steps that do not involve sending and receiving snapshots (such as
TransferLeader
,RemovePeer
,PromoteLearner
) should be completed in milliseconds, while steps that involve snapshots (such asAddLearner
andAddPeer
) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or bottleneck in network, etc., which needs specific analysis.
- The scheduling speed is limited by default for load balancing purpose. You can adjust
PD fails to generate the corresponding balancing scheduler. Possible reasons include:
- The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit it set to "0".
- Other constraints. For example,
evict-leader-scheduler
in the system prevents leaders from being migrating to the corresponding store. Or label property is set, which makes some stores reject leaders. - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally.
Taking nodes offline is slow
This scenario requires examining the generation and execution of operators through related metrics.
If operators are successfully generated but the scheduling process is slow, possible reasons are:
- The scheduling speed is limited by default. You can adjust
leader-schedule-limit
orreplica-schedule-limit
to larger value.s Similarly, you can consider loosening the limits onmax-pending-peer-count
andmax-snapshot-count
. - Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in Leaders/regions are not evenly distributed.
- When you take a single node offline, a number of region leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an
evict-leader-scheduler
to migrate leaders.
If the corresponding operator fails to generate, possible reasons are:
- The operator is stopped, or
replica-schedule-limit
is set to "0". - There is no proper node for region migration. For example, if the available capacity size of the replacing node (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage on that node. In such case, you need to add more nodes or delete some data to free the space.
Bringing nodes online is slow
Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to Leaders/regions are not evenly distributed for troubleshooting.
Hot regions are not evenly distributed
Hot regions scheduling issues generally fall into the following categories:
Hot regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot regions in time.
Solution: adjust
hot-region-schedule-limit
to a larger value, and reduce the limit quota of other schedulers to speed up hot regions scheduling. Or you can adjusthot-region-cache-hits-threshold
to a smaller value to make PD more sensitive to traffic changes.Hotspot formed on a single region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a
split-region
operator to split such a region.The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules.
Solutions: Firstly, locate the table where hot regions are formed based on the specific business. Then add a
scatter-range-scheduler
scheduler to make all regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to TiDB HTTP API for more details.
Region merge is slow
Similar to slow scheduling, the speed of region merge is most likely limited by the configurations of merge-schedule-limit
and region-schedule-limit
, or the region merge scheduler is competing with other schedulers. Specifically, the solutions are:
If it is known from metrics that there are a large number of empty regions in the system, you can adjust
max-merge-region-size
andmax-merge-region-keys
to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can setpatrol-region-interval
to10ms
(the default value of this configuration item is10ms
in v5.3.0 and later versions of TiDB). This makes region scanning faster at the cost of more CPU consumption.A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters:
TiKV: Set
split-region-on-table
tofalse
. You cannot modify the parameter dynamically.PD: Use PD Control to set the parameters required by your cluster situation.
Suppose that your cluster has no TiDB instance, and the value of
key-type
is set toraw
ortxn
. In this case, PD can merge Regions across tables, regardless of the value ofenable-cross-table-merge setting
. You can modify thekey-type
parameter dynamically.config set key-type txn
Suppose that your cluster has a TiDB instance, and the value of
key-type
is set totable
. In this case, PD can merge Regions across tables only if the value ofenable-cross-table-merge
is set totrue
. You can modify thekey-type
parameter dynamically.config set enable-cross-table-merge true
If the modification does not take effect, refer to FAQ - Why the modified
toml
configuration for TiKV/PD does not take effect?.NoteAfter enabling Placement Rules, properly switch the value of
key-type
to avoid the failure of decoding.
For v3.0.4 and v2.1.16 or earlier, the approximate_keys
of regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of max-merge-region-keys
. To avoid this problem, you can adjust max-merge-region-keys
to a larger value.
Troubleshoot TiKV node
If a TiKV node fails, PD defaults to setting the corresponding node to the down state after 30 minutes (customizable by configuration item max-store-down-time
), and rebalancing replicas for regions involved.
Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas soon in another node and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust max-store-down-time
to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout.
In TiDB v5.2.0, TiKV introduces the mechanism of slow TiKV node detection. By sampling the requests in TiKV, this mechanism works out a score ranging from 1 to 100. A TiKV node with a score higher than or equal to 80 is marked as slow. You can add evict-slow-store-scheduler
to detect and schedule slow nodes. If only one TiKV is detected as slow, and the slow score reaches the upper limit (100 by default), the leader in this node will be evicted (similar to the effect of evict-leader-scheduler
).