Skip to content

MGR 复制延迟临时优化

故障描述

在 MGR 集群节点恢复期间,组复制 recovery 通道出现严重复制延迟问题:

示例 SQL 输出显示延迟时长213214 秒(约 60 小时):

SELECT      'Recovery_Delay' as metric, 
            WORKER_ID, 
            APPLYING_TRANSACTION, 
            APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP as original_commit_time, 
            APPLYING_TRANSACTION_START_APPLY_TIMESTAMP as start_apply_time, 
            TIMESTAMPDIFF(SECOND,  
               APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP, 
               APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 
            ) as delay_seconds 
FROM performance_schema.replication_applier_status_by_worker 
WHERE CHANNEL_NAME = 'group_replication_recovery' AND APPLYING_TRANSACTION IS NOT NULL 
ORDER BY delay_seconds DESC;
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+
| metric         | WORKER_ID | APPLYING_TRANSACTION                          | original_commit_time       | start_apply_time           | delay_seconds |
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+
| Recovery_Delay |         1 | 72903f15-1d27-11f1-9fec-12f54fc6f25d:20903535 | 2026-03-28 04:40:26.847551 | 2026-03-30 15:54:01.241266 |        213214 |
| Recovery_Delay |         2 | 72903f15-1d27-11f1-9fec-12f54fc6f25d:20903536 | 2026-03-28 04:40:26.847512 | 2026-03-30 15:54:01.241293 |        213214 |
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+

解决方案

Step 1: 调整复制加速参数

通过临时无降低安全级别以提升复制效率:

# 临时优化参数调整
mysql -e "
-- 降低刷盘频率 + 提升并行 worker 线程数
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;
SET GLOBAL slave_parallel_workers = 8;
"

参数作用:

参数 原值 修改后 说明
innodb_flush_log_at_trx_commit 1 0 每事务刷盘 → 每秒一次刷盘
sync_binlog 1 0 每次 binlog fsync → 无强制
slave_parallel_workers 默认 8 增加并行回放线程数量

Step 2:等待同步完成后复原

待完全恢复到 ONLINE 状态后,立即复原参数以保持生产环境数据安全级别:

# 恢复完成后执行复原
mysql -e "
-- 复原为原配置
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;
-- SET GLOBAL slave_parallel_workers = ?; -- 可设回原值
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit', 'sync_binlog');
"

验证检查

方法 1:查看详细复制延迟 SQL

SELECT      'Recovery_Delay' as metric, 
            WORKER_ID, 
            APPLYING_TRANSACTION, 
            APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP as original_commit_time, 
            APPLYING_TRANSACTION_START_APPLY_TIMESTAMP as start_apply_time, 
            TIMESTAMPDIFF(SECOND,  
               APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP, 
               APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 
            ) as delay_seconds 
FROM performance_schema.replication_applier_status_by_worker 
WHERE CHANNEL_NAME = 'group_replication_recovery' AND APPLYING_TRANSACTION IS NOT NULL 
ORDER BY delay_seconds DESC;

方法 2:快速 GTID 进度检查

-- 查看当前节点已执行的 GTID
SELECT @@global.gtid_executed;

-- 查看恢复通道正在应用的事务号
SELECT APPLYING_TRANSACTION FROM performance_schema.replication_applier_status_by_worker WHERE CHANNEL_NAME='group_replication_recovery';