MGR 复制延迟临时优化
故障描述
在 MGR 集群节点恢复期间,组复制 recovery 通道出现严重复制延迟问题:
示例 SQL 输出显示延迟时长213214 秒(约 60 小时):
SELECT 'Recovery_Delay' as metric,
WORKER_ID,
APPLYING_TRANSACTION,
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP as original_commit_time,
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP as start_apply_time,
TIMESTAMPDIFF(SECOND,
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP
) as delay_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'group_replication_recovery' AND APPLYING_TRANSACTION IS NOT NULL
ORDER BY delay_seconds DESC;
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+
| metric | WORKER_ID | APPLYING_TRANSACTION | original_commit_time | start_apply_time | delay_seconds |
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+
| Recovery_Delay | 1 | 72903f15-1d27-11f1-9fec-12f54fc6f25d:20903535 | 2026-03-28 04:40:26.847551 | 2026-03-30 15:54:01.241266 | 213214 |
| Recovery_Delay | 2 | 72903f15-1d27-11f1-9fec-12f54fc6f25d:20903536 | 2026-03-28 04:40:26.847512 | 2026-03-30 15:54:01.241293 | 213214 |
+----------------+-----------+-----------------------------------------------+----------------------------+----------------------------+---------------+
解决方案
Step 1: 调整复制加速参数
通过临时无降低安全级别以提升复制效率:
# 临时优化参数调整
mysql -e "
-- 降低刷盘频率 + 提升并行 worker 线程数
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SET GLOBAL sync_binlog = 0;
SET GLOBAL slave_parallel_workers = 8;
"
参数作用:
| 参数 | 原值 | 修改后 | 说明 |
|---|---|---|---|
innodb_flush_log_at_trx_commit | 1 | 0 | 每事务刷盘 → 每秒一次刷盘 |
sync_binlog | 1 | 0 | 每次 binlog fsync → 无强制 |
slave_parallel_workers | 默认 | 8 | 增加并行回放线程数量 |
Step 2:等待同步完成后复原
待完全恢复到 ONLINE 状态后,立即复原参数以保持生产环境数据安全级别:
# 恢复完成后执行复原
mysql -e "
-- 复原为原配置
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;
-- SET GLOBAL slave_parallel_workers = ?; -- 可设回原值
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit', 'sync_binlog');
"
验证检查
方法 1:查看详细复制延迟 SQL
SELECT 'Recovery_Delay' as metric,
WORKER_ID,
APPLYING_TRANSACTION,
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP as original_commit_time,
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP as start_apply_time,
TIMESTAMPDIFF(SECOND,
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP
) as delay_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'group_replication_recovery' AND APPLYING_TRANSACTION IS NOT NULL
ORDER BY delay_seconds DESC;
方法 2:快速 GTID 进度检查
-- 查看当前节点已执行的 GTID
SELECT @@global.gtid_executed;
-- 查看恢复通道正在应用的事务号
SELECT APPLYING_TRANSACTION FROM performance_schema.replication_applier_status_by_worker WHERE CHANNEL_NAME='group_replication_recovery';