Repairing a failed repmgr cluster

PostgreSQL

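Symptoms

The repmgr cluster cannot be connected to.

Cause

The database could not allocate memory and crashed.

Solution

1. Check the cluster status to identify the primary and standby nodes and see how each node is running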

[root@slave ~]# repmgr cluster show 
 ID | Name          | Role    | Status    | Upstream      | Location | Priority | Replication lag | Last replayed LSN
----+---------------+---------+-----------+---------------+----------+----------+-----------------+-------------------
 1  | x.x.0.121 | primary | -failed |               | default  | 100      | n/a             | none             
 2  | x.x.0.122 | standby |   ?unreachable | x.x.0.121 | default  | 100      | 5232GB         | 0/0

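The cluster status shows that both the primary and the standby are in trouble: the primary is reported as failed and the standby as unreachable.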
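When this check has to be repeated, the Status column can be filtered mechanically instead of read by eye. The following is only a minimal sketch and not part of the original procedure: the function name unhealthy_nodes is hypothetical, and it assumes the table layout shown above, with the node name in the second column and the status in the fourth.

```shell
# Hypothetical helper: reads `repmgr cluster show` output on stdin and
# prints every node whose Status column does not contain "running".
unhealthy_nodes() {
  awk -F'|' '
    {
      id = $1; name = $2; status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      # Only data rows start with a numeric node ID; this skips the
      # header row and the separator row.
      if (id ~ /^[0-9]+$/ && status !~ /running/) print name ": " status
    }'
}

# Usage: repmgr cluster show | unhealthy_nodes
```

An empty result would suggest that every listed node reports running.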

[root@slave ~]# ps -ef | grep post
root      1091     1  0 Apr15 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   1094  1091  0 Apr15 ?        00:00:00 qmgr -l -t unix -u
postfix   4743  1091  0 08:37 ?        00:00:00 pickup -l -t unix -u
root      4953  4821  0 09:41 pts/0    00:00:00 grep --color=auto post

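The process list shows no database process on either node (the only matches are postfix mail processes), which means the database is down on both nodes.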
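Because grep post also matches postfix, a stricter filter on the exact command name can make the result easier to read. This is a sketch under assumptions: the function name postgres_pids is hypothetical, and it assumes the server binary is called postgres or postmaster.

```shell
# Hypothetical helper: reads "PID COMMAND" pairs on stdin and prints the
# PIDs whose command name is exactly postgres or postmaster.
postgres_pids() {
  awk '$2 == "postgres" || $2 == "postmaster" { print $1 }'
}

# Usage: ps -eo pid=,comm= | postgres_pids
```

No output means no matching database process was found on that node.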

2. Check the database timeline on both nodes

[root@slave ~]# pg_controldata |grep TimeLineID
Latest checkpoint's TimeLineID:       11
Latest checkpoint's PrevTimeLineID:   11

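Both nodes are on timeline 11, which indicates that no primary/standby switchover took place before the failure. We therefore rebuild the cluster with the original primary as the primary, and back up the standby's data before touching it.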
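To compare the two nodes without reading the full pg_controldata output, the timeline can be extracted as a bare number. A minimal sketch, assuming the output format shown above; the function name timeline_of is hypothetical.

```shell
# Hypothetical helper: reads pg_controldata output on stdin and prints
# the value of "Latest checkpoint's TimeLineID" only.
timeline_of() {
  # "checkpoint.s" matches the apostrophe without breaking the shell
  # quoting; the PrevTimeLineID line does not match this pattern.
  awk -F':' '/^Latest checkpoint.s TimeLineID/ {
    gsub(/[ \t]/, "", $2)
    print $2
    exit
  }'
}

# Usage on each node: pg_controldata | timeline_of
```

If the numbers printed on the two nodes differ, the reasoning in this step no longer applies and the choice of primary needs a closer look.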

3. Check the database log to determine why the nodes failed

find / -iname hgdb_log -print
cat highgodb_11.log

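The log reports that memory could not be allocated. free -h shows that the server has 14 GB of memory in total and that memory usage is too high. The database parameters look normal (shared_buffers=4GB), so we recommended that the customer add memory.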
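Searching the log for memory errors is usually faster than paging through it with cat. The exact wording of the error is not quoted here, so the patterns below are assumptions based on messages PostgreSQL commonly emits in this situation; the function name find_memory_errors is hypothetical.

```shell
# Hypothetical helper: prints the line numbers and text of log lines
# that look like memory allocation failures.
# The two patterns are assumptions, not taken from the original log.
find_memory_errors() {
  grep -n -i -E 'out of memory|cannot allocate memory' "$1"
}

# Usage: find_memory_errors highgodb_11.log
```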

4. Start the primary node and register it as the primary

pg_ctl start
repmgr  primary register -F

5. Check whether the VIP has come up on the primary

ip a
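The VIP check can also be expressed as a yes/no test. This is a sketch only: the VIP address is not given in this article, so 192.0.2.10 below is a placeholder, and the function name has_vip is hypothetical.

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and
# succeeds if the given address is configured on any interface.
has_vip() {
  awk -v vip="$1" '
    {
      for (i = 2; i <= NF; i++) {
        # The address follows the "inet" keyword as ADDRESS/PREFIX.
        split($i, addr, "/")
        if ($(i - 1) == "inet" && addr[1] == vip) found = 1
      }
    }
    END { exit (found ? 0 : 1) }'
}

# Usage (192.0.2.10 is a placeholder for the real VIP):
#   ip -o -4 addr show | has_vip 192.0.2.10 && echo "VIP present"
```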

6. Back up the standby's data directory, rebuild the standby from the primary, and register it as a standby

cd /highgo
mv data data_bak20240323
pg_basebackup -h <primary_ip> -U highgo -D /highgo/data -Fp -P -Xs -R -v
pg_ctl start
repmgr  standby register -F
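The backup directory name above carries a hand-typed date. A small sketch that derives the suffix from the current date instead; the function name backup_name is hypothetical and the naming scheme simply mirrors the one used above.

```shell
# Hypothetical helper: prints NAME_bakYYYYMMDD for today's date.
backup_name() {
  printf '%s_bak%s\n' "$1" "$(date +%Y%m%d)"
}

# Usage, run from /highgo with the standby database stopped:
#   mv data "$(backup_name data)"
```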

7. Check the cluster status

[root@slave ~]# repmgr cluster show 
 ID | Name          | Role    | Status    | Upstream      | Location | Priority | Replication lag | Last replayed LSN
----+---------------+---------+-----------+---------------+----------+----------+-----------------+-------------------
 1  | x.x.0.121 | primary | * running |               | default  | 100      | n/a             | none             
 2  | x.x.0.122 | standby |   running | x.x.0.121 | default  | 100      | 0 bytes         | 0/70007D0

Both the primary and the standby now report a status of running, and Replication lag is 0 bytes.

[root@hs02 ~]# ps -ef|grep postg
root      20568      1  0 17:37 ?        00:00:00 /highgo/database/4.5/bin/postgres -D /highgo/database/4.5/data
root      20569  20568  0 17:37 ?        00:00:00 postgres: logger process   
root      20570  20568  0 17:37 ?        00:00:00 postgres: startup process   recovering 000000010000000000000007
root      20571  20568  0 17:37 ?        00:00:00 postgres: checkpointer process   
root      20572  20568  0 17:37 ?        00:00:00 postgres: writer process   
root      20573  20568  0 17:37 ?        00:00:00 postgres: stats collector process   
root      20574  20568  0 17:37 ?        00:00:00 postgres: wal receiver process   streaming 0/70006F0
root      20585  20568  0 17:37 ?        00:00:00 postgres: sysdba highgo x.x.0.122(13382) idle

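Check the streaming replication state: database processes are now present on both nodes, and the standby has a wal receiver process, which means it is receiving the WAL stream from the primary.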

8. Check whether the repmgrd daemon is running

ps -ef | grep repmgrd

9. If the daemon is not running, start it with the following command

repmgrd -d
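Steps 8 and 9 can be combined into a single guarded call so that the daemon is not started twice. A minimal sketch, assuming pgrep is available on the host; the function name ensure_running is hypothetical.

```shell
# Hypothetical helper: runs the given command only if no process with
# exactly the given name is currently running.
ensure_running() {
  name="$1"
  shift
  if pgrep -x "$name" >/dev/null 2>&1; then
    echo "$name is already running"
  else
    "$@"
  fi
}

# Usage: ensure_running repmgrd repmgrd -d
```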

10. Have the application check its log table; the data was confirmed to be normal.

Source: https://mp.weixin.qq.com/s/6xPHUbO-kM_G11w9qXppeQ

https://www.zhangyan1997.xyz/archives/repmgrji-qun-gu-zhang-xiu-fu

Author

张颜

Published

2025-07-22

Updated

2025-07-22

License

CC BY 4.0
