Postgres replica crashes after successful failover
I have the Crunchy Postgres Operator running on a Kubernetes cluster with 3 worker nodes deployed using kubespray (bare metal). I have set up one replica to take over when the primary goes down.
The replica's state was running and it was synced with the Postgres primary with no lag. For testing purposes, I stopped the node that the primary Postgres was running on; failover to the replica completed, and Postgres became available again after a moment.
When I restarted the stopped node, the Postgres instance on it crashed and its lag details became unknown:
Every 2.0s: patronictl list
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg-metal-ha (7075323376834977860) -------------------------+---------+---------+----+-----------+
| pg-metal-instance1-hfdp-0 | pg-metal-instance1-hfdp-0.pg-metal-pods | Replica | running | | unknown |
| pg-metal-instance1-zdc6-0 | pg-metal-instance1-zdc6-0.pg-metal-pods | Leader | running | 2 | |
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
The log of the crashed instance's pod shows:
psycopg2.OperationalError: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
The hint didn't work: I can't reindex the index "pg_database_oid_index" using psql. This is the output of the psql command:
bash-4.4$ psql
psql: error: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
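Since every connection fails before a REINDEX can be issued, the usual workaround for a corrupted shared catalog index is single-user mode with system indexes ignored. This is only a sketch, assuming shell access to the crashed instance's container, the standard `PGDATA` environment variable set by the image, and that Patroni is prevented from restarting the server while it runs:

```shell
# Assumption: run inside the crashed instance's container, with Patroni
# paused so it does not restart postgres underneath us.
# -P (ignore_system_indexes) lets the backend start despite the broken index;
# the trailing "postgres" is the database name to connect to.
echo 'REINDEX INDEX pg_database_oid_index;' | postgres --single -P -D "$PGDATA" postgres
```

If the corruption extends beyond this one index (which repeated zero-page errors after failover may suggest), a reindex alone may not be enough.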
I have redone the failover test many times with newly created Postgres clusters and got the same result each time. Is this a bug in crunchy-postgres-operator?
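If the broken member's data directory is unrecoverable, Patroni can discard it and re-clone the member from the leader instead of recreating the whole cluster. A sketch using the cluster and member names from the `patronictl list` output above, assuming `patronictl` is configured inside the pod:

```shell
# Assumption: run from inside a pod where patronictl is already configured
# (as in the operator's database containers).
# This wipes the member's data directory and re-initializes it from the leader.
patronictl reinit pg-metal-ha pg-metal-instance1-hfdp-0 --force
```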
k8s version:
# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
postgres.yaml:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: pg-metal
namespace: prj-metal
spec:
image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:centos8-13.6-3.0-0
postgresVersion: 13
users:
- name: pg
options: "SUPERUSER"
instances:
- name: instance1
replicas: 2
dataVolumeClaimSpec:
storageClassName: "ins-ls"
accessModes:
- "ReadWriteOnce"
resources:
requests:
storage: 75Gi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
postgres-operator.crunchydata.com/cluster: pg-metal
postgres-operator.crunchydata.com/instance-set: instance1