Postgres replica crashes after successful failover
I have the Crunchy Postgres Operator running on a Kubernetes cluster with 3 worker nodes deployed using kubespray (bare metal). I have set up one replica to take over when the primary goes down.
The replica's state was running and it was synced with the Postgres primary with no lag. For testing purposes, I stopped the node that the primary Postgres was running on; failover to the replica completed, and Postgres became available again after a moment.
When I restarted the stopped node, the Postgres instance on it crashed and its lag details became unknown:
Every 2.0s: patronictl list
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: pg-metal-ha (7075323376834977860) -------------------------+---------+---------+----+-----------+
| pg-metal-instance1-hfdp-0 | pg-metal-instance1-hfdp-0.pg-metal-pods | Replica | running | | unknown |
| pg-metal-instance1-zdc6-0 | pg-metal-instance1-zdc6-0.pg-metal-pods | Leader | running | 2 | |
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
The log of the crashed instance's pod shows:
psycopg2.OperationalError: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
The hint didn't work: I can't reindex the index "pg_database_oid_index" using psql. This is the output of the psql command:
bash-4.4$ psql
psql: error: FATAL: index "pg_database_oid_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
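Since every connection fails before a REINDEX can be issued, the usual workaround for a corrupted shared catalog index is single-user mode with system indexes ignored. This is only a sketch, assuming shell access to the crashed instance's container, the standard `PGDATA` environment variable set by the image, and that Patroni is prevented from restarting the server while it runs:

```shell
# Assumption: run inside the crashed instance's container, with Patroni
# paused so it does not restart postgres underneath us.
# -P (ignore_system_indexes) lets the backend start despite the broken index;
# the trailing "postgres" is the database name to connect to.
echo 'REINDEX INDEX pg_database_oid_index;' | postgres --single -P -D "$PGDATA" postgres
```

If the corruption extends beyond this one index (which repeated zero-page errors after failover may suggest), a reindex alone may not be enough.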
I have redone the failover test many times with newly created Postgres clusters and got the same result each time. Is this a bug in crunchy-postgres-operator?
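If the broken member's data directory is unrecoverable, Patroni can discard it and re-clone the member from the leader instead of recreating the whole cluster. A sketch using the cluster and member names from the `patronictl list` output above, assuming `patronictl` is configured inside the pod:

```shell
# Assumption: run from inside a pod where patronictl is already configured
# (as in the operator's database containers).
# This wipes the member's data directory and re-initializes it from the leader.
patronictl reinit pg-metal-ha pg-metal-instance1-hfdp-0 --force
```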
k8s version:
# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
postgres.yaml:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: pg-metal
namespace: prj-metal
spec:
image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:centos8-13.6-3.0-0
postgresVersion: 13
users:
- name: pg
options: "SUPERUSER"
instances:
- name: instance1
replicas: 2
dataVolumeClaimSpec:
storageClassName: "ins-ls"
accessModes:
- "ReadWriteOnce"
resources:
requests:
storage: 75Gi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
postgres-operator.crunchydata.com/cluster: pg-metal
postgres-operator.crunchydata.com/instance-set: instance1