Postgres replica crashes after successful failover

I have the Crunchy Postgres Operator running on a Kubernetes cluster with 3 worker nodes deployed using Kubespray (bare metal), and I have set up one replica so it can take over when the primary goes down.
The replica's state was running and it was synced with the Postgres master with no lag. For testing purposes, I stopped the node that the master Postgres was running on; the failover to the replica completed, and Postgres became available again after a moment.
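
Roughly, the test procedure was the following (the commands are shown as an illustration rather than copied from my terminal, and the container name database is assumed from the operator's defaults):

# watch the Patroni view of the cluster from inside one of the postgres pods
kubectl -n prj-metal exec -it pg-metal-instance1-zdc6-0 -c database -- \
  watch -n 2 patronictl list

# then stop the worker node that currently hosts the primary,
# e.g. by powering it off from its own shell
sudo shutdown -h now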

When I restarted the stopped node, the Postgres instance on it crashed and its lag details became unknown:

Every 2.0s: patronictl list                                                                                                                                                        
+---------------------------+-----------------------------------------+---------+---------+----+-----------+
| Member                    | Host                                    | Role    | State   | TL | Lag in MB |
+ Cluster: pg-metal-ha (7075323376834977860) -------------------------+---------+---------+----+-----------+
| pg-metal-instance1-hfdp-0 | pg-metal-instance1-hfdp-0.pg-metal-pods | Replica | running |    |   unknown |
| pg-metal-instance1-zdc6-0 | pg-metal-instance1-zdc6-0.pg-metal-pods | Leader  | running |  2 |           |
+---------------------------+-----------------------------------------+---------+---------+----+-----------+

The log of the crashed instance pod is:

psycopg2.OperationalError: FATAL:  index "pg_database_oid_index" contains unexpected zero page at block 0
HINT:  Please REINDEX it.

The hint doesn't help: I can't reindex "pg_database_oid_index" using psql, because psql itself fails to connect. This is the output of the psql command:

bash-4.4$ psql
psql: error: FATAL:  index "pg_database_oid_index" contains unexpected zero page at block 0
HINT:  Please REINDEX it.
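
The hint can't really be followed in this state: every connection fails with the same error, and even if psql could connect, the crashed member is now a standby, where REINDEX cannot run. A possible workaround, assuming the replica's local data can be discarded, would be to rebuild that member from the current leader via Patroni (run from inside one of the database pods), for example:

bash-4.4$ patronictl reinit pg-metal-ha pg-metal-instance1-hfdp-0

But that would only mask the problem; it doesn't explain why the replica's data directory ends up corrupted after every failover.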

I have redone the failover test many times with newly created Postgres clusters and got the same result every time. Is this a bug in crunchy-postgres-operator?

k8s version:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

postgres.yaml:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg-metal
  namespace: prj-metal

spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:centos8-13.6-3.0-0
  postgresVersion: 13
  users:
    - name: pg
      options: "SUPERUSER"
  instances:
    - name: instance1
      replicas: 2
      dataVolumeClaimSpec:
        storageClassName: "ins-ls"
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 75Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: pg-metal
                postgres-operator.crunchydata.com/instance-set: instance1
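
For completeness, a manifest like this is applied with a plain kubectl apply, for example:

# assumed invocation; the original post does not show how the manifest was applied
kubectl apply -f postgres.yaml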
