Invalid resource manager ID in primary checkpoint record

Posted 2025-01-10 00:32:44

I've updated my Airbyte image from 0.35.2-alpha to 0.35.37-alpha.
[running in Kubernetes]

When the system rolled out, the db pod wouldn't terminate and I [a terrible mistake] deleted the pod.
When it came back up, I got this error -

PostgreSQL Database directory appears to contain a database; Skipping initialization

2022-02-24 20:19:44.065 UTC [1] LOG:  starting PostgreSQL 13.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-24 20:19:44.071 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-24 20:19:44.079 UTC [21] LOG:  database system was shut down at 2022-02-24 20:12:55 UTC
2022-02-24 20:19:44.079 UTC [21] LOG:  invalid resource manager ID in primary checkpoint record
2022-02-24 20:19:44.079 UTC [21] PANIC:  could not locate a valid checkpoint record
2022-02-24 20:19:44.530 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2022-02-24 20:19:44.530 UTC [1] LOG:  aborting startup due to startup process failure
2022-02-24 20:19:44.566 UTC [1] LOG:  database system is shut down

Pretty sure the WAL file is corrupted, but I'm not sure how to fix this.

Comments (4)

腻橙味 2025-01-17 00:32:44

Warning - there is a potential for data loss

This is a test system, so I wasn't concerned with keeping the latest transactions, and had no backup.

First, I overrode the container command to keep the container running without trying to start postgres.

...
    spec:
      containers:
        - name: airbyte-db-container
          image: airbyte/db
          command: ["sh"]
          args: ["-c", "while true; do echo $(date -u) >> /tmp/run.log; sleep 5; done"]
...

And spawned a shell on the pod -

kubectl exec -it -n airbyte airbyte-db-xxxx -- sh

Then ran pg_resetwal -

# dry-run first
pg_resetwal --dry-run /var/lib/postgresql/data/pgdata

Success!

pg_resetwal /var/lib/postgresql/data/pgdata
Write-ahead log reset

Then removed the temp command in the container, and postgres started up correctly!
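
For reference, one way to drop that temporary override is a JSON patch against the workload; the Deployment name airbyte-db below is an assumption, so adjust it to whatever resource actually manages the db pod -

# remove the temporary command/args override added earlier
kubectl -n airbyte patch deployment airbyte-db --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"},{"op":"remove","path":"/spec/template/spec/containers/0/args"}]'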

堇色安年 2025-01-17 00:32:44

The su command messes with PATH, so the easiest solution is to use gosu to drop from root to postgres: gosu postgres pg_resetxlog /var/lib/postgresql/data. (Note that pg_resetxlog was renamed to pg_resetwal in PostgreSQL 10, so on the 13.x image above the command is pg_resetwal.) Hopefully that works for you!
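
A minimal sketch of that invocation, assuming the same PGDATA layout as the question's container -

# run as root inside the db container; gosu drops to the postgres user
# without su's PATH mangling (pg_resetwal is the PostgreSQL 10+ name)
gosu postgres pg_resetwal /var/lib/postgresql/data/pgdata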

等风也等你 2025-01-17 00:32:44

Unfortunately, my system hit the same error this morning too.

The error has since been resolved and the database is running stably again, with no data loss detected.

Some suggestions for fixing this error:

  1. Back up the data folder to a separate location to avoid losing it (a sketch follows this list).

  2. Use the trick of overriding the entrypoint so the container stays up without postgres automatically starting:

    services:
      database:
        image: "postgres:13.4-buster"
        entrypoint: ["tail", "-f", "/dev/null"]
        ...
    
  3. Access the container and run the following commands:

    > docker exec -it $(docker ps -q -f "name=<container-name>") bash
    > pg_resetwal --dry-run /var/lib/postgresql/data/pgdata
    > pg_resetwal /var/lib/postgresql/data/pgdata

      Write-ahead log reset
  4. If there are no problems, remove the entrypoint override from step 2 and restart the service.
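
A minimal backup sketch for step 1, assuming the compose service from step 2 and the same PGDATA path as above -

# copy the whole data directory out of the container before resetting WAL
docker cp "$(docker ps -q -f "name=<container-name>"):/var/lib/postgresql/data/pgdata" ./pgdata-backup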

Good luck!

我的痛♀有谁懂 2025-01-17 00:32:44

Another thing to consider is checking the PostgreSQL configuration for potential misalignments, like incorrect wal_level or checkpoint_timeout settings. Misconfigurations here can sometimes cause issues during recovery if checkpoints or WAL segments don’t align properly.
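
Once the server is back up, the live values are easy to inspect; the psql connection details below are assumptions, adjust them to your setup -

psql -U postgres -c "SHOW wal_level;" -c "SHOW checkpoint_timeout;"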

It’s also worth verifying that the storage layer (e.g., file system or RAID) isn’t introducing corruption. Silent disk errors can occasionally lead to problems like this.
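
If data checksums were enabled when the cluster was initialized, pg_checksums (PostgreSQL 12+) can scan for silent page corruption while the server is cleanly stopped; the PGDATA path is taken from the question -

pg_checksums --check -D /var/lib/postgresql/data/pgdata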
