Invalid resource manager ID in primary checkpoint record
I've updated my Airbyte image from 0.35.2-alpha to 0.35.37-alpha. [running in kubernetes]
When the system rolled out, the db pod wouldn't terminate, and I [a terrible mistake] deleted the pod. When it came back up, I got this error -
PostgreSQL Database directory appears to contain a database; Skipping initialization
2022-02-24 20:19:44.065 UTC [1] LOG: starting PostgreSQL 13.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-02-24 20:19:44.065 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-02-24 20:19:44.065 UTC [1] LOG: listening on IPv6 address "::", port 5432
2022-02-24 20:19:44.071 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-24 20:19:44.079 UTC [21] LOG: database system was shut down at 2022-02-24 20:12:55 UTC
2022-02-24 20:19:44.079 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2022-02-24 20:19:44.079 UTC [21] PANIC: could not locate a valid checkpoint record
2022-02-24 20:19:44.530 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2022-02-24 20:19:44.530 UTC [1] LOG: aborting startup due to startup process failure
2022-02-24 20:19:44.566 UTC [1] LOG: database system is shut down
Pretty sure the WAL file is corrupted, but I'm not sure how to fix this.
Comments (4)
Warning - there is a potential for data loss.
This is a test system, so I wasn't concerned with keeping the latest transactions, and had no backup.
First I overrode the container command to keep the container running without trying to start postgres, and spawned a shell on the pod.
Then I ran pg_resetwal. Success!
Finally I removed the temporary command override, and postgres started up correctly!
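The steps above can be sketched with kubectl. The deployment name airbyte-db is an assumption - adjust it to your manifests - and note that pg_resetwal discards unreplayed WAL, so recent transactions can be lost:

```shell
# 1. Override the container command so postgres does not start,
#    keeping the container alive with a no-op process
kubectl patch deployment airbyte-db --type json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/command",
   "value": ["tail", "-f", "/dev/null"]}
]'

# 2. Once the pod has rolled, get a shell in it and reset the WAL
#    as the postgres user (gosu avoids su mangling PATH)
kubectl exec -it deploy/airbyte-db -- \
  gosu postgres pg_resetwal /var/lib/postgresql/data

# 3. Drop the temporary override so postgres starts normally again
kubectl patch deployment airbyte-db --type json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/command"}
]'
```

Each patch triggers a rollout of the deployment, which replaces the pod; that is fine here since the data lives on a persistent volume.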
The su command messes with PATH, so the easiest solution is to just use gosu to drop from root to postgres: gosu postgres pg_resetwal /var/lib/postgresql/data. Hopefully that works for you!
Unfortunately this morning my system also had the same error.
The error was resolved successfully and the database is operating stably again, with no data loss detected.
Some suggestions to fix this error:
Back up the data folder to a separate area to avoid loss.
Use this trick to stop postgres's automatic restart loop:

```yaml
services:
  db:
    image: "postgres:13.4-buster"
    entrypoint: ["tail", "-f", "/dev/null"]
    ...
```

Access the container and run the following commands:
Good luck!
Another thing to consider is checking the PostgreSQL configuration for potential misalignments, like incorrect wal_level or checkpoint_timeout settings. Misconfigurations here can sometimes cause issues during recovery if checkpoints or WAL segments don't align properly. It's also worth verifying that the storage layer (e.g., file system or RAID) isn't introducing corruption. Silent disk errors can occasionally lead to problems like this.
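Once the server is back up, those settings are easy to inspect with psql (the connection parameters here are assumptions):

```shell
# Show the WAL-related settings mentioned above
psql -U postgres -c "SHOW wal_level;"
psql -U postgres -c "SHOW checkpoint_timeout;"

# Check whether data-page checksums were enabled at initdb time;
# if "on", corruption from the storage layer is detected on read
psql -U postgres -c "SHOW data_checksums;"
```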