在 postgres 的 pg_dump 中限制 I/O?
因此,我们在具有 16GB RAM 的机器上拥有一个 32GB 的生产数据库。由于缓存,这通常根本不是问题。但是每当我启动数据库的 pg_dump 时,来自应用程序服务器的查询就会开始排队,几分钟后队列就会消失,我们的应用程序就会停止运行。
我将第一个承认我们存在查询性能问题,并且我们正在解决这些问题。同时,我希望能够每晚运行 pg_dump,以一种从数据库中获取数据并且不会导致我们的应用程序崩溃的方式。我不在乎是否需要几个小时。我们的应用程序不运行任何 DDL,因此我不担心锁争用。
为了解决这个问题,我正在使用 Nice 和 ionice 运行 pg_dump。不幸的是,这并不能解决问题。
nice ionice -c2 -n7 pg_dump -Fc production_db -f production_db.sql
即使使用 ionice 我仍然看到上面的问题。看来 i/o 等待和大量搜索导致了该问题。
vmstat 1
显示 iowait 徘徊在 20-25% 左右,有时会飙升至 40%。实际 CPU % 在 2-5% 之间波动,有时会飙升至 70%。
我不认为锁是可能的罪魁祸首。当我运行此查询时:
select pg_class.relname,pg_locks.* from pg_class,pg_locks where pg_class.relfilenode=pg_locks.relation;
我只看到标记为 grant = 't' 的锁。我们通常不会在生产中运行任何 DDL——因此锁似乎不是问题。
以下是启用了 WCHAN 列的 ps 的输出:
PID WIDE S TTY TIME COMMAND
3901 sync_page D ? 00:00:50 postgres: [local] COPY
3916 - S ? 00:00:01 postgres: SELECT
3918 sync_page D ? 00:00:07 postgres: INSERT
3919 semtimedop S ? 00:00:04 postgres: SELECT
3922 - S ? 00:00:01 postgres: SELECT
3923 - S ? 00:00:01 postgres: SELECT
3924 - S ? 00:00:00 postgres: SELECT
3927 - S ? 00:00:06 postgres: SELECT
3928 - S ? 00:00:06 postgres: SELECT
3929 - S ? 00:00:00 postgres: SELECT
3930 - S ? 00:00:00 postgres: SELECT
3931 - S ? 00:00:00 postgres: SELECT
3933 - S ? 00:00:00 postgres: SELECT
3934 - S ? 00:00:02 postgres: SELECT
3935 semtimedop S ? 00:00:13 postgres: UPDATE waiting
3936 - R ? 00:00:12 postgres: SELECT
3937 - S ? 00:00:01 postgres: SELECT
3938 sync_page D ? 00:00:07 postgres: SELECT
3940 - S ? 00:00:07 postgres: SELECT
3943 semtimedop S ? 00:00:04 postgres: UPDATE waiting
3944 - S ? 00:00:05 postgres: SELECT
3948 sync_page D ? 00:00:05 postgres: SELECT
3950 sync_page D ? 00:00:03 postgres: SELECT
3952 sync_page D ? 00:00:15 postgres: SELECT
3964 log_wait_commit D ? 00:00:04 postgres: COMMIT
3965 - S ? 00:00:03 postgres: SELECT
3966 - S ? 00:00:02 postgres: SELECT
3967 sync_page D ? 00:00:01 postgres: SELECT
3970 - S ? 00:00:00 postgres: SELECT
3971 - S ? 00:00:01 postgres: SELECT
3974 sync_page D ? 00:00:00 postgres: SELECT
3975 - S ? 00:00:00 postgres: UPDATE
3977 - S ? 00:00:00 postgres: INSERT
3978 semtimedop S ? 00:00:00 postgres: UPDATE waiting
3981 semtimedop S ? 00:00:01 postgres: SELECT
3982 - S ? 00:00:00 postgres: SELECT
3983 semtimedop S ? 00:00:02 postgres: UPDATE waiting
3984 - S ? 00:00:04 postgres: SELECT
3986 sync_buffer D ? 00:00:00 postgres: SELECT
3988 - R ? 00:00:01 postgres: SELECT
3989 - S ? 00:00:00 postgres: SELECT
3990 - R ? 00:00:00 postgres: SELECT
3992 - R ? 00:00:01 postgres: SELECT
3993 sync_page D ? 00:00:01 postgres: SELECT
3994 sync_page D ? 00:00:00 postgres: SELECT
So we have a production database that is 32GB on a machine with 16GB of RAM. Thanks to caching this is usually not a problem at all. But whenever I start a pg_dump of the database, queries from the app servers start queueing up, and after a few minutes the queue runs away and our app grinds to a halt.
I'll be the first to acknowledge that we have query performance issues, and we're addressing those. Meanwhile, I want to be able to run pg_dump nightly, in a way that sips from the database and doesn't take our app down. I don't care if it takes hours. Our app doesn't run any DDL, so I'm not worried about lock contention.
Attempting to fix the problem, I'm running pg_dump with both nice and ionice. Unfortunately, this doesn't address the issue.
nice ionice -c2 -n7 pg_dump -Fc production_db -f production_db.sql
Even with ionice I still see the issue above. It appears that i/o wait and lots of seeks are causing the problem.
vmstat 1
Shows me that iowait hovers around 20-25% and spikes to 40% sometimes. Real CPU % fluctuates between 2-5% and spikes to 70% sometimes.
I don't believe locks are a possible culprit. When I run this query:
select pg_class.relname,pg_locks.* from pg_class,pg_locks where pg_class.relfilenode=pg_locks.relation;
I only see locks which are marked granted = 't'. We don't typically run any DDL in production -- so locks don't seem to be the issue.
Here is output from a ps with the WCHAN column enabled:
PID WIDE S TTY TIME COMMAND
3901 sync_page D ? 00:00:50 postgres: [local] COPY
3916 - S ? 00:00:01 postgres: SELECT
3918 sync_page D ? 00:00:07 postgres: INSERT
3919 semtimedop S ? 00:00:04 postgres: SELECT
3922 - S ? 00:00:01 postgres: SELECT
3923 - S ? 00:00:01 postgres: SELECT
3924 - S ? 00:00:00 postgres: SELECT
3927 - S ? 00:00:06 postgres: SELECT
3928 - S ? 00:00:06 postgres: SELECT
3929 - S ? 00:00:00 postgres: SELECT
3930 - S ? 00:00:00 postgres: SELECT
3931 - S ? 00:00:00 postgres: SELECT
3933 - S ? 00:00:00 postgres: SELECT
3934 - S ? 00:00:02 postgres: SELECT
3935 semtimedop S ? 00:00:13 postgres: UPDATE waiting
3936 - R ? 00:00:12 postgres: SELECT
3937 - S ? 00:00:01 postgres: SELECT
3938 sync_page D ? 00:00:07 postgres: SELECT
3940 - S ? 00:00:07 postgres: SELECT
3943 semtimedop S ? 00:00:04 postgres: UPDATE waiting
3944 - S ? 00:00:05 postgres: SELECT
3948 sync_page D ? 00:00:05 postgres: SELECT
3950 sync_page D ? 00:00:03 postgres: SELECT
3952 sync_page D ? 00:00:15 postgres: SELECT
3964 log_wait_commit D ? 00:00:04 postgres: COMMIT
3965 - S ? 00:00:03 postgres: SELECT
3966 - S ? 00:00:02 postgres: SELECT
3967 sync_page D ? 00:00:01 postgres: SELECT
3970 - S ? 00:00:00 postgres: SELECT
3971 - S ? 00:00:01 postgres: SELECT
3974 sync_page D ? 00:00:00 postgres: SELECT
3975 - S ? 00:00:00 postgres: UPDATE
3977 - S ? 00:00:00 postgres: INSERT
3978 semtimedop S ? 00:00:00 postgres: UPDATE waiting
3981 semtimedop S ? 00:00:01 postgres: SELECT
3982 - S ? 00:00:00 postgres: SELECT
3983 semtimedop S ? 00:00:02 postgres: UPDATE waiting
3984 - S ? 00:00:04 postgres: SELECT
3986 sync_buffer D ? 00:00:00 postgres: SELECT
3988 - R ? 00:00:01 postgres: SELECT
3989 - S ? 00:00:00 postgres: SELECT
3990 - R ? 00:00:00 postgres: SELECT
3992 - R ? 00:00:01 postgres: SELECT
3993 sync_page D ? 00:00:01 postgres: SELECT
3994 sync_page D ? 00:00:00 postgres: SELECT
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以使用 pv 来限制 pg_dump。
更改您的备份过程。使用例如:
<前> psql -c 'pg_start_backup()'
rsync --checksum --archive /var/lib/pgsql /backups/pgsql
psql -c 'pg_stop_backup()'
但请注意,您还需要设置连续归档为此,备份期间创建的所有 WAL 文件都会与数据文件备份一起保存。
您可以设置复制数据库(例如使用日志传送 )在额外的廉价磁盘上,而不是备份生产数据库备份副本。即使它会落后于一些交易,它最终也会赶上。但在开始备份之前请检查副本是否是最新的。
You can throttle pg_dump using pv.
Change your backup procedure. Use for example:
But take care that you also need to have continuous archiving set up for this to work and all WAL files created during backup stashed along the data files backup.
You can setup a replicated database (using for example log shipping) on additional cheap disk and instead of backing up production database backup a replica. Even it will fall behind some transactions it will eventually catch up. But check if replica is reasonably up to date before starting backup.
您的 PS 输出有多个处于“等待”状态的 UPDATE 语句,这仍然对我说锁定(您的锁定测试查询除外)。我很确定您不会在 PS 输出中看到“等待”。您能否检查一下此查询在问题期间是否显示任何内容:(
您没有说您正在运行什么版本的 PostgreSQL,所以我不确定这是否有效。)
如果有任何内容(也就是说,等待= TRUE),那么这是一个锁/事务问题。
Your PS output has multiple UPDATE statements in "waiting" state, which still says locks to me (your locks test query aside). I'm pretty sure you wouldn't see "waiting" in the PS output otherwise. Can you check to see if this query shows anything during the issue:
(You didn't say what version of PostgreSQL you are running so I'm not sure if this will work.)
If there's anything in there (that is, with waiting = TRUE), then it's a lock/transaction problem.