Slow simple UPDATE query on a PostgreSQL table with 3 million rows
I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 minutes.
Before that, I tried running VACUUM and ANALYZE on the table, and I also tried creating some indexes (although I doubt they will make any difference in this case), but nothing seems to help.
Any other ideas?
Update:
This is the table structure:
CREATE TABLE myTable
(
  id bigserial NOT NULL,
  title text,
  description text,
  link text,
  "type" character varying(255),
  generalFreq real,
  generalWeight real,
  author_id bigint,
  status_id bigint,
  CONSTRAINT resources_pkey PRIMARY KEY (id),
  CONSTRAINT author_pkey FOREIGN KEY (author_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT c_unique_status_id UNIQUE (status_id)
);
I am trying to run UPDATE myTable SET generalFreq = 0;
10 Answers
The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, for example with a WHERE clause as sketched below (you might also need an index on generalFreq). Then you'll update fewer rows. This won't help if the values are all non-zero already, but updating fewer rows "can help", because otherwise Postgres rewrites every row and all of its index entries regardless of whether the value changed.
Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create another one by just adjusting metadata, which is practically instant.
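A minimal sketch of that idea, using the table from the question (the partial index is optional; use IS DISTINCT FROM 0 instead of <> 0 if NULLs should also be reset):

-- Only touch rows whose value actually changes, so unchanged rows
-- (and their index entries) are not rewritten.
UPDATE myTable
SET    generalFreq = 0
WHERE  generalFreq <> 0;

-- Optional: a partial index so the remaining non-zero rows are found quickly.
CREATE INDEX mytable_generalfreq_nonzero_idx
    ON myTable (generalFreq)
    WHERE generalFreq <> 0;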
In my tests I noticed that one big update of more than 200,000 rows is slower than 2 updates of 100,000 rows, even with a temporary table.
My solution is to loop: in each iteration I create a temporary table of 200,000 rows, compute my values in that table, and then update my main table with the new values, and so on (see the sketch below).
Every 2,000,000 rows I manually run VACUUM ANALYZE on the table; I noticed that autovacuum doesn't keep up with this kind of update.
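A rough sketch of one iteration of that loop, assuming a hypothetical staging table and a fixed id range for the slice (how the new values are computed is up to you):

-- One iteration: stage ~200,000 rows, compute the new values, merge them back.
CREATE TEMPORARY TABLE myTable_batch AS
SELECT id, 0::real AS newFreq              -- compute the new value here
FROM   myTable
WHERE  id BETWEEN 1 AND 200000;            -- the current 200,000-row slice

UPDATE myTable AS m
SET    generalFreq = b.newFreq
FROM   myTable_batch AS b
WHERE  m.id = b.id;

DROP TABLE myTable_batch;

-- Every ~2,000,000 rows, reclaim dead tuples and refresh statistics:
VACUUM ANALYZE myTable;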
I need to update more than 1B+ rows on a PostgreSQL table which contains some indexes. I am working with PostgreSQL 12 + SQLAlchemy + Python.
Inspired by the answers here, I wrote a temp-table and UPDATE...FROM based updater to see if it makes a difference. The temp table is fed from a CSV generated by Python and uploaded over the normal SQL client connection.
The speed-up over the naive approach using SQLAlchemy's bulk_update_mappings is 4x-5x. Not an order of magnitude, but still considerable, and in my case this means 1 day, not 1 week, for a batch job.
The relevant Python code does CREATE TEMPORARY TABLE, COPY FROM and UPDATE FROM. See the full example in this gist.
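The Python itself lives in the linked gist; at the SQL level, the statements that approach issues look roughly like this (table, column and temp-table names here are illustrative):

-- 1. Temp table holding only the key and the columns to be updated.
CREATE TEMPORARY TABLE myTable_upload (
    id          bigint PRIMARY KEY,
    generalFreq real
);

-- 2. Bulk-load the new values from the client-generated CSV.
COPY myTable_upload (id, generalFreq) FROM STDIN WITH (FORMAT csv);

-- 3. Merge the staged values into the real table in one statement.
UPDATE myTable AS t
SET    generalFreq = u.generalFreq
FROM   myTable_upload AS u
WHERE  t.id = u.id;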
Maybe it is a casting issue; try something like the statement below.
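The answer's original statement was not preserved in this copy; presumably the idea is to use a literal that matches the column type, along these lines:

-- Use a floating-point literal matching the real column instead of an
-- integer literal that has to be cast.
UPDATE myTable SET generalFreq = 0.0;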
I have to update tables of 1 or 2 billion rows with various values for each row. Each run makes ~100 million changes (10%).
My first try was to group them in transactions of 300K updates directly on a specific partition, since PostgreSQL does not always optimize prepared queries if you use partitions.
Grouping plain statements of the form "... WHERE myId = id" gives 1,500 updates/sec, which means each run would take at least 18 hours.
Another approach gives 1,600 updates/sec. I use SSDs, so it's a costly improvement, as it doubles the storage size.
Merging with UPDATE...FROM gives 18,000 updates/sec if I do a VACUUM for each partition; 100,000 updates/sec otherwise. Cooool.
Here is the sequence of operations:
Accumulate a bunch of updates in a buffer, depending on available RAM.
When it's filled, or you need to change table/partition, or the run is completed, apply the buffered updates (a sketch of this step follows below).
That means a run now takes 1.5 hours instead of 18 hours for 100 million updates, vacuum included. To save time, it's not necessary to run a VACUUM FULL at the end, but even a fast regular vacuum is useful to keep transaction IDs under control on the database and to avoid unwanted autovacuum during rush hours.
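A rough sketch of the flush step; the partition, key and value column names are illustrative, following the "myId = id" fragment above:

-- Load the buffered changes into a temp table, merge them with a single
-- UPDATE ... FROM against the target partition, then vacuum that partition.
CREATE TEMPORARY TABLE changes (myId bigint PRIMARY KEY, myValue real);

COPY changes FROM STDIN WITH (FORMAT csv);   -- the buffered updates

UPDATE myTable_p42 AS p                      -- the partition being touched
SET    myValue = c.myValue
FROM   changes AS c
WHERE  p.myId = c.myId;

DROP TABLE changes;
VACUUM ANALYZE myTable_p42;                  -- per-partition vacuum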
Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates.
First start with a better FILLFACTOR, do a VACUUM FULL to force a table rewrite, and check the HOT updates after your UPDATE query (a sketch follows below).
HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.
PS: you need version 8.3 or later.
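A minimal sketch of that sequence; the 50% fill factor is only an example value:

-- Leave free space in each page so updated row versions can stay on the
-- same page (HOT updates avoid touching the indexes).
ALTER TABLE myTable SET (fillfactor = 50);

-- Rewrite the table so the new fill factor actually takes effect.
VACUUM FULL myTable;

UPDATE myTable SET generalFreq = 0;

-- Check how many of the updates were HOT (unquoted names fold to lowercase).
SELECT n_tup_upd, n_tup_hot_upd
FROM   pg_stat_user_tables
WHERE  relname = 'mytable';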
After waiting 35 minutes for my UPDATE query to finish (and it still hadn't), I decided to try something different. So what I did was run a single command that builds a new copy of the table with the column already reset (a sketch follows below). Then I added the indexes, dropped the old table, and renamed the new one to take its place. That took only 1.7 minutes to process, plus some extra time to recreate the indexes and constraints. But it did help! :)
Of course that only worked because nobody else was using the database. I would need to lock the table first if this were a production environment.
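The exact command was not preserved in this copy, but it was presumably along these lines, using the columns from the question:

-- Rewrite the data into a fresh table, substituting the constant for the
-- column being reset: one sequential write instead of 3M in-place updates.
CREATE TABLE myTable_new AS
SELECT id, title, description, link, "type",
       0::real AS generalFreq,
       generalWeight, author_id, status_id
FROM   myTable;

-- Then recreate indexes/constraints, drop the old table, and rename.
DROP TABLE myTable;
ALTER TABLE myTable_new RENAME TO myTable;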
Today I've spent many hours on a similar issue. I've found a solution: drop all the constraints/indexes before the update. No matter whether the column being updated is indexed or not, it seems like Postgres updates all the indexes for all the updated rows. After the update is finished, add the constraints/indexes back (see the sketch below).
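As a sketch, using the unique constraint from the question as the example; repeat for any other secondary indexes on the table:

-- Drop secondary indexes/constraints so the UPDATE doesn't maintain them.
ALTER TABLE myTable DROP CONSTRAINT c_unique_status_id;
-- DROP INDEX some_other_index_on_mytable;   -- hypothetical, repeat as needed

UPDATE myTable SET generalFreq = 0;

-- Re-add them afterwards.
ALTER TABLE myTable ADD CONSTRAINT c_unique_status_id UNIQUE (status_id);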
Try this (note that generalFreq starts as type REAL and stays the same); a sketch follows below. This will rewrite the table, similar to a DROP + CREATE, and rebuild all indexes, but all in one command. It is much faster (about 2x) and you don't have to deal with dependencies and recreating indexes and other stuff, though it does lock the table (ACCESS EXCLUSIVE, i.e. a full lock) for the duration. Or maybe that's exactly what you want, if you want everything else to queue up behind it. If you aren't updating "too many" rows, this way is slower than a plain update.
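A sketch of that command, keeping the column's real type while writing 0 into every row:

-- Forces a full table rewrite (and index rebuild) in a single command,
-- setting generalFreq to 0 for every row; takes an ACCESS EXCLUSIVE lock.
ALTER TABLE myTable
    ALTER COLUMN generalFreq TYPE real USING 0;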
How are you running it? If you are looping each row and performing an update statement, you are running potentially millions of individual updates which is why it will perform incredibly slowly.
If you are running a single update statement for all records in one statement it would run a lot faster, and if this process is slow then it's probably down to your hardware more than anything else. 3 million is a lot of records.