How do I do large non-blocking updates in PostgreSQL?

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:

UPDATE orders SET status = null;

To avoid being diverted to an off-topic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.

The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like

UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);

might take 1 minute. Over 35 million rows, doing the above in 35 chunks of 1 million rows each would take only 35 minutes, saving me 4 hours and 25 minutes.

I could break it down even further with a script (using pseudocode here):

for (i = 0 to 3500) {
  db_operation("UPDATE orders SET status = null" +
               " WHERE order_id >= " + (i*10000) +
               " AND order_id < "  + ((i+1)*10000));
}

This operation might complete in only a few minutes, rather than 35.

So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?

9 Answers

情栀口红 2024-08-02 11:45:28

Column / Row

... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.

Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.

Index

To avoid being diverted to an off-topic discussion, let's assume that
all the values of status for the 35 million rows are currently set
to the same (non-null) value, thus rendering an index useless.

When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.

Performance

For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:

UPDATE orders SET status = null;

I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:

ALTER TABLE orders DROP column status
                 , ADD  column status text;

The manual (up to Postgres 10):

When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]

The manual (since Postgres 11):

When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.

Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]

And:

The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.

Make sure you don't have objects depending on the column (foreign key constraints, indexes, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. It requires an exclusive lock on the table, which may be a problem for heavy concurrent load. (Like Buurman emphasizes in his comment.) Barring that, the operation is a matter of milliseconds.

If you have a column default you want to keep, add it back in a separate command; doing it in the same command applies it to all rows immediately.

To actually apply the default, consider doing it in batches.

General solution

Update: Since Postgres 11 you can use a PROCEDURE for the purpose, where you can include transaction control statements such as COMMIT. A minimal sketch follows.
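
A minimal sketch of that approach, assuming the orders table from the question; the procedure name and the 100000-row batch size are arbitrary choices, not from the original answer:

CREATE OR REPLACE PROCEDURE p_update_in_steps()
LANGUAGE plpgsql AS
$proc$
DECLARE
   _step int := 100000;  -- arbitrary batch size
   _cur  int;            -- current ID (starting with minimum)
   _max  int;            -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;

   WHILE _cur <= _max LOOP
      UPDATE orders
      SET    status = null
      WHERE  order_id >= _cur
      AND    order_id <  _cur + _step;

      COMMIT;                          -- ends one transaction, releases locks
      _cur := _cur + _step;
   END LOOP;
END
$proc$;

Call:

CALL p_update_in_steps();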


dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.

This allows running a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.

If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
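
For illustration, a generic sketch of confining an error with a savepoint inside a single transaction (the table and ranges just reuse the question's example):

BEGIN;
UPDATE orders SET status = null WHERE order_id <  1000000;
SAVEPOINT part_one_done;
UPDATE orders SET status = null WHERE order_id >= 1000000 AND order_id < 2000000;
-- if the second step fails: ROLLBACK TO SAVEPOINT part_one_done;
-- the first step's work survives and can still be committed
COMMIT;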

Disclaimer

First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.

If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.

Step-by-step instructions

The additional module dblink needs to be installed first:
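
Typically that is a single command, since dblink ships with the standard contrib modules (it requires appropriate privileges):

CREATE EXTENSION IF NOT EXISTS dblink;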

Setting up the connection with dblink depends very much on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer covers how to connect with dblink in more detail.

Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.

CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;   -- size of step
   _cur  int;   -- current ID (starting with minimum)
   _max  int;   -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                        -- 100 slices (steps) hard coded
   _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                        -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
       $$UPDATE public.orders
         SET    status = 'foo'
         WHERE  order_id >= $$ || _cur || $$
         AND    order_id <  $$ || _cur + _step || $$
         AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

      _cur := _cur + _step;

      EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$  LANGUAGE plpgsql;

Call:

SELECT f_update_in_steps();

You can parameterize any part according to your needs: the table name, column name, value, ... Just be sure to sanitize identifiers to avoid SQL injection; a sketch follows.
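
For building such dynamic statements, Postgres' format() handles the quoting: %I escapes an identifier, %L a literal. A hedged variant of the dblink_exec call above, where _tbl, _col and _val are illustrative parameter names:

   PERFORM dblink_exec(format(
      'UPDATE %I SET %I = %L WHERE order_id >= %s AND order_id < %s AND %I IS DISTINCT FROM %L'
    , _tbl, _col, _val, _cur, _cur + _step, _col, _val));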

Avoid empty UPDATEs: the status IS DISTINCT FROM predicate above skips rows that already hold the target value, so no new row version is written for them.

栖迟 2024-08-02 11:45:28

You should delegate this column to another table like this:

create table order_status (
  order_id int not null references orders(order_id) primary key,
  status int not null
);

Then your operation of setting status=NULL will be instant:

truncate order_status;
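
With that layout, reads that still need a status per order can fetch it with an outer join; orders with no row in order_status simply read as NULL. A minimal sketch:

SELECT o.*, os.status
FROM   orders o
LEFT   JOIN order_status os USING (order_id);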
大海や 2024-08-02 11:45:28

Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.

So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).

挥剑断情 2024-08-02 11:45:28

First of all - are you sure that you need to update all rows?

Perhaps some of the rows already have status NULL?

If so, then:

UPDATE orders SET status = null WHERE status is not null;

As for partitioning the change - that's not possible in pure SQL; all the updates run in a single transaction.

One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.

Usually just adding proper where solves the problem. If it doesn't - just partition it manually. Writing a script is too much - you can usually make it in a simple one-liner:

perl -e '
    for (my $i = 0; $i <= 3500000; $i += 1000) {
        printf "UPDATE orders SET status = null WHERE status is not null
                and order_id between %u and %u;\n",
        $i, $i+999
    }
'

I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:

perl -e '...' | psql -U ... -d ...

Or first to file and then to psql (in case you'd need the file later on):

perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
北凤男飞 2024-08-02 11:45:28

I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.

A simple WHERE status IS NOT NULL might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.

However, you can make loops within the query via the LOOP statement. I'll just cook up a small example:

CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
    i INTEGER := 0;
BEGIN
    FOR i IN 0..(count/1000 + 1) LOOP
        UPDATE orders SET status = null
        WHERE  order_id >= i*1000 AND order_id < (i+1)*1000;
        RAISE NOTICE 'Count: % and i: %', count, i;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;

It can then be run by doing something akin to:

SELECT nullstatus(35000000);

You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
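
One of the tricks from that article: read the planner's estimate from the system catalog instead of counting, which is near-instant but only approximate:

SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'orders';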

Also, the RAISE NOTICE part is just there to keep track on how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.

小梨窩很甜 2024-08-02 11:45:28

Are you sure this is because of locking? I don't think so, and there are many other possible reasons. To find out, you can always try to do just the locking. Try this:

BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;

To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
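
Note that EXPLAIN ANALYZE actually executes the statement, so for an UPDATE you may want to wrap it in a transaction you roll back, along these lines:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders SET status = null WHERE order_id < 1000000;
ROLLBACK;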

Also, tail the PostgreSQL log to see if any performance-related problems occur.

暖风昔人 2024-08-02 11:45:28

I would use CTAS:

begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
无人问我粥可暖 2024-08-02 11:45:28

Some options that haven't been mentioned:

Use the new table trick. Probably what you'd have to do in your case is write some triggers to handle it, so that changes to the original table also get propagated to your table copy (percona is an example of something that does it the trigger way). Another option might be the "create a new column, then replace the old one with it" trick, to avoid locks (unclear if it helps with speed).

Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query, like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... Then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
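
As a sketch of staying entirely in SQL/psql: generate_series() can emit one UPDATE per slice, and psql's \gexec (9.6+) runs each result row as a statement; under autocommit each statement commits, and thus releases its locks, separately. The 10000-row slice size is arbitrary:

SELECT format('UPDATE orders SET status = NULL WHERE order_id >= %s AND order_id < %s'
            , g, g + 10000)
FROM   generate_series(0, 34990000, 10000) g \gexec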

千寻… 2024-08-02 11:45:28

PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.

An explanation is provided in this blog post.
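
For instance, on Postgres 11+ the following is a metadata-only change regardless of table size (archived is a hypothetical column for illustration):

ALTER TABLE orders ADD COLUMN archived boolean NOT NULL DEFAULT false;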
