How do I delete a fixed number of rows with sorting in PostgreSQL?

I'm trying to port some old MySQL queries to PostgreSQL, but I'm having trouble with this one:

DELETE FROM logtable ORDER BY timestamp LIMIT 10;

PostgreSQL doesn't allow ordering or limits in its delete syntax, and the table doesn't have a primary key, so I can't use a subquery. Additionally, I want to preserve the behavior where the query deletes exactly the given number of records -- for example, if the table contains 30 rows but they all have the same timestamp, I still want to delete 10, although it doesn't matter which 10.

So: how do I delete a fixed number of rows with sorting in PostgreSQL?

Edit: No primary key means there's no log_id column or similar. Ah, the joys of legacy systems!

Comments (7)

怀里藏娇 2024-10-27 03:06:27

You could try using the ctid:
demo at db<>fiddle

DELETE FROM ONLY logtable
WHERE ctid IN (
    SELECT ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
);

The ctid is:

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.

In case there are tables that inherit from logtable, use DELETE FROM ONLY to prevent this from removing up to 10 rows from each descendant that happens to host a row with the same ctid. The field is only unique per table, and by default a regular DELETE cascades down the hierarchy.
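
A minimal sketch of that failure mode, assuming a hypothetical child table named logtable_archive (not from the question):

CREATE TABLE logtable_archive () INHERITS (logtable);

-- A ctid is only unique within one table, so without ONLY the outer
-- DELETE also scans logtable_archive and can remove up to 10 of its
-- rows whose ctids happen to collide with the selected ones.
DELETE FROM ONLY logtable
WHERE ctid IN (
    SELECT ctid
    FROM ONLY logtable  -- restrict the subquery to the parent as well
    ORDER BY timestamp
    LIMIT 10
);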

If logtable is partitioned, add tableoid to stop this from wiping up to 10 rows per partition:

DELETE FROM logtable
WHERE (tableoid,ctid) IN (
    SELECT tableoid,ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
);

In that case DELETE FROM ONLY wouldn't remove anything, because the main table is an empty relation that only redirects to specific partitions.

半﹌身腐败 2024-10-27 03:06:27

The Postgres docs recommend using an array instead of IN with a subquery. This should work much faster:

DELETE FROM logtable 
WHERE id = any (array(SELECT id FROM logtable ORDER BY timestamp LIMIT 10));

This and some other tricks can be found here
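
Since the question's logtable has no id column, here is a sketch of the same array trick keyed on ctid instead (my adaptation, assuming a plain non-inherited, non-partitioned table):

DELETE FROM logtable
WHERE ctid = ANY (ARRAY(
    SELECT ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
));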

苍风燃霜 2024-10-27 03:06:27
delete from logtable where log_id in (
    select log_id from logtable order by timestamp limit 10);
橘和柠 2024-10-27 03:06:27

Level 1

Don't even bother with a LIMIT with DELETE.

What are you worried about? That you'll lock up the table? Nope, this isn't MySQL. As long as a row that gets deleted isn't used in some other transaction, you'll not block anything else.

Level 2

OK OK, but sometimes you need to guard potentially long-running tasks that might run in parallel, e.g. cron jobs where the DELETE takes longer than its scheduled period. For that, the solution is:

BEGIN;
  LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE;
  DELETE FROM blah_blah WHERE some_condition;
COMMIT;

and call it a day. If a job takes too long, the next one will simply fail (add NOWAIT to the LOCK statement if you want it to error out immediately rather than queue), but no harm is done. You won't run into deadlock issues, as you might with some of the pseudo-LIMIT solutions elsewhere.

Level 3

OK OK, but sometimes you really, really only want to delete a few because you're curious what might happen, or because you only want to delete the top X of something and return those values. But definitely not because you want to put something in a cron job...:

  DELETE FROM blah_blah WHERE primary_key IN ( 
    SELECT primary_key FROM blah_blah WHERE some_condition
    LIMIT 10 
  ) RETURNING * ;

If you put this in a cron job or automated task, please combine Level 2 and Level 3, as sketched below. Or deadlock-gremlins will find you and torture you.
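
A sketch of that combination, reusing the placeholder names (blah_blah, primary_key, some_condition) from the examples above:

BEGIN;
  -- The Level-2 lock makes concurrent runs queue up instead of
  -- interleaving row locks and deadlocking against each other.
  LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE;

  DELETE FROM blah_blah WHERE primary_key IN (
    SELECT primary_key FROM blah_blah WHERE some_condition
    LIMIT 10
  ) RETURNING *;
COMMIT;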

Level 4

Right, but you're a serious DBA / sysadmin and you know better, right? You really, really need to play "nice" and not hog system resources. For instance, if you're doing point-in-time recovery (PITR), a big delete operation like this will hammer the WAL system. Or maybe (as is increasingly the case these days) you actually get charged for CPU time, and you want to keep the performance curve as smooth as possible.

We can use PL/PGSQL to create a loop to semi-automate the job for us.
We'll call the procedure batch.

To use it, let's imagine you've got a table node with a primary key id and with a timestamp-field named timestamp. You would delete the rows like this:

  WITH deleted AS ( 
    DELETE FROM node WHERE id IN ( 
      SELECT id FROM node WHERE timestamp < now() - interval '1 days' 
      LIMIT 100 
    ) RETURNING id 
   ) SELECT count(*) FROM deleted;

This will delete 100 records at a time, provided those records are older than 1 day. The count deleted will be "returned" for reasons that are important as you will see. (BTW, I'm not sorting the IDs here, as some suggest you should do, because it's inefficient, and locking the table in the surrounding block makes more sense.) ....

To invoke this batch procedure with the above query, you'll do it like this:

CALL batch( 300, 100, 
  'WITH deleted AS ( 
    DELETE FROM node WHERE id IN ( 
      SELECT id FROM node WHERE timestamp < now() - interval ''1 days'' 
      ORDER BY id 
      LIMIT @@ 
    ) RETURNING id 
   ) SELECT count(*) FROM deleted' );

This will run the query repeatedly in batches of 100 (second parameter) for up to 300 seconds (first parameter), or until no more records are deleted. Note the use of @@ instead of the number! This allows you to parameterize the limit so that you don't have to modify the query each time. (This will come in handy in Level 5).

Please note: the code replaces the string LIMIT @@, including the leading space. Make sure it's not a newline or tab or something.

create or replace procedure batch( max_time double precision, chunk_size int, query varchar )
language plpgsql
as $$
declare 
  final_query     varchar;
  loop_time       double precision;
  target_end      timestamp;
  loop_timestamp  timestamp;
  last_timestamp  timestamp:=now();
  result          int;
  i               int:=0;
begin
  target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
  final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || chunk_size::varchar);
  LOOP
    i:=i+1;
    EXECUTE final_query INTO result;
    loop_timestamp = clock_timestamp();
    loop_time := extract(epoch from ( loop_timestamp - last_timestamp) )::DOUBLE PRECISION;
    last_timestamp := loop_timestamp;
    RAISE info  'loop %: count= %  loop_time= %  records/s= %', i, result, loop_time, to_char(result / loop_time, 'FM99999999.00');
    EXIT WHEN result = 0 OR loop_timestamp > target_end ;
  END LOOP;
end
$$;

Level 5

That was cool and all, but now it's time to bring it to the next and final level: let the procedure auto-scale the limit according to performance. Instead of specifying the limit-size, specify the time you want each batch to run.

Let's call this the autobatch procedure:

create or replace procedure autobatch( max_time double precision, chunk_time double precision, query varchar )
language plpgsql
as $$
declare 
  max_scaleup   double precision:=2;
  max_scaledown double precision:=0.1;
  next_limit    int;
  loop_time     double precision;
  factor        double precision;
  result        int;
  target_end    timestamp;
  loop_timestamp  timestamp;
  last_loop_timestamp timestamp:=now();
  final_query varchar;
  i int:=0;
begin
  target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
  next_limit := max_time ;
  LOOP
    i:=i+1;

    final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || next_limit::varchar);

    EXECUTE final_query INTO result;

    loop_timestamp = clock_timestamp();
    loop_time := extract(epoch from ( loop_timestamp - last_loop_timestamp) )::DOUBLE PRECISION;
    factor := GREATEST( max_scaledown, LEAST( max_scaleup, chunk_time / loop_time ));
    next_limit := next_limit * factor;
    last_loop_timestamp := loop_timestamp;
    RAISE info  'loop %: count = %  next_limit = %  loop_time = %', i, result, next_limit, loop_time;
    EXIT WHEN result = 0 OR loop_timestamp > target_end ;
  END LOOP;
end
$$;

Sample output:

call autobatch( 1200, 30, 'with deleted as (delete from node where id in ( select id from node where timestamp < now() - interval ''1 days'' LIMIT @@ ) returning id) select count(*) from deleted' );
INFO:  loop 1: count = 1200  next_limit = 2400  loop_time = 3.210725
INFO:  loop 2: count = 2400  next_limit = 4800  loop_time = 6.350344
INFO:  loop 3: count = 4800  next_limit = 9600  loop_time = 12.373832
INFO:  loop 4: count = 9600  next_limit = 12261  loop_time = 23.489029
INFO:  loop 5: count = 12261  next_limit = 12338  loop_time = 29.812288
INFO:  loop 6: count = 12338  next_limit = 12675  loop_time = 29.202712
INFO:  loop 7: count = 12675  next_limit = 13360  loop_time = 28.461273

See how it adjusts the limit-size to maximize the batch-time? If it goes over, it scales back down. Scaling up is capped at 2x, while scaling down is capped at 1/10th. That's mainly to stabilize behavior when queries result in cache hits.

If you don't have a primary key, you can use the row-wise WHERE ... IN syntax with a composite key:

delete from table1 where (schema,id,lac,cid) in (select schema,id,lac,cid from table1 where lac = 0 limit 1000);

This worked for me.

烟沫凡尘 2024-10-27 03:06:27

Assuming you want to delete ANY 10 records (without the ordering), you could do this:

DELETE FROM logtable AS t1 WHERE t1.ctid < (SELECT t2.ctid FROM logtable AS t2 WHERE (SELECT count(*) FROM logtable t3 WHERE t3.ctid < t2.ctid) = 10 LIMIT 1);

For my use case, deleting 10M records, this turned out to be faster.

旧时模样 2024-10-27 03:06:27

You could write a procedure that loops over the delete for individual rows; it could take a parameter specifying the number of rows to delete, as sketched below. But that's a bit of overkill compared to MySQL.
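
A minimal sketch of such a procedure, keyed on ctid since logtable has no primary key (the procedure name delete_oldest and its parameter are illustrative):

CREATE OR REPLACE PROCEDURE delete_oldest(n int)
LANGUAGE plpgsql
AS $$
DECLARE
  i int := 0;
BEGIN
  WHILE i < n LOOP
    -- Delete the single oldest row; ctid pins down exactly one row
    -- even when many rows share the same timestamp.
    DELETE FROM logtable
    WHERE ctid = (SELECT ctid FROM logtable ORDER BY timestamp LIMIT 1);
    EXIT WHEN NOT FOUND;  -- table ran out of rows early
    i := i + 1;
  END LOOP;
END;
$$;

-- CALL delete_oldest(10);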
