Efficiently updating a very large PostgreSQL database table

I have a very large database table in PostgreSQL and a column like "replicated". Every new row starts out not replicated and will later be replicated to another system by a background program. There is a partial index on that table: btree(ID) WHERE replicated=0. The background program selects at most 2000 entries (LIMIT 2000), works on them and then commits the changes in one transaction using 2000 prepared SQL commands.
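
For reference, a minimal sketch of the kind of schema and partial index described above; the table and column names (big_table, payload) are assumptions for illustration, not the real schema:

CREATE TABLE big_table (
    id         bigint PRIMARY KEY,
    payload    text,                           -- stand-in for the real data columns
    replicated integer NOT NULL DEFAULT 0      -- 0 = not yet replicated
);

-- partial index so the background program can cheaply find unreplicated rows
CREATE INDEX big_table_unreplicated ON big_table (id) WHERE replicated = 0;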

Now the problem is that I want to give the user an option to reset this replicated value, i.e. make it all zero again.

An update table set replicated=0;

is not possible:

  • It takes a very long time
  • It doubles the size of the table because of MVCC
  • It is done in one transaction: it either fails or goes through as a whole.

I actually don't need transaction semantics for this case: if the system goes down, it is fine if only part of the rows have been processed.

Several other problems:
Doing an

update table set replicated=0 where id > 10000 and id < 20000

is also bad: it does a sequential scan over the whole table, which is too slow. Even if it didn't, it would still be slow because it would do too many seeks.

What I really need is a way of going through all rows, changing them and not being bound to a giant transaction.

Strangely, an

UPDATE table 
  SET replicated=0 
WHERE ID in (SELECT id from table WHERE replicated!=0 LIMIT 10000)

is also slow, although it should be a good thing: go through the table in disk order...

(Note that in that case there was also an index that covered this.)

(An UPDATE ... LIMIT like MySQL's is not available in PostgreSQL.)

BTW: The real problem is more complicated, and we're talking about an embedded system here that is already deployed, so remote schema changes are difficult, but possible.
It's PostgreSQL 7.4, unfortunately.

The number of rows I'm talking about is e.g. 90,000,000. The size of the database can be several dozen gigabytes.

The database itself only contains 5 tables, one of which is very large.
But that is not bad design, because these embedded boxes only operate with one kind of entity; it's not an ERP system or anything like that!

Any ideas?

Comments (6)

染墨丶若流云 2024-07-13 00:11:30

While you likely cannot fix the problem of space usage (it is temporary, just until a vacuum), you can probably really speed up the process in terms of clock time. The fact that PostgreSQL uses MVCC means that you should be able to do this without any issues related to newly inserted rows. A CREATE TABLE AS SELECT would get around some of the performance issues, but would not allow continued use of the table, and takes just as much space. Just drop the index, rebuild it afterwards, then do a vacuum.

drop index replication_flag;
update big_table set replicated=0;
create index replication_flag on big_table using btree(ID) where replicated=0;
vacuum full analyze big_table;
尘世孤行 2024-07-13 00:11:30

How about adding a new table to store this replicated value (plus a primary key to link each record to the main table)? Then you simply add a record for every replicated item, and delete records to clear the replicated flag. (Or maybe the other way around: a record for every non-replicated row, depending on which is the common case.)

That would also simplify the case where you want to set them all back to 0, because you can just truncate the table (which zeroes the table's size on disk; you don't even have to vacuum to free up the space).
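
A rough sketch of what such a side table could look like; the names are assumptions for illustration, not part of the original answer:

-- one row per item that has already been replicated
CREATE TABLE replicated_items (
    id bigint PRIMARY KEY REFERENCES big_table (id)
);

-- marking / unmarking a single row
INSERT INTO replicated_items (id) VALUES (12345);
DELETE FROM replicated_items WHERE id = 12345;

-- resetting everything becomes a near-instant metadata operation
TRUNCATE TABLE replicated_items;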

柠檬 2024-07-13 00:11:30

If you are trying to reset the whole table, not just a few rows, it is usually faster (on extremely large datasets, not on regular tables) to simply CREATE TABLE bar AS SELECT every column except replicated, plus 0 as replicated, FROM foo, and then swap the tables and drop the old one. Obviously you would need to ensure nothing gets inserted into the original table while you are doing that. You'll need to recreate that index, too.
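
A minimal sketch of that table swap, reusing the assumed column names from above (id, payload, replicated) with foo/bar as placeholders; note that CREATE TABLE AS copies neither constraints nor indexes, so those have to be rebuilt:

-- copy everything, forcing replicated back to 0
CREATE TABLE bar AS
    SELECT id, payload, 0 AS replicated FROM foo;

-- swap the tables and rebuild what CREATE TABLE AS does not copy
DROP TABLE foo;
ALTER TABLE bar RENAME TO foo;
CREATE UNIQUE INDEX foo_pkey ON foo (id);                         -- stands in for the primary key
CREATE INDEX foo_unreplicated ON foo (id) WHERE replicated = 0;   -- the partial index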

Edit: A simple improvement in order to avoid locking the table while you copy 14 Gigabytes:

lock;
create a new table, bar;
swap tables so that all writes go to bar;
unlock;
create table baz as select * from foo (with replicated reset to 0);
drop foo;
create the index on baz;
lock;
insert into baz the rows from bar;
swap tables;
unlock;
drop bar;

(let writes happen while you are doing the copy, and insert them post-factum).

初相遇 2024-07-13 00:11:30

This is pseudocode. You'll need a 400 MB (for ints) or 800 MB (for bigints) temporary file (you can compress it with zlib if that is a problem). It will need about 100 scans of the table for vacuums, but it will not bloat the table by more than about 1% (at most 1,000,000 dead rows at any time). You can also trade fewer scans for more table bloat.

// write all ids to temporary file in disk order                
// no where clause will ensure disk order
$file = tmpfile();
for $id, $replicated in query("select id, replicated from table") {
        if ( $replicated<>0 ) {
                write($file,&$id,sizeof($id));
        }
}

// prepare an update query
query("prepare set_replicated_0(bigint) as
        update table set replicated=0 where id=?");

// reread this file, launch prepared query and every 1000000 updates commit
// and vacuum a table
rewind($file);
$counter = 0;
query("start transaction");
while read($file,&$id,sizeof($id)) {
        query("execute set_replicated_0($id)");
        $counter++;
        if ( $counter % 1000000 == 0 ) {
                query("commit");
                query("vacuum table");
                query("start transaction");
        }
}
query("commit");
query("vacuum table");
close($file);
独留℉清风醉 2024-07-13 00:11:30

I guess what you need to do is:
a. copy the PK values of 2000 records into a temporary table, using the same standard LIMIT, etc.;
b. select the same 2000 records and perform the necessary operations in the cursor as before;
c. if successful, run a single update query against the records in the temp table, then clear the temp table and run step a again;
d. if unsuccessful, clear the temp table without running the update query.
Simple, efficient and reliable (a rough sketch of one batch is given below).
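
For illustration only (this is not the answerer's code), one iteration of that select-into-temp-table-then-update pattern could look roughly like this, written here for the reset-to-zero case and reusing the assumed names from above:

-- a. grab a batch of ids that still need resetting
CREATE TEMP TABLE reset_batch AS
    SELECT id FROM big_table WHERE replicated <> 0 LIMIT 2000;

-- c. reset just that batch, committed on its own
UPDATE big_table SET replicated = 0
    WHERE id IN (SELECT id FROM reset_batch);

-- clear the temp table and repeat until the SELECT returns no rows
DROP TABLE reset_batch;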
Regards,
KT

神妖 2024-07-13 00:11:30

I think it's better to upgrade your Postgres to version 8.x; the low Postgres version is probably part of the cause. Also try the query below (an UPDATE ... FROM join). I hope this can help.

UPDATE table1 SET name = table2.value
FROM table2 
WHERE table1.id = table2.id;