What's the most efficient way to update about 100 million rows in a PostgreSQL database?
I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
- Get the raw data from the source
- Detect inserts, updates & deletes
- Make updates and ingest into the database
What's the best way of detecting and performing updates?
Some options are:
- Compare incoming data with current database one by one and make single updates. This seems very slow and not feasible.
- Ingest incoming data into a new table, then switch out old table with the new table
- Bulk updates in-place in the current table. Not sure how to do this.
What do you suggest is the best option, or is there a different option out there?
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:

Configure Global Database Configuration Variables Before the Operation
You can additionally disable WAL entirely. Note that these changes will require a database restart to take effect.
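As a sketch of what that can look like (the parameter values below are illustrative assumptions, not recommendations from the guide), durability and checkpoint settings can be relaxed with ALTER SYSTEM before the load:

```sql
-- Assumed, illustrative values only; tune to your hardware.
-- wal_level, archive_mode and max_wal_senders require a server restart.
ALTER SYSTEM SET maintenance_work_mem = '2GB';   -- faster index rebuilds afterwards
ALTER SYSTEM SET max_wal_size = '16GB';          -- fewer checkpoints during the load
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET wal_level = 'minimal';          -- minimizes WAL volume
ALTER SYSTEM SET archive_mode = 'off';           -- required when wal_level = minimal
ALTER SYSTEM SET max_wal_senders = 0;            -- required when wal_level = minimal
ALTER SYSTEM SET autovacuum = 'off';             -- remember to re-enable afterwards
```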
Start a Transaction
You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
Optimize Other Configuration Variables at the Transaction Level
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important one there, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
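A minimal sketch covering these two steps, with assumed values; SET LOCAL only affects the current transaction:

```sql
BEGIN;

-- Assumed example values; size them to your available RAM.
SET LOCAL work_mem = '1GB';
SET LOCAL maintenance_work_mem = '2GB';
```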
CREATE a TEMPORARY table with no constraints
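A sketch of the temporary table, assuming the feed carries a key column id plus two payload columns (col_a, col_b); these names are placeholders for your real schema:

```sql
-- Lives only for this session; no indexes or constraints, so inserts stay cheap.
CREATE TEMPORARY TABLE changes (
    id     bigint,
    col_a  text,
    col_b  bigint
);
```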
Bulk Insert Into changes using COPY FROM
Use the COPY FROM command to bulk insert the raw data into the temporary table.
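For example, assuming the raw data arrives as a CSV file (the path and format options are assumptions):

```sql
-- COPY ... FROM reads a file on the database server;
-- from a client machine, use psql's \copy instead.
COPY changes (id, col_a, col_b)
FROM '/path/to/third_party_dump.csv'
WITH (FORMAT csv, HEADER true);
```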
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as you'll want that for the INSERT.
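The exact objects depend on your schema; as an assumed example with one foreign key, one secondary index and user-defined triggers:

```sql
-- Names below are placeholders for whatever exists on your target table.
ALTER TABLE target DROP CONSTRAINT target_some_fkey;   -- foreign keys
DROP INDEX IF EXISTS target_col_a_idx;                 -- secondary indexes
ALTER TABLE target DISABLE TRIGGER USER;               -- user-defined triggers
-- Keep the primary key: the ON CONFLICT clause below relies on it.
```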
Add a Tracking Column to the target Table
Add a column to the target table to determine whether the row was present in the changes table:
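One way to do this (the column name in_changes is an assumption, not part of the original answer):

```sql
-- Defaults to false; the upsert below flips it to true for every row
-- that appears in the incoming feed.
ALTER TABLE target ADD COLUMN in_changes boolean NOT NULL DEFAULT false;
```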
UPSERT from the changes table into the target table
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This prevents the need to perform two separate operations.
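A sketch of the upsert, reusing the assumed columns from above; the conflict target is the primary key of target:

```sql
INSERT INTO target (id, col_a, col_b, in_changes)
SELECT c.id, c.col_a, c.col_b, true
FROM changes AS c
ON CONFLICT (id) DO UPDATE
SET col_a      = EXCLUDED.col_a,
    col_b      = EXCLUDED.col_b,
    in_changes = true;   -- mark every row that the feed touched
```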
DELETE Rows Not In the changes Table
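Using the assumed tracking column, anything the feed did not touch can now be removed:

```sql
-- Rows never marked by the upsert are no longer present upstream.
DELETE FROM target
WHERE in_changes = false;
```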
DROP Tracking Column and Temporary changes Table
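Cleaning up the assumed helper objects:

```sql
ALTER TABLE target DROP COLUMN in_changes;
DROP TABLE changes;   -- optional: a TEMPORARY table disappears at session end anyway
```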
Add Back Relations You Dropped For Performance
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
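Mirroring the earlier drops (again, the names and columns are placeholders for your real schema):

```sql
ALTER TABLE target ADD CONSTRAINT target_some_fkey
    FOREIGN KEY (col_b) REFERENCES other_table (id);
CREATE INDEX target_col_a_idx ON target (col_a);
ALTER TABLE target ENABLE TRIGGER USER;
```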
Commit Transaction
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table
This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.
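Continuing the assumed example:

```sql
-- VACUUM cannot run inside a transaction block, so issue this after COMMIT.
VACUUM ANALYZE target;
```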
Restore Original Values of Database Configuration Variables
You may need to restart your database again for these settings to take effect.
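If the settings were changed with ALTER SYSTEM as sketched above, they can be reverted the same way (the list simply mirrors the assumed parameters from earlier):

```sql
ALTER SYSTEM RESET maintenance_work_mem;
ALTER SYSTEM RESET max_wal_size;
ALTER SYSTEM RESET checkpoint_timeout;
ALTER SYSTEM RESET wal_level;          -- takes effect only after a restart
ALTER SYSTEM RESET archive_mode;       -- takes effect only after a restart
ALTER SYSTEM RESET max_wal_senders;    -- takes effect only after a restart
ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();               -- applies the reloadable parameters
```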