What's the most efficient way to update ~100 million rows in a PostgreSQL database?

Posted 2025-02-07 21:43:46


I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:

  1. Get the raw data from the source
  2. Detect inserts, updates & deletes
  3. Make updates and ingest into the database

What's the best way of detecting and performing updates?
Some options are:

  1. Compare incoming data with current database one by one and make single updates. This seems very slow and not feasible.
  2. Ingest incoming data into a new table, then switch out old table with the new table
  3. Bulk updates in-place in the current table. Not sure how to do this.

Which of these do you suggest as the best option, or is there a different option out there?


1 Comment

就是爱搞怪 2025-02-14 21:43:46


Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:

Configure Global Database Configuration Variables Before the Operation

Increase max_wal_size so that checkpoints are triggered less often during the load:

ALTER SYSTEM SET max_wal_size = <size>;

You can additionally cut WAL generation to a minimum by lowering the WAL level and disabling archiving and WAL senders (not an option if you rely on replication or point-in-time recovery):

ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;

Note that these changes will require a database restart to take effect.
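
Before changing anything, it may be worth recording the current values so they can be restored at the end. One way to do that is a quick query against pg_settings:

SELECT name, setting
FROM pg_settings
WHERE name IN ('max_wal_size', 'wal_level', 'archive_mode', 'max_wal_senders');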

Start a Transaction

You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
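
As a minimal sketch of the overall shape (the individual steps are filled in below):

BEGIN;
-- SET LOCAL settings, CREATE TEMPORARY TABLE, COPY, upsert and DELETE all go here
COMMIT;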

Optimize Other Configuration Variables at the Transaction level

SET LOCAL maintenance_work_mem = <size>
...

You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important one there, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
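
For illustration only (the sizes here are assumptions; tune them to the memory actually available on your server):

SET LOCAL maintenance_work_mem = '2GB';
SET LOCAL work_mem = '256MB';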

CREATE a TEMPORARY table with no constraints.

CREATE TEMPORARY TABLE changes(
  id bigint,
  data text
) ON COMMIT DROP; -- ensures this table will be dropped at the end of the transaction

Bulk Insert Into changes using COPY FROM

Use the COPY FROM command to bulk-insert the raw data into the temporary table.

COPY changes(id,data) FROM .. 
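
As an illustration, assuming the feed arrives as a server-side CSV file (the path and format options below are made up; use psql's \copy instead if the file sits on the client):

COPY changes(id, data) FROM '/data/feed/latest.csv' WITH (FORMAT csv, HEADER true);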

DROP Relations That Can Slow Processing

On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop the PRIMARY KEY, since the ON CONFLICT clause in the upsert below relies on it.
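
The exact statements depend on your schema; with hypothetical object names they might look like:

DROP INDEX IF EXISTS target_data_idx;                        -- hypothetical secondary index
ALTER TABLE target DROP CONSTRAINT IF EXISTS target_some_fk; -- hypothetical foreign key
DROP TRIGGER IF EXISTS target_audit_trg ON target;           -- hypothetical trigger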

Add a Tracking Column to target Table

Add a column to the target table to record whether each row was present in the changes table:

ALTER TABLE target ADD COLUMN seen boolean;

UPSERT from the changes table into the target table:

UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This avoids the need to perform two separate operations (an UPDATE plus an INSERT).

INSERT INTO target(id,data,seen) 
  SELECT 
    id,
    data,
    true
  FROM
    changes
  ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, seen = true;

DELETE Rows Not In changes Table

DELETE FROM target WHERE seen IS NOT TRUE;

DROP Tracking Column and Temporary changes Table

DROP TABLE changes;
ALTER TABLE target DROP COLUMN seen;

Add Back Relations You Dropped For Performance

Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
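
Again with hypothetical names, mirroring whatever was dropped earlier:

CREATE INDEX target_data_idx ON target (data);
-- ALTER TABLE target ADD CONSTRAINT target_some_fk FOREIGN KEY (...) REFERENCES ...; -- if one was dropped
CREATE TRIGGER target_audit_trg
  AFTER INSERT OR UPDATE ON target
  FOR EACH ROW EXECUTE FUNCTION target_audit();              -- assumes the trigger function already exists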

Commit Transaction

The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.

Run VACUUM ANALYZE on the target Table.

This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.

SET maintenance_work_mem = <size>;
VACUUM ANALYZE target;
SET maintenance_work_mem = <original size>;

Restore Original Values of Database Configuration Variables

ALTER SYSTEM SET max_wal_size = <size>;
...

You may need to restart your database again for these settings to take effect.
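
If you did not record the original values, ALTER SYSTEM RESET restores each parameter to whatever postgresql.conf specifies:

ALTER SYSTEM RESET max_wal_size;
ALTER SYSTEM RESET wal_level;
ALTER SYSTEM RESET archive_mode;
ALTER SYSTEM RESET max_wal_senders;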
