What's the most efficient way to update about 100 million rows in a PostgreSQL database?
I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
- Get the raw data from the source
- Detect inserts, updates & deletes
- Make updates and ingest into the database
What's the best way of detecting and performing updates?
Some options are:
- Compare incoming data with current database one by one and make single updates. This seems very slow and not feasible.
- Ingest incoming data into a new table, then switch out old table with the new table
- Bulk updates in-place in the current table. Not sure how to do this.
What do you suggest is the best option, or is there a different option out there?
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:

Configure Global Database Configuration Variables Before the Operation
You can additionally disable WAL entirely. Note that these changes will require a database restart to take effect.
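As a sketch of what that can look like (the parameter values below are illustrative assumptions, not recommendations from the guide), durability and checkpoint settings can be relaxed with ALTER SYSTEM before the load:

```sql
-- Assumed, illustrative values only; tune to your hardware.
-- wal_level, archive_mode and max_wal_senders require a server restart.
ALTER SYSTEM SET maintenance_work_mem = '2GB';   -- faster index rebuilds afterwards
ALTER SYSTEM SET max_wal_size = '16GB';          -- fewer checkpoints during the load
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET wal_level = 'minimal';          -- minimizes WAL volume
ALTER SYSTEM SET archive_mode = 'off';           -- required when wal_level = minimal
ALTER SYSTEM SET max_wal_senders = 0;            -- required when wal_level = minimal
ALTER SYSTEM SET autovacuum = 'off';             -- remember to re-enable afterwards
```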
Start a Transaction
You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
Optimize Other Configuration Variables at the Transaction Level
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important one there, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
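A minimal sketch covering these two steps, with assumed values; SET LOCAL only affects the current transaction:

```sql
BEGIN;

-- Assumed example values; size them to your available RAM.
SET LOCAL work_mem = '1GB';
SET LOCAL maintenance_work_mem = '2GB';
```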
CREATE a TEMPORARY table with no constraints
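A sketch of the temporary table, assuming the feed carries a key column id plus two payload columns (col_a, col_b); these names are placeholders for your real schema:

```sql
-- Lives only for this session; no indexes or constraints, so inserts stay cheap.
CREATE TEMPORARY TABLE changes (
    id     bigint,
    col_a  text,
    col_b  bigint
);
```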
Bulk Insert Into changes using COPY FROM
Use the COPY FROM command to bulk insert the raw data into the temporary table.
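For example, assuming the raw data arrives as a CSV file (the path and format options are assumptions):

```sql
-- COPY ... FROM reads a file on the database server;
-- from a client machine, use psql's \copy instead.
COPY changes (id, col_a, col_b)
FROM '/path/to/third_party_dump.csv'
WITH (FORMAT csv, HEADER true);
```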
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as you'll want that for the INSERT.
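The exact objects depend on your schema; as an assumed example with one foreign key, one secondary index and user-defined triggers:

```sql
-- Names below are placeholders for whatever exists on your target table.
ALTER TABLE target DROP CONSTRAINT target_some_fkey;   -- foreign keys
DROP INDEX IF EXISTS target_col_a_idx;                 -- secondary indexes
ALTER TABLE target DISABLE TRIGGER USER;               -- user-defined triggers
-- Keep the primary key: the ON CONFLICT clause below relies on it.
```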
Add a Tracking Column to the target Table
Add a column to the target table to determine whether the row was present in the changes table:
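One way to do this (the column name in_changes is an assumption, not part of the original answer):

```sql
-- Defaults to false; the upsert below flips it to true for every row
-- that appears in the incoming feed.
ALTER TABLE target ADD COLUMN in_changes boolean NOT NULL DEFAULT false;
```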
UPSERT from the changes table into the target table
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This prevents the need to perform two separate operations.
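A sketch of the upsert, reusing the assumed columns from above; the conflict target is the primary key of target:

```sql
INSERT INTO target (id, col_a, col_b, in_changes)
SELECT c.id, c.col_a, c.col_b, true
FROM changes AS c
ON CONFLICT (id) DO UPDATE
SET col_a      = EXCLUDED.col_a,
    col_b      = EXCLUDED.col_b,
    in_changes = true;   -- mark every row that the feed touched
```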
DELETE Rows Not In the changes Table
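Using the assumed tracking column, anything the feed did not touch can now be removed:

```sql
-- Rows never marked by the upsert are no longer present upstream.
DELETE FROM target
WHERE in_changes = false;
```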
DROP Tracking Column and Temporary changes Table
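Cleaning up the assumed helper objects:

```sql
ALTER TABLE target DROP COLUMN in_changes;
DROP TABLE changes;   -- optional: a TEMPORARY table disappears at session end anyway
```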
Add Back Relations You Dropped For Performance
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
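Mirroring the earlier drops (again, the names and columns are placeholders for your real schema):

```sql
ALTER TABLE target ADD CONSTRAINT target_some_fkey
    FOREIGN KEY (col_b) REFERENCES other_table (id);
CREATE INDEX target_col_a_idx ON target (col_a);
ALTER TABLE target ENABLE TRIGGER USER;
```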
Commit Transaction
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table
This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.
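Continuing the assumed example:

```sql
-- VACUUM cannot run inside a transaction block, so issue this after COMMIT.
VACUUM ANALYZE target;
```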
Restore Original Values of Database Configuration Variables
You may need to restart your database again for these settings to take effect.
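If the settings were changed with ALTER SYSTEM as sketched above, they can be reverted the same way (the list simply mirrors the assumed parameters from earlier):

```sql
ALTER SYSTEM RESET maintenance_work_mem;
ALTER SYSTEM RESET max_wal_size;
ALTER SYSTEM RESET checkpoint_timeout;
ALTER SYSTEM RESET wal_level;          -- takes effect only after a restart
ALTER SYSTEM RESET archive_mode;       -- takes effect only after a restart
ALTER SYSTEM RESET max_wal_senders;    -- takes effect only after a restart
ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();               -- applies the reloadable parameters
```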