I'm writing a .NET 4 application that imports a large amount of data from a file into a PostgreSQL 9.1 database. Profiling shows that the DB calls to actually INSERT the data take up over 90% of the time. The DB server appears to be CPU-bound - using all of one CPU.
If possible, I'd like to import the data faster by using all the CPUs. The input file could be broken up into pieces on the client, so this normally wouldn't be too hard, but I want to ensure that if any errors occur in importing a file then the DB is not modified at all. To accomplish this I'm doing the entire import in one transaction.
Is it possible to somehow send concurrent commands to the DB server (to utilise all of its CPUs), but still ensure that either the entire import succeeds or no changes are made? As far as I understand a transaction cannot be used from multiple threads to run multiple commands concurrently, can it? I'm using Npgsql as the ADO.NET provider, if that makes a difference.
A transaction cannot be processed in parallel by multiple threads with standard PostgreSQL before Postgres 9.6, where this feature was added as "parallel query".
It seems suspicious, though, that your INSERT operation is CPU-bound. A couple of things might be improved here. How exactly do you send the data to the server? There are basically four ways to INSERT data into a table:

- one row at a time, with a VALUES expression providing literals
- a multi-row VALUES expression
- INSERT with SELECT (inserting 0-n rows)
- COPY

COPY is the fastest method by far.
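As a rough sketch of the difference between these methods (table and column names are placeholders, not taken from the question):

```sql
-- Single-row VALUES: one statement (and typically one round trip) per row; slowest.
INSERT INTO tbl (id, val) VALUES (1, 'a');

-- Multi-row VALUES: many rows per statement.
INSERT INTO tbl (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- INSERT ... SELECT: set-based, inserts 0-n rows produced by a query.
INSERT INTO tbl (id, val) SELECT id, val FROM other_tbl;

-- COPY: the bulk-load path; the data stream is fed by the client
-- (psql's \copy or the driver's COPY support).
COPY tbl (id, val) FROM STDIN WITH (FORMAT csv);
```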
It is faster to delete indexes before a huge bulk INSERT/COPY and recreate them afterwards. Incrementally adding index tuples is much less efficient than creating an index at once.

Triggers, constraints or foreign key constraints are other factors that can slow you down. Maybe you could disable / delete them before the bulk load and enable / recreate them afterwards?
There are also a number of settings that can make a substantial difference.
You could turn off fsync and synchronous_commit. (Risky!)
Disable autovacuum temporarily. Run ANALYZE immediately afterwards. (Careful with those!)

Read the article about Bulk Loading and Restores and Tuning Your PostgreSQL Server in the Postgres Wiki, especially the paragraphs on checkpoint_segments and checkpoint_completion_target.
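A hedged sketch of what those knobs look like for a bulk load (target_table is a placeholder; fsync itself is a server-wide postgresql.conf setting and is not shown here):

```sql
-- Relax synchronous_commit for this session only. Risky: a crash can lose the
-- most recent "committed" transactions, but it cannot corrupt the database.
SET synchronous_commit = off;

-- Disable autovacuum for the table being loaded ...
ALTER TABLE target_table SET (autovacuum_enabled = false);

-- ... run the bulk load here ...

-- ... then re-enable it and refresh planner statistics immediately afterwards.
ALTER TABLE target_table RESET (autovacuum_enabled);
ANALYZE target_table;
```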
The operation may not be as CPU-bound as it seems. Have a look at this paragraph in the PostgreSQL Wiki.
One more source of slowdown might be logging. For instance, log_statement = all produces huge log files, at a cost, especially with single-row inserts. Here is a quick method to check all your custom settings in the PostgreSQL Wiki once more.
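The check in question is typically a query against pg_settings, something along these lines (a sketch of that kind of check, not a quote from the Wiki):

```sql
-- List every setting that differs from its built-in default,
-- which quickly surfaces things like log_statement = all.
SELECT name, current_setting(name) AS value, source
FROM   pg_settings
WHERE  source NOT IN ('default', 'override');
```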
One more idea to speed things up, especially if you cannot turn off fsync. Create one or more empty temporary tables like this:
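A minimal sketch, assuming the real target table is called real_tbl (a placeholder name):

```sql
-- Temporary tables are not WAL-logged, so filling them avoids most of the
-- fsync cost. This copies only the column layout of real_tbl: no defaults,
-- sequences, indexes or constraints come along.
CREATE TEMP TABLE tmp_import AS
SELECT * FROM real_tbl LIMIT 0;
```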
Put some thought into how to deal with sequences and other defaults!
INSERT all data into the staging table(s), then write to the target tables in one command. Indexes and constraints off again, but for a much shorter time. Could be substantially faster. Be sure to use enough RAM for the various settings; look at temp_buffers in particular.
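Continuing that sketch (names are still placeholders), the final write into the real table is a single set-based command:

```sql
-- One command moves everything from the staging table into the target table,
-- so indexes and constraints only need special handling for this short window.
INSERT INTO real_tbl
SELECT * FROM tmp_import;
```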
Another plan of action might be this:

In this scenario your current bottleneck (CPU-bound due to input processing) is relieved by paying a higher IO cost.
I'd say you should use prepared transactions. Run as many in parallel as you want, and if they all get to the stage where they can be prepared without error, COMMIT PREPARED them all; otherwise ROLLBACK PREPARED them all.
You first have to set max_prepared_transactions to some value above 0 and restart PostgreSQL. After that, you start a transaction in a single session like so:
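Roughly like this (a sketch; 'import_part_1' is a placeholder transaction identifier):

```sql
-- Requires max_prepared_transactions > 0 in postgresql.conf (restart needed).
BEGIN;

-- ... this session's share of the INSERTs / COPY goes here ...

-- Park the transaction on disk instead of committing it:
PREPARE TRANSACTION 'import_part_1';
```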
at which point you will receive either the notice "PREPARE TRANSACTION" or "ROLLBACK". If you get ROLLBACK from any of your prepared transactions then you can do a rollback prepared on each of your transactions and none go through. Note that you should not leave lots of prepared transactions hanging around.
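The second phase can then be driven from any session once all chunks have reported back (again with placeholder names):

```sql
-- If every chunk reached the prepared state, make them all permanent:
COMMIT PREPARED 'import_part_1';
COMMIT PREPARED 'import_part_2';

-- If any chunk failed before preparing, discard the ones that did prepare:
ROLLBACK PREPARED 'import_part_1';
ROLLBACK PREPARED 'import_part_2';
```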
Notice that pg_restore now uses a multi-threaded mode to benefit from multicore architectures when restoring compressed dumps. So managing a heavy import over several connections is certainly a good idea. I have seen reports of nice gains from setting the number of pg_restore jobs to 2 * the number of cores.
But pg_restore cannot use --single-transaction with this setting, so it has the same problem as you. You may try two-phase-commit transactions with PREPARE TRANSACTION statements; that is usually something done by transaction managers rather than applications, but it could help you invalidate several transactions if one of them fails during your import process.