I'm writing a .NET 4 application that imports a large amount of data from a file into a PostgreSQL 9.1 database. Profiling shows that the DB calls to actually INSERT the data take up over 90% of the time. The DB server appears to be CPU-bound - using all of one CPU.
If possible, I'd like to import the data faster by using all the CPUs. The input file could be broken up into pieces on the client, so this normally wouldn't be too hard, but I want to ensure that if any errors occur in importing a file then the DB is not modified at all. To accomplish this I'm doing the entire import in one transaction.
Is it possible to somehow send concurrent commands to the DB server (to utilise all of its CPUs), but still ensure that either the entire import succeeds or no changes are made? As far as I understand a transaction cannot be used from multiple threads to run multiple commands concurrently, can it? I'm using Npgsql as the ADO.NET provider, if that makes a difference.
A transaction cannot be processed in parallel by multiple threads with standard PostgreSQL before Postgres 9.6, where this feature was added as "parallel query".
It seems suspicious, though, that your INSERT operation is CPU-bound. A couple of things might be improved here. How exactly do you send the data to the server? There are basically four ways to INSERT data into a table:

- one row at a time, with a VALUES expression providing literals
- a multi-row VALUES expression
- INSERT with SELECT (inserting 0-n rows)
- COPY

COPY is the fastest method by far.
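As a rough sketch of the difference between these methods (table and column names are placeholders, not taken from the question):

```sql
-- Single-row VALUES: one statement (and typically one round trip) per row; slowest.
INSERT INTO tbl (id, val) VALUES (1, 'a');

-- Multi-row VALUES: many rows per statement.
INSERT INTO tbl (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- INSERT ... SELECT: set-based, inserts 0-n rows produced by a query.
INSERT INTO tbl (id, val) SELECT id, val FROM other_tbl;

-- COPY: the bulk-load path; the data stream is fed by the client
-- (psql's \copy or the driver's COPY support).
COPY tbl (id, val) FROM STDIN WITH (FORMAT csv);
```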
It is faster to delete indexes before a huge bulk INSERT/COPY and recreate them afterwards. Incrementally adding index tuples is much less efficient than creating an index at once.

Triggers, constraints or foreign key constraints are other factors that can slow you down. Maybe you could disable / delete them before the bulk load and enable / recreate them afterwards?
There are also a number of settings that can make a substantial difference.
You could turn off fsync and synchronous_commit. (Risky!)
Disable autovacuum temporarily. Run ANALYZE immediately afterwards. (Careful with those!)

Read the article about Bulk Loading and Restores and Tuning Your PostgreSQL Server in the Postgres Wiki, especially the paragraphs on checkpoint_segments and checkpoint_completion_target.
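A hedged sketch of what those knobs look like for a bulk load (target_table is a placeholder; fsync itself is a server-wide postgresql.conf setting and is not shown here):

```sql
-- Relax synchronous_commit for this session only. Risky: a crash can lose the
-- most recent "committed" transactions, but it cannot corrupt the database.
SET synchronous_commit = off;

-- Disable autovacuum for the table being loaded ...
ALTER TABLE target_table SET (autovacuum_enabled = false);

-- ... run the bulk load here ...

-- ... then re-enable it and refresh planner statistics immediately afterwards.
ALTER TABLE target_table RESET (autovacuum_enabled);
ANALYZE target_table;
```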
The operation may not be as CPU-bound as it seems. Have a look at this paragraph in the PostgreSQL Wiki.
One more source of slowdown might be logging. For instance, log_statement = all produces huge log files, at a cost, especially with single-row inserts. Here is a quick method to check all your custom settings in the PostgreSQL Wiki once more.
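The check in question is typically a query against pg_settings, something along these lines (a sketch of that kind of check, not a quote from the Wiki):

```sql
-- List every setting that differs from its built-in default,
-- which quickly surfaces things like log_statement = all.
SELECT name, current_setting(name) AS value, source
FROM   pg_settings
WHERE  source NOT IN ('default', 'override');
```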
One more idea to speed things up, especially if you cannot turn off fsync. Create one or more empty temporary tables like this:
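A minimal sketch, assuming the real target table is called real_tbl (a placeholder name):

```sql
-- Temporary tables are not WAL-logged, so filling them avoids most of the
-- fsync cost. This copies only the column layout of real_tbl: no defaults,
-- sequences, indexes or constraints come along.
CREATE TEMP TABLE tmp_import AS
SELECT * FROM real_tbl LIMIT 0;
```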
Put some thought into how to deal with sequences and other defaults!
INSERT all data into the staging table(s), then write to the target tables in one command. Indexes and constraints off again, but for a much shorter time. Could be substantially faster. Be sure to use enough RAM for the various settings; look at temp_buffers in particular.
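Continuing that sketch (names are still placeholders), the final write into the real table is a single set-based command:

```sql
-- One command moves everything from the staging table into the target table,
-- so indexes and constraints only need special handling for this short window.
INSERT INTO real_tbl
SELECT * FROM tmp_import;
```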
Another plan of action might be this:

In this scenario your current bottleneck (CPU-bound due to input processing) is relieved by paying a higher IO cost.
I'd say you should use prepared transactions. Run as many in parallel as you want, and if they all get to the stage where they can be prepared without error, COMMIT PREPARED them all; otherwise ROLLBACK PREPARED them all.
You first have to set max_prepared_transactions to some value above 0 and restart PostgreSQL. After that, you start a transaction in a single session like so:
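Roughly like this (a sketch; 'import_part_1' is a placeholder transaction identifier):

```sql
-- Requires max_prepared_transactions > 0 in postgresql.conf (restart needed).
BEGIN;

-- ... this session's share of the INSERTs / COPY goes here ...

-- Park the transaction on disk instead of committing it:
PREPARE TRANSACTION 'import_part_1';
```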
at which point you will receive either the notice "PREPARE TRANSACTION" or "ROLLBACK". If you get ROLLBACK from any of your prepared transactions then you can do a rollback prepared on each of your transactions and none go through. Note that you should not leave lots of prepared transactions hanging around.
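The second phase can then be driven from any session once all chunks have reported back (again with placeholder names):

```sql
-- If every chunk reached the prepared state, make them all permanent:
COMMIT PREPARED 'import_part_1';
COMMIT PREPARED 'import_part_2';

-- If any chunk failed before preparing, discard the ones that did prepare:
ROLLBACK PREPARED 'import_part_1';
ROLLBACK PREPARED 'import_part_2';
```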
Notice that pg_restore now uses a multi-threaded mode to benefit from multicore architectures when restoring compressed dumps. So managing a heavy import over several connections is certainly a good idea. I have seen reports of nice gains from setting the number of pg_restore jobs to 2 * the number of cores.
But pg_restore cannot use --single-transaction with this setting, so it has the same problem as you. You may try two-phase-commit transactions with PREPARE TRANSACTION statements; that is usually something done by transaction managers rather than applications, but it could help you invalidate several transactions if one of them fails during your import process.