Will speeding up PostgreSQL writes come at the cost of possibly losing data?
I love that PostgreSQL is crash resistant, as I don't want to spend time fixing a database. However, I'm sure there must be some things I can disable/modify so that inserts/updates will work faster even if I lose a couple records prior to a power-outage / crash. I'm not worried about a couple records - just the database as a whole.
I am trying to optimize PostgreSQL for large amounts of writes. It currently takes 22 minutes to insert 1 million rows which seems a bit slow.
How can I speed up PostgreSQL writes?
Some of the options I have looked into (like full_page_writes) seem to also run the risk of corrupting data, which isn't something I want. I don't mind lost data - I just don't want corruption.
Update 1
Here is the table I am using - since most of the tables will contain ints and small strings, this "sample" table seems to be the best example of what I should expect.
CREATE TABLE "user"
(
id serial NOT NULL,
username character varying(40),
email character varying(70),
website character varying(100),
created integer,
CONSTRAINT user_pkey PRIMARY KEY (id)
)
WITH ( OIDS=FALSE );
CREATE INDEX id ON "user" USING btree (id);
I have about 10 scripts each issuing 100,000 requests at a time using prepared statements. This is to simulate a real-life load my application will be giving the database. In my application each page has 1+ inserts.
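For illustration, something like the following is roughly what each of those scripts issues per row; the statement name and the values here are made up for this sketch, not taken from the actual test code:

PREPARE add_user (varchar, varchar, varchar, integer) AS
    INSERT INTO "user" (username, email, website, created)
    VALUES ($1, $2, $3, $4);

-- each of the 100,000 requests then looks like:
EXECUTE add_user('alice', 'alice@example.com', 'http://example.com', 1300000000);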
Update 2
I am using asynchronous commits already, because I have
synchronous_commit = off
in the main configuration file.
1M records inserted in 22 minutes works out to 758 records/second. Each INSERT here is an individual commit to disk, eventually with both write-ahead log and database components to it. Normally I'd expect that even on good hardware with a battery-backed cache and everything, you will be lucky to hit 3000 commits/second. So you're not actually doing too badly if this is regular hardware without such write acceleration. The normal limit here is in the 500 to 1000 commits/second range in the situation you're in, without special tuning.
As for what that would look like, if you can't make the commits include more records each, your options for speeding this up include:
- Turn off synchronous_commit (already done).
- Increase wal_writer_delay. When synchronous_commit is off, the database spools commits up to be written every 200ms. You can make that some number of seconds instead if you want by tweaking this upwards; it just increases the size of the data loss after a crash.
- Increase wal_buffers to 16MB, just to make that whole operation more efficient.
- Increase checkpoint_segments, to cut down on how often the regular data is written to disk. You probably want at least 64 here. Downsides are higher disk space use and longer recovery time after a crash.
- Increase shared_buffers. The default here is tiny, typically 32MB. You have to increase how much UNIX shared memory the system has to allocate. Once that's done, useful values are typically >1/4 of total RAM, up to 8GB. The rate of gain falls off above 256MB, but the increase from the default to there can be really helpful.
That's pretty much it. Anything else you touched that might help could potentially cause data corruption in a crash; these are all completely safe.
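As a rough sketch, those suggestions might end up looking like the following postgresql.conf excerpt on an older release (checkpoint_segments no longer exists in 9.5 and later); the values are just the starting points mentioned above, not measured optima:

synchronous_commit = off      # already set in the question
wal_writer_delay = 2000ms     # spool commits for ~2s instead of 200ms; bounds the post-crash data loss
wal_buffers = 16MB
checkpoint_segments = 64      # fewer, larger checkpoints; costs disk space and crash-recovery time
shared_buffers = 256MB        # may require raising the kernel's shared memory limits; needs a restart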
22 minutes for 1 million rows doesn't seem that slow, particularly if you have lots of indexes.
How are you doing the inserts? I take it you're using batch inserts, not one-row-per-transaction.
Does PG support some kind of bulk loading, like reading from a text file or supplying a stream of CSV data to it? If so, you'd probably be best advised to use that.
Please post the code you're using to load the 1M records, and people will advise.
EDIT: It seems the OP isn't interested in bulk inserts, but is doing a performance test of many single-row inserts. I will assume that each insert is in its own transaction.
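PostgreSQL does support this: COPY is its bulk-load path and accepts text or CSV input from a file or from the client. A minimal sketch against the question's table (the file path is invented for this example):

COPY "user" (username, email, website, created)
    FROM '/tmp/users.csv' WITH CSV;

-- or, for a file on the client rather than the server, psql's \copy:
-- \copy "user" (username, email, website, created) from 'users.csv' with csv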
Well, you don't give us much to go on. But it sounds like you're looking for asynchronous commits.
Don't overlook a hardware upgrade--faster hardware usually means a faster database.
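If turning it off globally feels too broad, synchronous_commit can also be disabled just for the sessions or even the single transactions doing the heavy writing; a small sketch:

-- for the current session only:
SET synchronous_commit TO off;

-- or scoped to one transaction:
BEGIN;
SET LOCAL synchronous_commit TO off;
INSERT INTO "user" (username, email) VALUES ('dave', 'dave@example.com');
COMMIT;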
You should also increase checkpoint_segments (e.g. to 32 or even higher) and most probably wal_buffers as well.
Edit: if this is a bulk load, you should use COPY to insert the rows. It is much faster than plain INSERTs.
If you need to use INSERT, did you consider using batching (for JDBC) or multi-row inserts?
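As a sketch, a multi-row insert against the question's table looks like this (the values are invented); each statement carries several rows instead of one:

INSERT INTO "user" (username, email, website, created) VALUES
    ('alice', 'alice@example.com', 'http://example.com', 1300000000),
    ('bob',   'bob@example.org',   'http://example.org', 1300000001),
    ('carol', 'carol@example.net', 'http://example.net', 1300000002);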
I don't think the problem can be solved by tuning the server alone.
I found PostgreSQL can commit 3000+ rows per second while neither the server nor the client was busy, yet the time still went by. In contrast, SQL Server can reach 5000+ rows per second, and Oracle is even faster: it can reach 12000+ per second, with about 20 fields per row.
I suspect the round trips are the problem: send a row to the server, then receive the reply from the server. Both SQL Server and Oracle support batch operations: send more than one row in a single call and wait for the reply.
Many years ago I worked with Oracle: trying to improve write performance over OCI, I read the documentation and found that too many round trips degrade performance. I finally solved it with batch operations: sending 128 or more rows to the server per batch and waiting for the reply reached 12000+ rows per second. Sending every row individually (and waiting each time) reached only about 2000 rows per second.
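A rough way to see that effect from plain SQL is to have a single statement (one round trip) insert a whole batch, here 128 synthetic rows, instead of issuing 128 separate INSERTs; the row contents are made up for this sketch:

BEGIN;
INSERT INTO "user" (username, email, created)
SELECT 'user' || n,
       'user' || n || '@example.com',
       1300000000 + n
FROM generate_series(1, 128) AS n;   -- one statement, one round trip, 128 rows
COMMIT;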
1M commits in 22 minutes seems reasonable, even with synchronous_commit = off, but if you can avoid the need to commit on each insert then you can get a lot faster than that. I just tried inserting 1M (identical) rows into your example table from 10 concurrent writers, using the bulk-insert COPY command. Clearly there are only 10 commits there, which isn't exactly what you're looking for, but that hopefully gives you some kind of indication of the speed that might be possible by batching your inserts together. This is on a VirtualBox VM running Linux on a fairly bog-standard Windows desktop host, so not exactly the highest-performance hardware possible.
To give some less toy figures, we have a service running in production which has a single thread that streams data to Postgres via a COPY command similar to the above. It ends a batch and commits after a certain number of rows or if the transaction reaches a certain age (whichever comes first). It can sustain 11,000 inserts per second with a max latency of ~300ms by doing ~4 commits per second. If we tightened up the maximum permitted age of the transactions we'd get more commits per second, which would reduce the latency but also the throughput. Again, this is not on terribly impressive hardware.
Based on that experience, I'd strongly recommend trying to use COPY rather than INSERT, and trying to reduce the number of commits as far as possible while still achieving your latency target.
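For reference, one of those batches could look roughly like this when fed through psql; the rows below are placeholders (in the test described above, each of the 10 writers streamed its share of the 1M rows this way):

COPY "user" (username, email, website, created) FROM STDIN WITH CSV;
erin,erin@example.com,http://example.com,1300000000
frank,frank@example.org,http://example.org,1300000001
\.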
Well, one thing you could do to speed things up is drop the index you are creating manually - the primary key constraint already auto-creates a unique index on that column (I'm testing on 8.3).
Also, consider changing wal_sync_method to an option that uses O_DIRECT - this is not the default on Linux.
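A sketch of that first suggestion, using the index name from the CREATE INDEX statement in the question:

-- "user_pkey" already provides a unique btree index on id;
-- the manually created "id" index duplicates it and only adds write overhead.
DROP INDEX id;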
One possibility would be to use the keyword DEFERRABLE to defer constraints, because constraints are checked for every row.
So the idea would be to ask PostgreSQL to check the constraints just before you commit.
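A sketch of what that can look like; note that PRIMARY KEY and UNIQUE constraints only accept DEFERRABLE from PostgreSQL 9.0 onwards, and the constraint below is an invented example rather than part of the question's schema:

ALTER TABLE "user"
    ADD CONSTRAINT user_email_key UNIQUE (email) DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS user_email_key DEFERRED;  -- postpone the uniqueness check...
INSERT INTO "user" (username, email) VALUES ('gina', 'gina@example.com');
INSERT INTO "user" (username, email) VALUES ('hank', 'hank@example.com');
COMMIT;                                   -- ...until the commit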