Does speeding up PostgreSQL writes come at the cost of possibly losing data?

Posted 2024-10-19 13:08:01

I love that PostgreSQL is crash resistant, as I don't want to spend time fixing a database. However, I'm sure there must be some things I can disable/modify so that inserts/updates will work faster even if I lose a couple records prior to a power-outage / crash. I'm not worried about a couple records - just the database as a whole.

I am trying to optimize PostgreSQL for large amounts of writes. It currently takes 22 minutes to insert 1 million rows which seems a bit slow.

How can I speed up PostgreSQL writes?

Some of the options I have looked into (like full_page_writes) seem to also run the risk of corrupting data, which isn't something I want. I don't mind lost data - I just don't want corruption.

Update 1

Here is the table I am using - since most of the tables will contain ints and small strings, this "sample" table seems to be the best example of what I should expect.

CREATE TABLE "user"
(
  id serial NOT NULL,
  username character varying(40),
  email character varying(70),
  website character varying(100),
  created integer,
  CONSTRAINT user_pkey PRIMARY KEY (id)
)
WITH ( OIDS=FALSE );
CREATE INDEX id ON "user" USING btree (id);

I have about 10 scripts each issuing 100,000 requests at a time using prepared statements. This is to simulate a real-life load my application will be giving the database. In my application each page has 1+ inserts.
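As a sketch of the per-row pattern each script issues (a prepared single-row INSERT against the table above; the values are placeholders, not my real data):

PREPARE insert_user (varchar, varchar, varchar, integer) AS
  INSERT INTO "user" (username, email, website, created)
  VALUES ($1, $2, $3, $4);

EXECUTE insert_user('someone', 'someone@example.org', 'http://example.org', 100);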

Update 2

I am using asynchronous commits already, because I have

synchronous_commit = off

in the main configuration file.

Comments (8)

嘿哥们儿 2024-10-26 13:08:01

1M records inserted in 22 minutes works out to 758 records/second. Each INSERT here is an individual commit to disk, with both write-ahead log and database components to it eventually. Normally I'd expect that even with good hardware, a battery-backed cache and everything, you will be lucky to hit 3000 commits/second. So you're not actually doing too badly if this is regular hardware without such write acceleration. The normal limit here is in the 500 to 1000 commits/second range in the situation you're in, without special tuning.

As for what that would look like, if you can't make the commits include more records each, your options for speeding this up include:

  • Turn off synchronous_commit (already done).

  • Increase wal_writer_delay. When synchronous_commit is off, the database spools commits up to be written every 200ms. You can make that some number of seconds instead by tweaking this upwards; it just increases the size of the data loss after a crash.

  • Increase wal_buffers to 16MB, just to make that whole operation more efficient.

  • Increase checkpoint_segments, to cut down on how often the regular data is written to disk. You probably want at least 64 here. Downsides are higher disk space use and longer recovery time after a crash.

  • Increase shared_buffers. The default here is tiny, typically 32MB. You have to increase how much UNIX shared memory the system has to allocate. Once that's done, useful values are typically >1/4 of total RAM, up to 8GB. The rate of gain falls off above 256MB, but the increase from the default to there can be really helpful.

That's pretty much it. Anything else you touched that might help could potentially cause data corruption in a crash; these are all completely safe.
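To put those settings in one place, here is a rough postgresql.conf sketch. The values are illustrative starting points drawn from the list above, not tuned recommendations, and checkpoint_segments only exists through PostgreSQL 9.4 (newer releases use max_wal_size instead):

synchronous_commit = off      # already set; risks bounded data loss on crash, not corruption
wal_writer_delay = 1000ms     # default is 200ms; raising it widens the post-crash loss window
wal_buffers = 16MB            # makes WAL writing more efficient
checkpoint_segments = 64      # fewer checkpoints; costs disk space and crash-recovery time
shared_buffers = 256MB        # default is tiny; gains taper off above roughly 256MB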

菊凝晚露 2024-10-26 13:08:01

22 minutes for 1 million rows doesn't seem that slow, particularly if you have lots of indexes.

How are you doing the inserts? I take it you're using batch inserts, not one-row-per-transaction.

Does PG support some kind of bulk loading, like reading from a text file or supplying a stream of CSV data to it? If so, you'd probably be best advised to use that.

Please post the code you're using to load the 1M records, and people will advise.

Please post:

  • CREATE TABLE statement for the table you're loading into
  • Code you are using to load in
  • small example of the data (if possible)

EDIT: It seems the OP isn't interested in bulk-inserts, but is doing a performance test for many single-row inserts. I will assume that each insert is in its own transaction.

  • Consider batching the inserts on the client-side, per-node, writing them into a temporary file (hopefully durably / robustly) and having a daemon or some periodic process which asynchronously does a batch insert of outstanding records, in reasonable sized batches (see the sketch after this list).
  • This per-device batching mechanism really does give the best performance, in my experience, in audit-data like data-warehouse applications where the data don't need to go into the database just now. It also gives the application resilience against the database being unavailable.
  • Of course you will normally have several endpoint devices creating audit-records (for example, telephone switches, mail relays, web application servers), each must have its own instance of this mechanism which is fully independent.
  • This is a really "clever" optimisation which introduces a lot of complexity into the app design and has a lot of places where bugs could happen. Do not implement it unless you are really sure you need it.
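As a sketch of what the periodic batch-load step could look like (the spool path and the tab-separated layout are illustrative assumptions; a server-side COPY ... FROM needs superuser or equivalent rights, otherwise use psql's \copy from the client):

BEGIN;
COPY "user" (username, email, website, created) FROM '/var/spool/myapp/users.tsv';
COMMIT;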

┊风居住的梦幻卍 2024-10-26 13:08:01

Well, you don't give us much to go on. But it sounds like you're looking for asynchronous commits.

Don't overlook a hardware upgrade--faster hardware usually means a faster database.
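For reference, asynchronous commit can also be enabled per session or per transaction rather than globally; a minimal sketch (the inserted values are placeholders):

SET synchronous_commit TO off;          -- whole session

BEGIN;
SET LOCAL synchronous_commit TO off;    -- just this transaction
INSERT INTO "user" (username, email, website, created)
  VALUES ('someone', 'someone@example.org', 'http://example.org', 100);
COMMIT;                                 -- returns before the WAL is flushed to disk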

吻风 2024-10-26 13:08:01

You should also increase checkpoint_segments (e.g. to 32 or even higher) and most probably wal_buffers as well.

Edit:
If this is a bulk load, you should use COPY to insert the rows. It is much faster than plain INSERTs.

If you need to use INSERT, did you consider using batching (for JDBC) or multi-row inserts?
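For example, a multi-row INSERT against the question's table looks roughly like this (the values are made up); each statement carries many rows per round trip and per commit:

INSERT INTO "user" (username, email, website, created) VALUES
  ('alice', 'alice@example.org', 'http://example.org', 100),
  ('bob',   'bob@example.org',   'http://example.org', 100),
  ('carol', 'carol@example.org', 'http://example.org', 100);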

高冷爸爸 2024-10-26 13:08:01

I think the problem can't be solved by dealing with the server only.

I found PostgreSQL can commit 3000+ rows per second, with neither server nor client busy, yet the time still went by. In contrast, SQL Server can reach 5000+ rows per second, and Oracle is faster still, reaching 12000+ per second, with about 20 fields per row.

I guess the round trips are the problem: send a row to the server, then receive the reply from the server. Both SQL Server and Oracle support batch operations: send more than one row in a function call and wait for the reply.

Many years ago I worked with Oracle: trying to improve write performance using OCI, I read the documentation and found that too many round trips decrease performance. I finally solved it by using batch operations: send 128 or more rows to the server in one batch and wait for the reply. That reached 12000+ rows per second. Without batches, sending every row individually (and waiting each time), it reached only about 2000 rows per second.

晨曦慕雪 2024-10-26 13:08:01

1M commits in 22 minutes seems reasonable, even with synchronous_commit = off, but if you can avoid the need to commit on each insert then you can get a lot faster than that. I just tried inserting 1M (identical) rows into your example table from 10 concurrent writers, using the bulk-insert COPY command:

$ head -n3 users.txt | cat -A # the rest of the file is just this another 99997 times
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
$ wc -l users.txt
100000 users.txt
$ time (seq 10 | xargs --max-procs=10 -n 1 bash -c "cat users.txt | psql insertspeed -c 'COPY \"user\" (username, email, website, created) FROM STDIN WITH (FORMAT text);'")

real    0m10.589s
user    0m0.281s
sys     0m0.285s
$ psql insertspeed -Antc 'SELECT count(*) FROM "user"'
1000000

Clearly there are only 10 commits there, which isn't exactly what you're looking for, but that hopefully gives you some kind of indication of the speed that might be possible by batching your inserts together. This is on a VirtualBox VM running Linux on a fairly bog-standard Windows desktop host, so not exactly the highest-performance hardware possible.

To give some less toy figures, we have a service running in production which has a single thread that streams data to Postgres via a COPY command similar to the above. It ends a batch and commits after a certain number of rows or if the transaction reaches a certain age (whichever comes first). It can sustain 11,000 inserts per second with a max latency of ~300ms by doing ~4 commits per second. If we tightened up the maximum permitted age of the transactions we'd get more commits per second which would reduce the latency but also the throughput. Again, this is not on terribly impressive hardware.

Based on that experience, I'd strongly recommend trying to use COPY rather than INSERT, and trying to reduce the number of commits as far as possible while still achieving your latency target.

灯角 2024-10-26 13:08:01

Well one thing you could do to speed things up is drop the index you are creating manually - the primary key constraint already auto-creates a unique index on that column as you can see below (I'm testing on 8.3):

postgres=> CREATE TABLE "user"
postgres-> (
postgres(>   id serial NOT NULL,
postgres(>   username character varying(40),
postgres(>   email character varying(70),
postgres(>   website character varying(100),
postgres(>   created integer,
postgres(>   CONSTRAINT user_pkey PRIMARY KEY (id)
postgres(> )
postgres-> WITH ( OIDS=FALSE );
NOTICE:  CREATE TABLE will create implicit sequence "user_id_seq" for serial column "user.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "user_pkey" for table "user"
CREATE TABLE
postgres=> CREATE INDEX id ON "user" USING btree (id);
CREATE INDEX
postgres=> \d user
                                  Table "stack.user"
  Column  |          Type          |                     Modifiers
----------+------------------------+---------------------------------------------------
 id       | integer                | not null default nextval('user_id_seq'::regclass)
 username | character varying(40)  |
 email    | character varying(70)  |
 website  | character varying(100) |
 created  | integer                |
Indexes:
    "user_pkey" PRIMARY KEY, btree (id)
    "id" btree (id)

Also, consider changing wal_sync_method to an option that uses O_DIRECT - this is not the default on Linux.
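Dropping the redundant manual index is a one-liner (the implicit user_pkey index stays in place); wal_sync_method itself is set in postgresql.conf, and which methods work is platform-dependent, so treat the value below as an example to verify rather than a recommendation:

DROP INDEX id;

-- in postgresql.conf (not SQL):
-- wal_sync_method = open_datasync   (or open_sync; these can use O_DIRECT on some platforms)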

固执像三岁 2024-10-26 13:08:01

One possibility would be to use the keyword DEFERRABLE to defer constraints, because constraints are checked for every row.

So the idea would be to ask PostgreSQL to check constraints just before you commit.
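A minimal sketch of that idea, assuming PostgreSQL 9.0 or later (where primary key constraints can be declared DEFERRABLE); the unique index is still maintained row by row, only the constraint check itself moves to commit time:

CREATE TABLE "user"
(
  id serial NOT NULL,
  username character varying(40),
  email character varying(70),
  website character varying(100),
  created integer,
  CONSTRAINT user_pkey PRIMARY KEY (id) DEFERRABLE INITIALLY IMMEDIATE
);

BEGIN;
SET CONSTRAINTS user_pkey DEFERRED;
-- many single-row INSERTs here; the primary-key check is settled at COMMIT
COMMIT;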
