Does speeding up PostgreSQL writes come at the cost of possibly losing data?

Posted 2024-10-19 13:08:01

I love that PostgreSQL is crash resistant, as I don't want to spend time fixing a database. However, I'm sure there must be some things I can disable/modify so that inserts/updates will work faster even if I lose a couple records prior to a power-outage / crash. I'm not worried about a couple records - just the database as a whole.

I am trying to optimize PostgreSQL for large amounts of writes. It currently takes 22 minutes to insert 1 million rows which seems a bit slow.

How can I speed up PostgreSQL writes?

Some of the options I have looked into (like full_page_writes) seem to also run the risk of corrupting data, which isn't something I want. I don't mind lost data - I just don't want corruption.

Update 1

Here is the table I am using - since most of the tables will contain ints and small strings, this "sample" table seems to be the best example of what I should expect.

CREATE TABLE "user"
(
  id serial NOT NULL,
  username character varying(40),
  email character varying(70),
  website character varying(100),
  created integer,
  CONSTRAINT user_pkey PRIMARY KEY (id)
)
WITH ( OIDS=FALSE );
CREATE INDEX id ON "user" USING btree (id);

I have about 10 scripts each issuing 100,000 requests at a time using prepared statements. This is to simulate a real-life load my application will be giving the database. In my application each page has 1+ inserts.
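As a sketch of the per-row pattern each script issues (a prepared single-row INSERT against the table above; the values are placeholders, not my real data):

PREPARE insert_user (varchar, varchar, varchar, integer) AS
  INSERT INTO "user" (username, email, website, created)
  VALUES ($1, $2, $3, $4);

EXECUTE insert_user('someone', 'someone@example.org', 'http://example.org', 100);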

Update 2

I am using asynchronous commits already, because I have

synchronous_commit = off

in the main configuration file.

Comments (8)

嘿哥们儿 2024-10-26 13:08:01

1M records inserted in 22 minutes works out to 758 records/second. Each INSERT here is an individual commit to disk, with both write-ahead log and database components to it eventually. Normally I'd expect that even with good hardware, a battery-backed cache and everything, you will be lucky to hit 3000 commits/second. So you're not actually doing too badly if this is regular hardware without such write acceleration. The normal limit here is in the 500 to 1000 commits/second range in the situation you're in, without special tuning.

As for what that would look like, if you can't make the commits include more records each, your options for speeding this up include:

  • Turn off synchronous_commit (already done).

  • Increase wal_writer_delay. When synchronous_commit is off, the database spools commits up to be written every 200ms. You can make that some number of seconds instead by tweaking this upwards; it just increases the size of the data loss after a crash.

  • Increase wal_buffers to 16MB, just to make that whole operation more efficient.

  • Increase checkpoint_segments, to cut down on how often the regular data is written to disk. You probably want at least 64 here. Downsides are higher disk space use and longer recovery time after a crash.

  • Increase shared_buffers. The default here is tiny, typically 32MB. You have to increase how much UNIX shared memory the system has to allocate. Once that's done, useful values are typically >1/4 of total RAM, up to 8GB. The rate of gain falls off above 256MB, but the increase from the default to there can be really helpful.

That's pretty much it. Anything else you touched that might help could potentially cause data corruption in a crash; these are all completely safe.
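To put those settings in one place, here is a rough postgresql.conf sketch. The values are illustrative starting points drawn from the list above, not tuned recommendations, and checkpoint_segments only exists through PostgreSQL 9.4 (newer releases use max_wal_size instead):

synchronous_commit = off      # already set; risks bounded data loss on crash, not corruption
wal_writer_delay = 1000ms     # default is 200ms; raising it widens the post-crash loss window
wal_buffers = 16MB            # makes WAL writing more efficient
checkpoint_segments = 64      # fewer checkpoints; costs disk space and crash-recovery time
shared_buffers = 256MB        # default is tiny; gains taper off above roughly 256MB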

菊凝晚露 2024-10-26 13:08:01

22 minutes for 1 million rows doesn't seem that slow, particularly if you have lots of indexes.

How are you doing the inserts? I take it you're using batch inserts, not one-row-per-transaction.

Does PG support some kind of bulk loading, like reading from a text file or supplying a stream of CSV data to it? If so, you'd probably be best advised to use that.

Please post the code you're using to load the 1M records, and people will advise.

Please post:

  • CREATE TABLE statement for the table you're loading into
  • Code you are using to load in
  • small example of the data (if possible)

EDIT: It seems the OP isn't interested in bulk-inserts, but is doing a performance test for many single-row inserts. I will assume that each insert is in its own transaction.

  • Consider batching the inserts on the client-side, per-node, writing them into a temporary file (hopefully durably / robustly) and having a daemon or some periodic process which asynchronously does a batch insert of outstanding records, in reasonable sized batches (see the sketch after this list).
  • This per-device batching mechanism really does give the best performance, in my experience, in audit-data like data-warehouse applications where the data don't need to go into the database just now. It also gives the application resilience against the database being unavailable.
  • Of course you will normally have several endpoint devices creating audit-records (for example, telephone switches, mail relays, web application servers), each must have its own instance of this mechanism which is fully independent.
  • This is a really "clever" optimisation which introduces a lot of complexity into the app design and has a lot of places where bugs could happen. Do not implement it unless you are really sure you need it.
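As a sketch of what the periodic batch-load step could look like (the spool path and the tab-separated layout are illustrative assumptions; a server-side COPY ... FROM needs superuser or equivalent rights, otherwise use psql's \copy from the client):

BEGIN;
COPY "user" (username, email, website, created) FROM '/var/spool/myapp/users.tsv';
COMMIT;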

┊风居住的梦幻卍 2024-10-26 13:08:01

Well, you don't give us much to go on. But it sounds like you're looking for asynchronous commits.

Don't overlook a hardware upgrade--faster hardware usually means a faster database.
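For reference, asynchronous commit can also be enabled per session or per transaction rather than globally; a minimal sketch (the inserted values are placeholders):

SET synchronous_commit TO off;          -- whole session

BEGIN;
SET LOCAL synchronous_commit TO off;    -- just this transaction
INSERT INTO "user" (username, email, website, created)
  VALUES ('someone', 'someone@example.org', 'http://example.org', 100);
COMMIT;                                 -- returns before the WAL is flushed to disk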

吻风 2024-10-26 13:08:01

You should also increase checkpoint_segments (e.g. to 32 or even higher) and most probably wal_buffers as well.

Edit:
If this is a bulk load, you should use COPY to insert the rows. It is much faster than plain INSERTs.

If you need to use INSERT, did you consider using batching (for JDBC) or multi-row inserts?
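For example, a multi-row INSERT against the question's table looks roughly like this (the values are made up); each statement carries many rows per round trip and per commit:

INSERT INTO "user" (username, email, website, created) VALUES
  ('alice', 'alice@example.org', 'http://example.org', 100),
  ('bob',   'bob@example.org',   'http://example.org', 100),
  ('carol', 'carol@example.org', 'http://example.org', 100);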

高冷爸爸 2024-10-26 13:08:01

I think the problem can't be solved by dealing with the server only.

I found PostgreSQL can commit 3000+ rows per second, with neither server nor client busy, yet the time still went by. In contrast, SQL Server can reach 5000+ rows per second, and Oracle is faster still, reaching 12000+ per second, with about 20 fields per row.

I guess the round trips are the problem: send a row to the server, then receive the reply from the server. Both SQL Server and Oracle support batch operations: send more than one row in a function call and wait for the reply.

Many years ago I worked with Oracle: trying to improve write performance using OCI, I read the documentation and found that too many round trips decrease performance. I finally solved it by using batch operations: send 128 or more rows to the server in one batch and wait for the reply. That reached 12000+ rows per second. Without batches, sending every row individually (and waiting each time), it reached only about 2000 rows per second.

晨曦慕雪 2024-10-26 13:08:01

1M commits in 22 minutes seems reasonable, even with synchronous_commit = off, but if you can avoid the need to commit on each insert then you can get a lot faster than that. I just tried inserting 1M (identical) rows into your example table from 10 concurrent writers, using the bulk-insert COPY command:

$ head -n3 users.txt | cat -A # the rest of the file is just this another 99997 times
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
Random J. User^[email protected]^Ihttp://example.org^I100$
$ wc -l users.txt
100000 users.txt
$ time (seq 10 | xargs --max-procs=10 -n 1 bash -c "cat users.txt | psql insertspeed -c 'COPY \"user\" (username, email, website, created) FROM STDIN WITH (FORMAT text);'")

real    0m10.589s
user    0m0.281s
sys     0m0.285s
$ psql insertspeed -Antc 'SELECT count(*) FROM "user"'
1000000

Clearly there are only 10 commits there, which isn't exactly what you're looking for, but that hopefully gives you some kind of indication of the speed that might be possible by batching your inserts together. This is on a VirtualBox VM running Linux on a fairly bog-standard Windows desktop host, so not exactly the highest-performance hardware possible.

To give some less toy figures, we have a service running in production which has a single thread that streams data to Postgres via a COPY command similar to the above. It ends a batch and commits after a certain number of rows or if the transaction reaches a certain age (whichever comes first). It can sustain 11,000 inserts per second with a max latency of ~300ms by doing ~4 commits per second. If we tightened up the maximum permitted age of the transactions we'd get more commits per second which would reduce the latency but also the throughput. Again, this is not on terribly impressive hardware.

Based on that experience, I'd strongly recommend trying to use COPY rather than INSERT, and trying to reduce the number of commits as far as possible while still achieving your latency target.

灯角 2024-10-26 13:08:01

Well one thing you could do to speed things up is drop the index you are creating manually - the primary key constraint already auto-creates a unique index on that column as you can see below (I'm testing on 8.3):

postgres=> CREATE TABLE "user"
postgres-> (
postgres(>   id serial NOT NULL,
postgres(>   username character varying(40),
postgres(>   email character varying(70),
postgres(>   website character varying(100),
postgres(>   created integer,
postgres(>   CONSTRAINT user_pkey PRIMARY KEY (id)
postgres(> )
postgres-> WITH ( OIDS=FALSE );
NOTICE:  CREATE TABLE will create implicit sequence "user_id_seq" for serial column "user.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "user_pkey" for table "user"
CREATE TABLE
postgres=> CREATE INDEX id ON "user" USING btree (id);
CREATE INDEX
postgres=> \d user
                                  Table "stack.user"
  Column  |          Type          |                     Modifiers
----------+------------------------+---------------------------------------------------
 id       | integer                | not null default nextval('user_id_seq'::regclass)
 username | character varying(40)  |
 email    | character varying(70)  |
 website  | character varying(100) |
 created  | integer                |
Indexes:
    "user_pkey" PRIMARY KEY, btree (id)
    "id" btree (id)

Also, consider changing wal_sync_method to an option that uses O_DIRECT - this is not the default on Linux.
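Dropping the redundant manual index is a one-liner (the implicit user_pkey index stays in place); wal_sync_method itself is set in postgresql.conf, and which methods work is platform-dependent, so treat the value below as an example to verify rather than a recommendation:

DROP INDEX id;

-- in postgresql.conf (not SQL):
-- wal_sync_method = open_datasync   (or open_sync; these can use O_DIRECT on some platforms)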

固执像三岁 2024-10-26 13:08:01

One possibility would be to use the keyword DEFERRABLE to defer constraints, because constraints are checked for every row.

So the idea would be to ask PostgreSQL to check constraints just before you commit.
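A minimal sketch of that idea, assuming PostgreSQL 9.0 or later (where primary key constraints can be declared DEFERRABLE); the unique index is still maintained row by row, only the constraint check itself moves to commit time:

CREATE TABLE "user"
(
  id serial NOT NULL,
  username character varying(40),
  email character varying(70),
  website character varying(100),
  created integer,
  CONSTRAINT user_pkey PRIMARY KEY (id) DEFERRABLE INITIALLY IMMEDIATE
);

BEGIN;
SET CONSTRAINTS user_pkey DEFERRED;
-- many single-row INSERTs here; the primary-key check is settled at COMMIT
COMMIT;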
