PostgreSQL bulk insert performance problem (compared to MySQL)

I have a lot of data that I want to insert into the database in as little time as possible, so I ran some tests.
I created a table with 21 columns (using script *1 below): one column is an int, the other 20 are strings. There are no indexes. I wrote test code that generates random values and inserts them into the database (using the INSERT statement in *2). Before running the SQL commands I call conn.setAutoCommit(false), and conn.commit() afterwards. This operation takes about 6-7 seconds. The official documentation (*3) says to use the COPY command for bulk inserts, so I created a similar ASCII file and re-ran the test; it finished in about 5 seconds.
Using the same test code on the same machine to insert the same data into MySQL, the test takes less than 1 second. I was really surprised by how large the improvement is compared to 6-7 seconds. Does this difference really exist, or am I overlooking something?

Thanks for your help.

My test configuration is Solaris 10, PostgreSQL 9.0.2, and MySQL 5.0.85.

(*1) PostgreSQL create table script

CREATE TABLE tablo
(
  id integer,
  column1 character varying(50),
  column2 character varying(50),
  column3 character varying(50),
  ....
  column20 character varying(50)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE tablo OWNER TO pgadmin;

(*2) Insert query

INSERT INTO tablo values (1,'column67062724628797','column26007603757271','column73982294239806','column43213154421324','column97722282440805','column79000889379973','column10680880337755','column14322827996050','column80720842739399','column22777514445036','column77771307997926','column92799724462613','column89992937353110','column61693061355353','column43804223262229','column62209656630047','column52150955786400','column85726157993572','column33358888005133','column77743799989746'),(2,'column77383691774831','column67841193885377','column36149612452454','column51161680852595','column91649734476301','column57283307765550','column14997046117948','column29457857794726','column91157683305554','column44413196495111','column40702778794938','column24744999726868','column38356057278249','column16808618337554','column64362413535503','column19577167594144','column72943639162993','column46830376244427','column01942608599939','column66479131355003'),
....
10K lines

(*3) Official PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/interactive/populate.html
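
For reference, here is a minimal sketch of the kind of JDBC test harness described above, using a single transaction and PreparedStatement batching (roughly equivalent to the multi-row INSERT in *2). The JDBC URL, credentials, and random-value scheme are placeholders, not taken from the original test:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkInsertTest {
    public static void main(String[] args) throws Exception {
        // Build "INSERT INTO tablo VALUES (?, ?, ..., ?)" with 21 placeholders.
        StringBuilder sql = new StringBuilder("INSERT INTO tablo VALUES (?");
        for (int i = 0; i < 20; i++) {
            sql.append(", ?");
        }
        sql.append(")");

        // Connection details below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "pgadmin", "secret")) {
            conn.setAutoCommit(false);                       // one transaction for all rows
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                for (int row = 1; row <= 10000; row++) {
                    ps.setInt(1, row);
                    for (int col = 2; col <= 21; col++) {
                        ps.setString(col, "column" + (long) (Math.random() * 1e14));
                    }
                    ps.addBatch();                           // queue the row client-side
                }
                ps.executeBatch();                           // send the whole batch
            }
            conn.commit();                                   // single commit at the end
        }
    }
}

Pointing the same code at MySQL is then just a matter of swapping the JDBC URL and driver, which keeps the comparison apples-to-apples.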

Comments (3)

半暖夏伤 2024-10-20 18:57:39

It seems odd that you're not seeing a speedup from things like COPY. I generated a script to create a similar table and populate it with 10,000 rows, and found that:

  1. Slowest: one row per INSERT, no transaction block, fsync=on
  2. Faster: wrap all the statements (CREATE TABLE, INSERTs, ...) in a transaction
  3. Same speed as 2: same as 1, but with fsync=off
  4. Fastest: restore from a script created by pg_dump, which populates the table using COPY

Methods 2 and 3 were about 4 times faster than method 1. Method 4 was about 10 times faster than 2 or 3.

If I import the same data into MySQL on my machine, it takes about half the time of methods 2 or 3. Dumping and reloading it: the same. Dumping with -e and reloading it: the same. Using InnoDB brought the time up to match methods 2 or 3.

So at least on my hardware/OS combination the speeds of the two are comparable... though of course I look after PostgreSQL's settings better; then again, for a small table like this I wouldn't expect things like the buffer cache size to matter much.

Now, as to how good the JDBC support for doing batch inserts is, I have no idea. I did all these things using just the command-line clients.
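
For what it's worth, the PostgreSQL JDBC driver does expose COPY programmatically through org.postgresql.copy.CopyManager, so the fastest method above is reachable from Java without going through psql. A minimal sketch, with placeholder connection details and generated data standing in for the real file:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyInsertTest {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "pgadmin", "secret")) {

            // Build tab-separated rows in memory; a FileReader over the ASCII
            // file from the question would work the same way.
            StringBuilder data = new StringBuilder();
            for (int row = 1; row <= 10000; row++) {
                data.append(row);
                for (int col = 1; col <= 20; col++) {
                    data.append('\t').append("column").append((long) (Math.random() * 1e14));
                }
                data.append('\n');
            }

            // The cast works for a plain driver connection; pooled connections
            // usually need unwrap(PGConnection.class) instead.
            CopyManager copy = ((PGConnection) conn).getCopyAPI();
            long rows = copy.copyIn("COPY tablo FROM STDIN",
                                    new StringReader(data.toString()));
            System.out.println("Loaded " + rows + " rows");
        }
    }
}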

虚拟世界 2024-10-20 18:57:39

There are two major considerations here:

  1. MySQL is designed for speed above everything else; Postgres is not, and the default MySQL table type does not even handle transactions. The extra features Postgres supports can add a fair bit of overhead in some situations to make them possible.
  2. This is not a very common use case: inserting 10,000 entries at once is extremely uncommon in most applications. What is more common is inserting small chunks of data from different connections simultaneously, and in that case Postgres uses row-level locking, a feature the default MySQL table type does not support, which (theoretically at least) should let it outperform MySQL.

So, unless bulk inserts from a single connection are the norm for your application, this test really doesn't show anything useful... it is more likely that you will be using dozens of connections simultaneously to insert, query, and/or update small chunks of data.
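
As a rough illustration of the workload described in point 2 (many connections each writing small amounts of data at the same time), a test along these lines could replace the single-connection bulk test; the thread count, row counts, column list, and connection details are all placeholder assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentInsertTest {
    private static final String URL = "jdbc:postgresql://localhost:5432/testdb";

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(20);  // 20 concurrent connections
        for (int t = 0; t < 20; t++) {
            final int worker = t;
            pool.submit(() -> insertSmallChunks(worker));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }

    // Each worker opens its own connection and issues many small, autocommitted writes.
    private static void insertSmallChunks(int worker) {
        String sql = "INSERT INTO tablo (id, column1) VALUES (?, ?)";
        try (Connection conn = DriverManager.getConnection(URL, "pgadmin", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < 500; i++) {
                ps.setInt(1, worker * 500 + i);
                ps.setString(2, "value" + i);
                ps.executeUpdate();               // each row is its own transaction
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}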

落在眉间の轻吻 2024-10-20 18:57:39
  1. Create the parent table without any indexes, only the columns and their types (CREATE TABLE some_data (c_1 int, c_2 varchar, ....))
  2. Create a sequence for numbering the new data tables
  3. Take a new id from the sequence
  4. Create a new table for the real data with the LIKE keyword (CREATE TABLE some_data_X (LIKE some_data))
  5. Insert the real data into some_data_X using COPY in binary format
  6. Create indexes and constraints (put your cores to work by using multiple connections to PostgreSQL)
  7. Have the new table inherit the parent table
  8. Now it's ready to SELECT!

This way I have achieved 400,000-500,000 inserts per second with index creation on 10 columns (2 Xeons, 24 cores, 24 GB of memory, SSD).

Bonus: in a separate thread, remove old data (the some_data_X with the smallest X) for a huge circular buffer, with indexes!
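
A rough JDBC sketch of steps 3-7 above, under a few assumptions not stated in the answer: the sequence is called some_data_seq, plain text COPY stands in for the binary-format COPY recommended in step 5, and the connection details and sample rows are placeholders:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.postgresql.PGConnection;

public class StagedLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "pgadmin", "secret");
             Statement st = conn.createStatement()) {

            // Step 3: take a new id from the sequence (created once beforehand,
            // e.g. CREATE SEQUENCE some_data_seq).
            long id;
            try (ResultSet rs = st.executeQuery("SELECT nextval('some_data_seq')")) {
                rs.next();
                id = rs.getLong(1);
            }
            String table = "some_data_" + id;

            // Step 4: clone the parent's column definitions, with no indexes yet.
            st.execute("CREATE TABLE " + table + " (LIKE some_data)");

            // Step 5: bulk-load with COPY (text format here for brevity).
            ((PGConnection) conn).getCopyAPI().copyIn(
                    "COPY " + table + " (c_1, c_2) FROM STDIN",
                    new StringReader("1\tfirst row\n2\tsecond row\n"));

            // Step 6: build indexes only after the data is loaded; several of
            // these can run from separate connections in parallel.
            st.execute("CREATE INDEX " + table + "_c_1_idx ON " + table + " (c_1)");

            // Step 7: attach the new table to the parent via inheritance,
            // making its rows visible to SELECTs on some_data.
            st.execute("ALTER TABLE " + table + " INHERIT some_data");
        }
    }
}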