Psycopg2、Postgresql、Python:批量插入的最快方法

发布于 2024-08-22 02:27:00 字数 584 浏览 7 评论 0原文

我正在寻找将数百万个元组批量插入数据库的最有效方法。我正在使用 Python、PostgreSQL 和 psycopg2

我创建了一个长长的郁金香列表,应将其插入数据库,有时使用几何Simplify等修饰符。

最简单的方法是对 INSERT 语句列表进行字符串格式化,但我还读过其他三种方法:

  1. 使用 pyformat 绑定样式 用于参数插入
  2. 使用 executemany元组列表,以及
  3. 使用将结果写入文件并使用COPY

似乎第一种方法是最有效的,但我很感激您的见解和代码片段告诉我如何正确地做。

I'm looking for the most efficient way to bulk-insert some millions of tuples into a database. I'm using Python, PostgreSQL and psycopg2.

I have created a long list of tulpes that should be inserted to the database, sometimes with modifiers like geometric Simplify.

The naive way to do it would be string-formatting a list of INSERT statements, but there are three other methods I've read about:

  1. Using pyformat binding style for parametric insertion
  2. Using executemany on the list of tuples, and
  3. Using writing the results to a file and using COPY.

It seems that the first way is the most efficient, but I would appreciate your insights and code snippets telling me how to do it right.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

烟雨凡馨 2024-08-29 02:27:00

是的,我会投票支持 COPY,前提是您可以将文件写入服务器的硬盘驱动器(不是运行应用程序的驱动器),因为 COPY 只会从服务器读取。

Yeah, I would vote for COPY, providing you can write a file to the server's hard drive (not the drive the app is running on) as COPY will only read off the server.

合久必婚 2024-08-29 02:27:00

有一个新的 psycopg2 手册,其中包含所有选项的示例。

COPY 选项是最有效的。然后执行many。然后用pyformat执行。

There is a new psycopg2 manual containing examples for all the options.

The COPY option is the most efficient. Then the executemany. Then the execute with pyformat.

墨小沫ゞ 2024-08-29 02:27:00

根据我的经验 executemany 并不比自己运行许多插入更快,
最快的方法是自己格式化一个具有多个值的 INSERT ,也许将来 executemany 会改进,但目前

我子类化 list< /code> 并重载追加方法,因此当列表达到一定大小时,我格式化 INSERT 来运行它

in my experience executemany is not any faster than running many inserts yourself,
the fastest way is to format a single INSERT with many values yourself, maybe in the future executemany will improve but for now it is quite slow

i subclass a list and overload the append method ,so when a the list reaches a certain size i format the INSERT to run it

梦年海沫深 2024-08-29 02:27:00

插入许多项目的最新方法是使用 execute_values 帮助器 (https://www.psycopg.org/docs/extras.html#fast-execution-helpers)。

from psycopg2.extras import execute_values

insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# this is optional
value_template="(%s, %s, to_timestamp(%s))"

cur = conn.cursor()

items = []
items.append((1, "name", 123123))
# append more...

execute_values(cur, insert_sql, items, value_template)
conn.commit()

The newest way of inserting many items is using the execute_values helper (https://www.psycopg.org/docs/extras.html#fast-execution-helpers).

from psycopg2.extras import execute_values

insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# this is optional
value_template="(%s, %s, to_timestamp(%s))"

cur = conn.cursor()

items = []
items.append((1, "name", 123123))
# append more...

execute_values(cur, insert_sql, items, value_template)
conn.commit()
雅心素梦 2024-08-29 02:27:00

您可以使用新的upsert库:(

$ pip install upsert

您可能必须pip installdecorator 首先)

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)

其中 selector 是一个 dict 对象,例如 {'name': 'Chris Smith'}setter > 是一个类似于 { 'age': 28, 'state': 'WI' }dict

几乎与编写自定义 INSERT[ 一样快/UPDATE] 代码并直接使用 psycopg2 运行它...如果该行已经存在,它就不会爆炸。

You could use a new upsert library:

$ pip install upsert

(you may have to pip install decorator first)

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)

Where selector is a dict object like {'name': 'Chris Smith'} and setter is a dict like { 'age': 28, 'state': 'WI' }

It's almost as fast as writing custom INSERT[/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.

忘羡 2024-08-29 02:27:00

使用 SQLalchemy 的任何人都可以尝试 1.2 版本,该版本添加了对批量插入的支持,以在使用 use_batch_mode=True 初始化引擎时使用 psycopg2.extras.execute_batch() 而不是 executemany,例如:

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)

http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109

然后有人必须使用 SQLalchmey,他们不会费心尝试 sqla 和 psycopg2 的不同组合以及直接 SQL 在一起。

Anyone using SQLalchemy could try 1.2 version which added support of bulk insert to use psycopg2.extras.execute_batch() instead of executemany when you initialize your engine with use_batch_mode=True like:

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)

http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109

Then someone would have to use SQLalchmey won't bother to try different combinations of sqla and psycopg2 and direct SQL together.

赤濁 2024-08-29 02:27:00

经过一些测试,取消嵌套经常似乎是一个非常快的选择,正如我从@Clodoaldo Neto答案中了解到的类似的问题。

data = [(1, 100), (2, 200), ...]  # list of tuples

cur.execute("""CREATE TABLE table1 AS
               SELECT u.id, u.var1
               FROM unnest(%s) u(id INT, var1 INT)""", (data,))

然而,对于极大的数据可能会很棘手

After some testing, unnest often seems to be an extremely fast option, as I learned from @Clodoaldo Neto's answer to a similar question.

data = [(1, 100), (2, 200), ...]  # list of tuples

cur.execute("""CREATE TABLE table1 AS
               SELECT u.id, u.var1
               FROM unnest(%s) u(id INT, var1 INT)""", (data,))

However, it can be tricky with extremely large data.

最后的乘客 2024-08-29 02:27:00

第一个和第二个将一起使用,而不是单独使用。不过,第三种方法在服务器方面是最高效的,因为服务器将完成所有艰苦的工作。

The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.

终止放荡 2024-08-29 02:27:00

一个非常相关的问题: Bulk insert with SQLAlchemy ORM


条条大路通向罗马,但其中一些要翻山越岭,需要渡轮,但如果你想快速到达那里,只需走高速公路即可。


在这种情况下,高速公路将使用 execute_batch() psycopg2 的功能。文档说得最好:

executemany() 的当前实现(用一种极其仁慈的轻描淡写的说法)表现不佳。这些函数可用于加速针对一组参数的语句的重复执行。通过减少服务器往返次数,性能可以比使用 executemany() 好几个数量级。

在我自己的测试中,execute_batch() 的速度大约是 executemany() 的两倍,并且提供了配置 page_size 以便进一步调整的选项(如果您想榨取驱动程序最后 2-3% 的性能)。

如果您使用 SQLAlchemy,在使用 create_engine() 实例化引擎时将 use_batch_mode=True 设置为参数,则可以轻松启用相同的功能

A very related question: Bulk insert with SQLAlchemy ORM


All Roads Lead to Rome, but some of them crosses mountains, requires ferries but if you want to get there quickly just take the motorway.


In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it the best:

The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().

In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).

The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine()

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文