PostgreSQL connection closed unexpectedly while doing a large insert
I'm populating a PostgreSQL table with ~11,000,000 rows that were previously selected from another database. I'm using Python and psycopg2. The whole process takes an estimated 1.5 hours to complete. However, after ~30 minutes I get a "connection closed unexpectedly" exception. The source code looks like this:
incursor = indb.cursor()
incursor.execute("SELECT ...")
indb.commit()                          # (1) close transaction
outcursor = outdb.cursor()
rows = 0
for (col1, col2, col3) in incursor:    # incursor contains ~11,000,000 rows
    outcursor.execute("INSERT ...", (col1, col2, col3))   # This fails after ~30 minutes
    rows += 1
    if rows % 100 == 0:                # (2) Write data every 100 rows
        outcursor.close()
        outdb.commit()
        outcursor = outdb.cursor()
incursor.close()
outcursor.close()
outdb.commit()
I inserted (1) and (2) after the first tries failed, assuming that an open transaction has an upper time limit of ~30 minutes, or that a cursor has an upper limit of pending inserts. It seems that neither of these assumptions is true and the error lies somewhere else.
Both databases are stored on a VirtualBox machine that I connect to via port forwarding from the host. I run the program on the host machine.
Both databases are just for testing purposes and they have no other connections to manage. Maybe I have to rewrite the problem to get around this, but I need to do very time-consuming inserts elsewhere (running for approximately days), so I'm very concerned about some hidden time limit in psycopg2 or PostgreSQL.
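For reference, here is a minimal sketch of the same batching loop, assuming plain psycopg2 client-side cursors: a commit() on the connection is enough to flush the pending inserts, and the cursor does not have to be closed and reopened between batches. The connection strings, table, and column names are placeholders, not from the original code.

import psycopg2

BATCH_SIZE = 100   # the original code commits every 100 rows

indb = psycopg2.connect("dbname=source host=localhost")    # placeholder conninfo strings
outdb = psycopg2.connect("dbname=target host=localhost")

incursor = indb.cursor()
incursor.execute("SELECT col1, col2, col3 FROM source_table")   # hypothetical query

outcursor = outdb.cursor()
rows = 0
for col1, col2, col3 in incursor:
    outcursor.execute(
        "INSERT INTO target_table (col1, col2, col3) VALUES (%s, %s, %s)",   # hypothetical table
        (col1, col2, col3),
    )
    rows += 1
    if rows % BATCH_SIZE == 0:
        outdb.commit()    # committing the connection is enough; the cursor stays usable

outdb.commit()            # flush the final partial batch
incursor.close()
outcursor.close()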
Comments (3)
I don't know of any such "hidden" timeout in PostgreSQL itself. PostgreSQL does have statement_timeout, but if you hit that you should get an ERROR: canceling statement due to statement timeout in the server log (and it will log the cancelled statement too). I can't speak for psycopg2. Definitely check the server log for anything that looks relevant.

Maybe it's a networking issue? A long-running statement will be a TCP connection that stays idle for a long time. Perhaps your port forwarding purges connections that are idle for more than 30 minutes? Maybe your TCP connections aren't using keepalive. PostgreSQL has some settings for tuning TCP keepalive (tcp_keepalives_interval etc.) and you may also need to do some kernel/networking configuration to make sure they are actually enabled.
E.g. I just tried connecting to my own machine here, and tcp_keepalives_interval defaults to 7200, which is 2 hours. If your port forwarding cuts off after 30 minutes, this default won't do. You can override the setting used in the client connection string (assuming you can twiddle the conninfo string directly), or set the GUC variable in user/database properties or postgresql.conf. See:
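As a hedged sketch of the client-side route: the keepalive knobs are ordinary libpq connection parameters, which psycopg2 forwards to the server connection. The host, port, credentials, and values below are placeholders, not taken from the question.

import psycopg2

# libpq keepalive parameters can be passed straight to psycopg2.connect();
# the values here are examples only.
outdb = psycopg2.connect(
    host="localhost",           # placeholder: the forwarded VirtualBox port
    port=5432,
    dbname="target",
    user="postgres",
    password="secret",
    keepalives=1,               # enable TCP keepalive on the client socket
    keepalives_idle=600,        # seconds of idle time before the first probe
    keepalives_interval=30,     # seconds between probes
    keepalives_count=5,         # failed probes before the connection is dropped
)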
To insert millions of rows, I'd look through the official guide to populating a database and consider using COPY.
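For example, a minimal sketch of streaming a batch into the target table with psycopg2's copy_expert and an in-memory buffer — the connection string, table, and column names are made up for illustration:

import io
import psycopg2

outdb = psycopg2.connect("dbname=target host=localhost")    # placeholder conninfo
outcursor = outdb.cursor()

# Collect a batch of rows into a tab-separated in-memory buffer
# (`rows` is a stand-in for data fetched from the source database).
rows = [(1, "a", "b"), (2, "c", "d")]
buf = io.StringIO()
for col1, col2, col3 in rows:
    buf.write(f"{col1}\t{col2}\t{col3}\n")
buf.seek(0)

# One COPY ... FROM STDIN round trip instead of one INSERT per row.
outcursor.copy_expert(
    "COPY target_table (col1, col2, col3) FROM STDIN WITH (FORMAT text)",
    buf,
)
outdb.commit()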
I have a Django admin command which updates thousands and thousands of rows. After some time, I see the same error. I believe the memory usage exceeds the limit. Don't know how to manually control the transaction in commands, though.
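One possible way to take manual control of the transaction inside a management command is to commit in bounded chunks with Django's transaction.atomic — a minimal sketch, assuming a hypothetical Item model and command module:

# Hypothetical file: myapp/management/commands/bulk_update.py
from django.core.management.base import BaseCommand
from django.db import transaction

from myapp.models import Item    # hypothetical app and model


class Command(BaseCommand):
    help = "Update rows in bounded chunks, one transaction per chunk."

    def handle(self, *args, **options):
        chunk_size = 1000    # illustrative chunk size
        ids = list(Item.objects.values_list("pk", flat=True))
        for start in range(0, len(ids), chunk_size):
            chunk = ids[start:start + chunk_size]
            # Each chunk gets its own transaction, so no single transaction
            # (or its memory/lock footprint) grows without bound.
            with transaction.atomic():
                for item in Item.objects.filter(pk__in=chunk).iterator():
                    item.counter += 1    # hypothetical per-row update
                    item.save()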