pyodbc - very slow bulk insert speed
With this table:
CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)
the following code takes 40 seconds to run:
import pyodbc
from datetime import date
conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)
cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()
The equivalent code with psycopg2 only takes 3 seconds. I don't think mssql is that much slower than postgresql. Any idea on how to improve the bulk insert speed when using pyodbc?
EDIT: Add some notes following ghoerz's discovery
In pyodbc, the flow of executemany is:
- prepare statement
- loop for each set of parameters:
  - bind the set of parameters
  - execute
In ceODBC, the flow of executemany is:
- prepare statement
- bind all parameters
- execute
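In other words, with the older pyodbc behaviour the executemany call above does roughly the same amount of round-trip work as executing the statement row by row yourself. This is only an illustration of the per-row pattern described above, not pyodbc's actual implementation:

cursor = conn.cursor()
for row in rows:
    # one bind + execute (and hence one server round trip) per parameter set
    cursor.execute('INSERT INTO test_insert VALUES (?, ?, ?)', row)
conn.commit()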
I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.
I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.
Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.
Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was subsecond, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your insert and commit after each chunk. I found that 5k rows per chunk delivered good performance, but this obviously depends on many factors (the data, the machine, DB config, etc.).
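A minimal sketch of that chunked approach, reusing the connection string and the rows list from the question (the chunk size is just the value that worked for the answer above; fast_executemany needs pyodbc 4.0.19 or later):

import pyodbc

# autocommit defaults to False, which is what we want so that we control
# exactly when each chunk is committed
conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy',
                      autocommit=False)
cursor = conn.cursor()
cursor.fast_executemany = True  # requires pyodbc 4.0.19+

sql = 'INSERT INTO test_insert (col1, col2, col3) VALUES (?, ?, ?)'
chunk_size = 5000  # tune for your data, machine, and DB config

for start in range(0, len(rows), chunk_size):
    cursor.executemany(sql, rows[start:start + chunk_size])
    conn.commit()  # commit per chunk to keep the transaction log small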
Tried both ceODBC and mxODBC and both were also painfully slow. Ended up going with an adodb connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!
pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
I wrote the data to a text file and then invoked the BCP utility. Much, much quicker. From about 20 to 30 minutes down to a few seconds.
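A rough sketch of that idea, using the table and rows from the question; the file name, the dbo schema, and the exact bcp invocation are illustrative and assume the bcp command-line utility is installed:

import csv
import subprocess

# Dump the rows to a tab-delimited text file
with open('test_insert.dat', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)

# Load it with bcp: -c uses character mode, whose default field
# terminator is a tab, matching the file written above
subprocess.check_call([
    'bcp', 'test.dbo.test_insert', 'in', 'test_insert.dat',
    '-S', 'localhost', '-U', 'xxx', '-P', 'yyy', '-c',
])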
I was using pypyODBC w/ python 3.5 and Microsoft SQL Server Management Studio.
A particular table (~70K rows w/ 40 vars) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.
With ceODBC it took 4 seconds.
I found it insanely slow while uploading ~1.6 million rows (from ~195 csv files) to a Microsoft SQL Server database table. I did lots of research on the net, but nothing worked for me. After reading this article, Why do the queries slow down after a few minutes, when trying to build a DB?, I looked into how the table was structured. My table had 15 columns and all of them had indexes. I deleted all of them, ran my python script, and it was a damn miracle. My upload time was reduced by 98%.
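A hedged sketch of that technique against the question's table: drop the nonclustered indexes before the bulk load and recreate them afterwards, so the index is built once instead of being maintained row by row (the index and column names here are hypothetical):

cursor = conn.cursor()

# Drop the (hypothetical) nonclustered index before the bulk load...
cursor.execute('DROP INDEX IX_test_insert_col2 ON dbo.test_insert')
conn.commit()

# ...load the data...
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()

# ...then rebuild the index once the data is in
cursor.execute('CREATE INDEX IX_test_insert_col2 ON dbo.test_insert (col2)')
conn.commit()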