pyodbc - very slow bulk insert speed
With this table:
CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)
the following code takes 40 seconds to run:
import pyodbc
from datetime import date
conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)
cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()
The equivalent code with psycopg2 only takes 3 seconds. I don't think mssql is that much slower than postgresql. Any idea on how to improve the bulk insert speed when using pyodbc?
EDIT: Add some notes following ghoerz's discovery
In pyodbc, the flow of executemany is:
- prepare statement
- loop for each set of parameters:
  - bind the set of parameters
  - execute
In ceODBC, the flow of executemany is:
- prepare statement
- bind all parameters
- execute
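In other words, with the older pyodbc behaviour the executemany call above does roughly the same amount of round-trip work as executing the statement row by row yourself. This is only an illustration of the per-row pattern described above, not pyodbc's actual implementation:

cursor = conn.cursor()
for row in rows:
    # one bind + execute (and hence one server round trip) per parameter set
    cursor.execute('INSERT INTO test_insert VALUES (?, ?, ?)', row)
conn.commit()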
I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.
I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.
Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.
Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was subsecond, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your insert and commit after each chunk. I found that 5k rows per chunk delivered good performance, but this obviously depends on many factors (the data, the machine, DB config, etc.).
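A minimal sketch of that chunked approach, reusing the connection string and the rows list from the question (the chunk size is just the value that worked for the answer above; fast_executemany needs pyodbc 4.0.19 or later):

import pyodbc

# autocommit defaults to False, which is what we want so that we control
# exactly when each chunk is committed
conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy',
                      autocommit=False)
cursor = conn.cursor()
cursor.fast_executemany = True  # requires pyodbc 4.0.19+

sql = 'INSERT INTO test_insert (col1, col2, col3) VALUES (?, ?, ?)'
chunk_size = 5000  # tune for your data, machine, and DB config

for start in range(0, len(rows), chunk_size):
    cursor.executemany(sql, rows[start:start + chunk_size])
    conn.commit()  # commit per chunk to keep the transaction log small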
Tried both ceODBC and mxODBC and both were also painfully slow. Ended up going with an adodb connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!
pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
I wrote the data to a text file and then invoked the BCP utility. Much, much quicker. From about 20 to 30 minutes down to a few seconds.
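A rough sketch of that idea, using the table and rows from the question; the file name, the dbo schema, and the exact bcp invocation are illustrative and assume the bcp command-line utility is installed:

import csv
import subprocess

# Dump the rows to a tab-delimited text file
with open('test_insert.dat', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)

# Load it with bcp: -c uses character mode, whose default field
# terminator is a tab, matching the file written above
subprocess.check_call([
    'bcp', 'test.dbo.test_insert', 'in', 'test_insert.dat',
    '-S', 'localhost', '-U', 'xxx', '-P', 'yyy', '-c',
])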
I was using pypyODBC w/ python 3.5 and Microsoft SQL Server Management Studio.
A particular table (~70K rows w/ 40 vars) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.
With ceODBC it took 4 seconds.
I found it insanely slow while uploading ~1.6 million rows (from ~195 csv files) to a Microsoft SQL Server database table. I did lots of research on the net, but nothing worked for me. After reading this article, Why do the queries slow down after a few minutes, when trying to build a DB?, I looked into how the table was structured. My table had 15 columns and all of them had indexes. I deleted all of them, ran my python script, and it was a damn miracle. My upload time was reduced by 98%.
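A hedged sketch of that technique against the question's table: drop the nonclustered indexes before the bulk load and recreate them afterwards, so the index is built once instead of being maintained row by row (the index and column names here are hypothetical):

cursor = conn.cursor()

# Drop the (hypothetical) nonclustered index before the bulk load...
cursor.execute('DROP INDEX IX_test_insert_col2 ON dbo.test_insert')
conn.commit()

# ...load the data...
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()

# ...then rebuild the index once the data is in
cursor.execute('CREATE INDEX IX_test_insert_col2 ON dbo.test_insert (col2)')
conn.commit()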