Python: df.to_sql with fast_executemany=True is still not fast enough for 850K records into SQL Server over a VPN connection
I am trying to write an ETL process in Python to load data from my local machine into SQL Server. The file I am working with has 46 columns and around 850K records. I first tried without fast_executemany=True, but that took almost 3 hours. I read online, added fast_executemany=True, and tested both ways with 10K records; fast_executemany=True only shaved off 10-15 seconds for those 10K records. I have read online that people have used this and it took them only 5 minutes to insert almost a million rows.
Below is my Python script.
from datetime import datetime
import math
from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver

# Creating an engine connection
engine = create_engine(Database_Con, fast_executemany=True).connect()

# "test" is the DataFrame with 46 columns and ~850K rows
test.to_sql('TEST123',
            engine,
            schema='dbo',
            chunksize=math.floor(2000 / len(test.columns)),
            # chunksize=1000,
            method='multi',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Is there anything that I'm missing here, or is there a faster way to do this? I cannot use BULK INSERT due to access issues.
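One alternative worth benchmarking, since bulk-load permissions are off the table: drive pyodbc directly with its own fast_executemany flag and a plain parameterized INSERT, skipping the pandas/SQLAlchemy layer. This is a minimal sketch rather than a guaranteed speed-up; the connection string, the dbo.TEST123 target, and the NaN handling are assumptions to adapt.

import pyodbc

conn = pyodbc.connect(
    'DRIVER={SQL Server Native Client 11.0};'
    'SERVER=server;DATABASE=db;Trusted_Connection=yes;'
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter sets in large batches, not row by row

# One parameterized INSERT with a placeholder per DataFrame column
columns = ', '.join(test.columns)
placeholders = ', '.join(['?'] * len(test.columns))
sql = f'INSERT INTO dbo.TEST123 ({columns}) VALUES ({placeholders})'

# pyodbc expects None for NULL, so NaN values are replaced before the insert
rows = test.astype(object).where(test.notna(), None).values.tolist()
cursor.executemany(sql, rows)
conn.commit()
cursor.close()
conn.close()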
Edited
I saw an article explaining that fast_executemany already batches the parameterized inserts on its own, so passing chunksize and method='multi' on top of it (as in the script above) was what kept it from working properly. After I removed chunksize and method, the time to insert the 850K records dropped from 3 hours to around 25 minutes.
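A quick way to confirm what is actually being sent is SQLAlchemy's cursor-execute event, which reports whether a statement goes out through the driver's executemany path (the one fast_executemany accelerates) or as a single multi-values execute (what method='multi' produces). A minimal sketch, assuming the engine is kept in its own variable before connecting:

from sqlalchemy import create_engine, event

engine = create_engine(Database_Con, fast_executemany=True)

@event.listens_for(engine, "before_cursor_execute")
def log_execute_mode(conn, cursor, statement, parameters, context, executemany):
    # True means the rows go through executemany (the path fast_executemany speeds up);
    # False means a single statement, which is what method='multi' generates
    print('executemany:', executemany, '| statement:', statement[:80])

with engine.connect() as connection:
    test.to_sql('TEST123', connection, schema='dbo', if_exists='append', index=False)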
Below is my code.
from datetime import datetime
from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver

# Creating an engine connection
engine = create_engine(Database_Con, fast_executemany=True).connect()

# No chunksize or method='multi' this time; fast_executemany handles the batching
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Although this is much better than before, I feel like there has to be something faster. I don't think it should take more than 5 minutes.
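One variation that might be worth timing, offered only as a guess: keep fast_executemany and leave method out, but pass a plain chunksize so the 850K rows go out in fixed-size executemany batches instead of one enormous call. This mainly bounds memory per round trip; the 10000 figure below is an arbitrary starting point to tune, not a recommendation.

# Same engine/connection as above; rows are sent in executemany batches of
# `chunksize` rows, which fast_executemany still accelerates (no method='multi')
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False,
            chunksize=10000)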
I am posting my findings here as an answer. I had to remove chunksize and method from test.to_sql because, based on the article I read, fast_executemany already does the batching itself, so specifying method='multi' on top of it was defeating it. That change brought the timing down from 3 hours to roughly 20 minutes.
To Gord Thompson's point, since I am connecting to the company's network over a VPN, that is likely why it is comparatively slower, but 20 minutes is still not bad.
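One more caveat that reportedly matters with fast_executemany: binding parameters against VARCHAR(max)/NVARCHAR(max) columns is said to be slow and memory-hungry, and if to_sql created TEST123 itself, its text columns will default to those wide types. If the table already existed with sensible widths this does not apply; otherwise a sketch like the following, with hypothetical column names, caps the widths when to_sql has to create the table.

from sqlalchemy.types import NVARCHAR

# Hypothetical column names; dtype only takes effect when to_sql creates the
# table, since an existing table keeps its own column definitions
dtype_map = {'customer_name': NVARCHAR(100), 'comments': NVARCHAR(500)}

test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False,
            dtype=dtype_map)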