Python: df.to_sql with fast_executemany=True is still not fast enough for 850K records into SQL Server over a VPN connection
I am trying to write an ETL process in Python to load data from my local machine into SQL Server. The file I am working with has 46 columns and around 850K records. I first tried without fast_executemany=True, but that took almost 3 hours. I read online, added fast_executemany=True, and tested both ways with 10K records; fast_executemany=True only shaved off 10-15 seconds for those 10K records. I have read online that people have used this and it took them only 5 minutes to insert almost a million rows.
Below is my Python script.
from datetime import datetime
import math
from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver

# Creating an engine connection
engine = create_engine(Database_Con, fast_executemany=True).connect()

# "test" is the DataFrame with 46 columns and ~850K rows
test.to_sql('TEST123',
            engine,
            schema='dbo',
            chunksize=math.floor(2000 / len(test.columns)),
            # chunksize=1000,
            method='multi',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Is there anything that I'm missing here, or is there a faster way to do this? I cannot use BULK INSERT due to access issues.
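One alternative worth benchmarking, since bulk-load permissions are off the table: drive pyodbc directly with its own fast_executemany flag and a plain parameterized INSERT, skipping the pandas/SQLAlchemy layer. This is a minimal sketch rather than a guaranteed speed-up; the connection string, the dbo.TEST123 target, and the NaN handling are assumptions to adapt.

import pyodbc

conn = pyodbc.connect(
    'DRIVER={SQL Server Native Client 11.0};'
    'SERVER=server;DATABASE=db;Trusted_Connection=yes;'
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter sets in large batches, not row by row

# One parameterized INSERT with a placeholder per DataFrame column
columns = ', '.join(test.columns)
placeholders = ', '.join(['?'] * len(test.columns))
sql = f'INSERT INTO dbo.TEST123 ({columns}) VALUES ({placeholders})'

# pyodbc expects None for NULL, so NaN values are replaced before the insert
rows = test.astype(object).where(test.notna(), None).values.tolist()
cursor.executemany(sql, rows)
conn.commit()
cursor.close()
conn.close()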
Edited
I saw an article explaining that fast_executemany already batches the parameterized inserts on its own, so passing chunksize and method='multi' on top of it (as in the script above) was what kept it from working properly. After I removed chunksize and method, the time to insert the 850K records dropped from 3 hours to around 25 minutes.
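A quick way to confirm what is actually being sent is SQLAlchemy's cursor-execute event, which reports whether a statement goes out through the driver's executemany path (the one fast_executemany accelerates) or as a single multi-values execute (what method='multi' produces). A minimal sketch, assuming the engine is kept in its own variable before connecting:

from sqlalchemy import create_engine, event

engine = create_engine(Database_Con, fast_executemany=True)

@event.listens_for(engine, "before_cursor_execute")
def log_execute_mode(conn, cursor, statement, parameters, context, executemany):
    # True means the rows go through executemany (the path fast_executemany speeds up);
    # False means a single statement, which is what method='multi' generates
    print('executemany:', executemany, '| statement:', statement[:80])

with engine.connect() as connection:
    test.to_sql('TEST123', connection, schema='dbo', if_exists='append', index=False)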
Below is my code.
from datetime import datetime
from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver

# Creating an engine connection
engine = create_engine(Database_Con, fast_executemany=True).connect()

# No chunksize or method='multi' this time; fast_executemany handles the batching
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Although this is much better than before, I feel like there has to be something faster. I don't think it should take more than 5 minutes.
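One variation that might be worth timing, offered only as a guess: keep fast_executemany and leave method out, but pass a plain chunksize so the 850K rows go out in fixed-size executemany batches instead of one enormous call. This mainly bounds memory per round trip; the 10000 figure below is an arbitrary starting point to tune, not a recommendation.

# Same engine/connection as above; rows are sent in executemany batches of
# `chunksize` rows, which fast_executemany still accelerates (no method='multi')
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False,
            chunksize=10000)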
I am posting my findings here as an answer. I had to remove chunksize and method from test.to_sql because, based on the article I read, fast_executemany already does the batching itself, so specifying method='multi' on top of it was defeating it. That change brought the timing down from 3 hours to roughly 20 minutes.
To Gord Thompson's point, since I am connecting to the company's network over a VPN, that is likely why it is comparatively slower, but 20 minutes is still not bad.
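One more caveat that reportedly matters with fast_executemany: binding parameters against VARCHAR(max)/NVARCHAR(max) columns is said to be slow and memory-hungry, and if to_sql created TEST123 itself, its text columns will default to those wide types. If the table already existed with sensible widths this does not apply; otherwise a sketch like the following, with hypothetical column names, caps the widths when to_sql has to create the table.

from sqlalchemy.types import NVARCHAR

# Hypothetical column names; dtype only takes effect when to_sql creates the
# table, since an existing table keeps its own column definitions
dtype_map = {'customer_name': NVARCHAR(100), 'comments': NVARCHAR(500)}

test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False,
            dtype=dtype_map)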