How to use concurrent.futures to parallelize a for loop, with 1 reused TCP connection per thread?
Using concurrent.futures, one can do this:
import concurrent.futures
import urllib.request

import pandas

URLS = ['http://some-made-up-domain.com/sifhzfhihrffhzs',
        'http://some-made-up-domain.com/rthrgfgd',
        'http://some-made-up-domain.com/gsezeraz',
        'http://some-made-up-domain.com/gfsrgerfg',
        'http://some-made-up-domain.com/sdfgdfdfh']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

result = []
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result.append(future.result())
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))

# Save for later use in different scripts
pandas.to_pickle(result, 'result.pickle')
But now, since all requests are made to the same website (in reality I have 80,000 URLs), how can I reuse the underlying connections? I mean having one connection per worker thread, with each thread reusing its connection across URL calls.
The problem is that I could not find how to initialize an object (representing a connection) once per worker thread, in such a way that the variable name stays the same across all threads (i.e., thread-local storage).
I'm using Linux and Python 3.9.
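One way to get exactly one object per worker thread is the standard library's `threading.local()`: each thread lazily creates its own instance on first access and then reuses it for every subsequent call, while the variable name (`thread_local.conn`) is the same in all threads. A minimal sketch of the mechanism, where `FakeConnection` is a hypothetical stand-in for a real keep-alive connection (e.g. an `http.client.HTTPConnection`) that merely records how many objects get created:

```python
import concurrent.futures
import threading

# Record the thread id each time a "connection" is constructed.
created = []
created_lock = threading.Lock()

class FakeConnection:
    """Hypothetical stand-in for a reusable keep-alive connection."""
    def __init__(self):
        with created_lock:
            created.append(threading.get_ident())

    def fetch(self, url):
        return f"body of {url}"

thread_local = threading.local()

def get_connection():
    # Lazily create one connection per worker thread, then reuse it.
    if not hasattr(thread_local, "conn"):
        thread_local.conn = FakeConnection()
    return thread_local.conn

def load_url(url):
    conn = get_connection()
    return conn.fetch(url)

urls = [f"http://example.invalid/{i}" for i in range(50)]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(load_url, urls))

# At most 5 connections were created (one per worker thread), not 50,
# and no thread ever created more than one.
```
With 50 URLs and 5 workers, at most 5 connections are ever constructed. As an alternative, `ThreadPoolExecutor` accepts an `initializer=` callable (Python 3.7+) that runs once in each worker thread and could set up `thread_local.conn` eagerly instead of lazily.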