How to use concurrent.futures to parallelize a for loop with one reused TCP connection per thread?

Posted on 2025-02-02 12:58:30

Using concurrent.futures, one can do this:

import concurrent.futures
import urllib.request

import pandas

URLS = ['http://some-made-up-domain.com/sifhzfhihrffhzs',
        'http://some-made-up-domain.com/rthrgfgd',
        'http://some-made-up-domain.com/gsezeraz',
        'http://some-made-up-domain.com/gfsrgerfg',
        'http://some-made-up-domain.com/sdfgdfdfh']

# Retrieve a single page and return its contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

result=[]
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result.append(future.result())
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
# Save for later use in different scripts
pandas.to_pickle(result,'result.pickle')

But now, since all requests are made to the same website (in reality I have 80,000 URLs), how can I reuse the underlying connections? I mean having one connection per worker thread, with each thread reusing its connection across URL calls.
The problem is that I could not find how to initialize an object (representing a connection) once per worker thread, in such a way that the variable name remains common to all threads (that is, thread-local storage).

I’m using Linux and Python 3.9.
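
For illustration, the per-thread initialization described above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a confirmed solution: it swaps urllib.request for the standard library's http.client.HTTPConnection, relies on the initializer/initargs arguments of ThreadPoolExecutor (available since Python 3.7) together with threading.local, and assumes the server honours HTTP keep-alive. The names thread_local, init_worker, fetch_all and connection_factory are hypothetical and do not appear in the original code.

```python
import concurrent.futures
import http.client
import threading
from urllib.parse import urlsplit

# Thread-local storage: every worker thread uses the same name
# (thread_local.conn) but sees its own, independent value.
thread_local = threading.local()


def init_worker(host, timeout, connection_factory):
    # Runs once in each worker thread, right after the thread starts.
    thread_local.conn = connection_factory(host, timeout=timeout)


def load_url(url):
    # Fetch one URL over the calling thread's own connection.
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = thread_local.conn
    try:
        conn.request("GET", target)
        # The body must be read completely before the connection is reused.
        return conn.getresponse().read()
    except Exception:
        # Drop the possibly broken socket; http.client reconnects
        # automatically on the next request().
        conn.close()
        raise


def fetch_all(urls, host, max_workers=5, timeout=60,
              connection_factory=http.client.HTTPConnection):
    # Assumption: every URL in `urls` points at the same `host`.
    results = {}
    with concurrent.futures.ThreadPoolExecutor(
            max_workers=max_workers,
            initializer=init_worker,
            initargs=(host, timeout, connection_factory)) as executor:
        future_to_url = {executor.submit(load_url, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
    return results
```

With the URLS list from the question, this would be called as fetch_all(URLS, 'some-made-up-domain.com'). Unlike urllib.request.urlopen, http.client does not raise on 4xx/5xx responses, so a real script would probably also want to check the response status. The connection_factory parameter is there only so that a different connection class (for example http.client.HTTPSConnection) can be substituted.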
