How to use concurrent.futures to parallelize a for loop, with 1 reused TCP connection per thread?
Using concurrent.futures, one can do this:
import concurrent.futures
import urllib.request

import pandas

URLS = ['http://some-made-up-domain.com/sifhzfhihrffhzs',
        'http://some-made-up-domain.com/rthrgfgd',
        'http://some-made-up-domain.com/gsezeraz',
        'http://some-made-up-domain.com/gfsrgerfg',
        'http://some-made-up-domain.com/sdfgdfdfh']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

result = []
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            result.append(future.result())
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))

# Save for later use in different scripts
pandas.to_pickle(result, 'result.pickle')
But now, since all requests are made to the same website (in reality I have 80,000 URLs), how can I reuse the underlying connections? I mean having one connection per worker thread, with each thread reusing its connection across URL calls.
The problem is that I could not find how to initialize an object (representing a connection) once per worker thread, in such a way that the variable name stays the same across all threads (i.e., thread-local storage).
I'm using Linux and Python 3.9.
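One way to get exactly one object per worker thread is the standard library's `threading.local()`: each thread lazily creates its own instance on first access and then reuses it for every subsequent call, while the variable name (`thread_local.conn`) is the same in all threads. A minimal sketch of the mechanism, where `FakeConnection` is a hypothetical stand-in for a real keep-alive connection (e.g. an `http.client.HTTPConnection`) that merely records how many objects get created:

```python
import concurrent.futures
import threading

# Record the thread id each time a "connection" is constructed.
created = []
created_lock = threading.Lock()

class FakeConnection:
    """Hypothetical stand-in for a reusable keep-alive connection."""
    def __init__(self):
        with created_lock:
            created.append(threading.get_ident())

    def fetch(self, url):
        return f"body of {url}"

thread_local = threading.local()

def get_connection():
    # Lazily create one connection per worker thread, then reuse it.
    if not hasattr(thread_local, "conn"):
        thread_local.conn = FakeConnection()
    return thread_local.conn

def load_url(url):
    conn = get_connection()
    return conn.fetch(url)

urls = [f"http://example.invalid/{i}" for i in range(50)]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(load_url, urls))

# At most 5 connections were created (one per worker thread), not 50,
# and no thread ever created more than one.
```
With 50 URLs and 5 workers, at most 5 connections are ever constructed. As an alternative, `ThreadPoolExecutor` accepts an `initializer=` callable (Python 3.7+) that runs once in each worker thread and could set up `thread_local.conn` eagerly instead of lazily.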