Python threads running one after another?

Posted 2025-02-13 15:08:30


Update1:

If I change the code inside the for loop to:

print('processing new page')
pool.apply_async(time.sleep, (5,))

I see 5 sec delay after Every printing, so the problem isn't related to webdriver.

Update2:

Thanks to @user56700, but I'm interested in knowing what I did wrong here and how to fix it without switching away from the way I'm using threads.


In Python I have the following code:

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
for url in urls:
    try:
        print('processing new page')
        result = parse_page(driver, url) # Visit url via driver, wait for it to load and parse its contents (takes 30 sec per page)
        # Change global variables
    except Exception as e:
        log_warning(str(e))

If I have 10 pages, the above code needs 300 seconds to finish, which is a lot.

I read about something called threading in Python (https://stackoverflow.com/a/15144765/19500354), so I wanted to use it, but I'm not sure I'm doing it the right way.

Here's my try:

import threading
from multiprocessing.pool import ThreadPool as Pool
G_LOCK = threading.Lock()

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
pool = Pool(10)
for url in urls:
    try:
        print('processing new page')
        result = pool.apply_async(parse_page, (driver, url,)).get()
        G_LOCK.acquire()
        # Change global variables
        G_LOCK.release()
    except Exception as e:
        log_warning(str(e))

pool.close()
pool.join()

# Here I want to make sure ALL threads have finished working before running the below code

Why is my implementation wrong? Note that I'm using the same driver instance.

I tried to print time next to processing new page and I see:

[10:36:02] processing new page
[10:36:09] processing new page
[10:36:15] processing new page
[10:36:22] processing new page
[10:36:39] processing new page

Which means something is wrong, as I would expect about a 1-second difference and nothing more, since all I'm doing is changing global variables.
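The serialized timestamps come from calling `.get()` immediately after `apply_async`: `.get()` blocks the loop until that one task finishes, so the pool never runs anything in parallel. A minimal sketch, using `time.sleep` as a stand-in for the slow `parse_page` call, contrasting the two submission patterns:

```python
import time
from multiprocessing.pool import ThreadPool

def work(_):
    time.sleep(0.2)  # stand-in for a slow parse_page call

pool = ThreadPool(4)

# Anti-pattern from the question: .get() right after apply_async
# blocks until that single task finishes, so tasks run one at a time.
start = time.time()
for i in range(4):
    pool.apply_async(work, (i,)).get()
serialized = time.time() - start

# Submit everything first, then collect: the tasks overlap.
start = time.time()
async_results = [pool.apply_async(work, (i,)) for i in range(4)]
for r in async_results:
    r.get()
concurrent = time.time() - start

pool.close()
pool.join()
print(f"serialized: {serialized:.2f}s, concurrent: {concurrent:.2f}s")
```

With 4 tasks of 0.2 s each, the first loop takes roughly 0.8 s while the second finishes in roughly the time of one task.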


时常饿 2025-02-20 15:08:31


I just created a simple example to showcase how I would solve it. You'll need to add your own code, of course.

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver

driver = webdriver.Chrome()
urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
your_data = []

def parse_page(driver, url):
    driver.get(url)
    data = driver.title
    return data

with ThreadPoolExecutor(max_workers=10) as executor:
    results = {executor.submit(parse_page, driver, url) for url in urls}
    for result in as_completed(results):
        your_data.append(result.result())

driver.close()
print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']

If you want, you could use the webdriver as a context manager to avoid having to close it, like this:

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    with ThreadPoolExecutor(max_workers=10) as executor:
        results = {executor.submit(parse_page, driver, url) for url in urls}
        for result in as_completed(results):
            your_data.append(result.result())

print(your_data)

Example using the multiprocessing.pool library:

from selenium import webdriver
from multiprocessing.pool import ThreadPool

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    with ThreadPool(processes=10) as pool:
        results = [pool.apply_async(parse_page, (driver, url)) for url in urls]
        for result in results:
            your_data.append(result.get())

print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
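Tying this back to the question's lock-protected global updates: collect each result on the main thread via `as_completed` and take the lock only around the shared-state update. A minimal sketch, with a dummy `parse_page` standing in for the real Selenium call and a hypothetical `page_lengths` dict as the "global variables":

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for the real parse_page(driver, url); it just derives
# a fake result from the url so the example is runnable.
def parse_page(driver, url):
    return len(url)

G_LOCK = threading.Lock()
page_lengths = {}  # the shared "global variables" the question updates

urls = ["https://example.com/a", "https://example.com/bb"]
driver = None  # would be the shared webdriver instance

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(parse_page, driver, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        result = fut.result()
        # Only the shared-state update needs the lock, not the parsing.
        with G_LOCK:
            page_lengths[url] = result

# The executor's context manager waits for all worker threads, so
# anything after this line runs only once every page is done.
print(page_lengths)
```

The `with ThreadPoolExecutor(...)` block answers the "make sure ALL threads have finished" comment in the question: leaving the block joins all workers.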