Threads in Python run one after the other?
Update1:
If I change the code inside the for loop to:
    print('processing new page')
    pool.apply_async(time.sleep, (5,))
I see a 5 second delay after every print, so the problem isn't related to the webdriver.
Update2:
Thanks to @user56700, but I'm interested in knowing what I did wrong here and how to fix it without switching away from the way I'm using threads.
In Python I have the following code:

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
for url in urls:
    try:
        print('processing new page')
        result = parse_page(driver, url)  # Visit url via driver, wait for it to load and parse its contents (takes 30 sec per page)
        # Change global variables
    except Exception as e:
        log_warning(str(e))
If I have 10 pages, the above code needs 300 seconds to finish, which is a lot.
I read about something called threading in Python (https://stackoverflow.com/a/15144765/19500354), so I wanted to use it, but I'm not sure I'm doing it the right way.
Here's my attempt:
import threading
from multiprocessing.pool import ThreadPool as Pool

G_LOCK = threading.Lock()
driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
pool = Pool(10)
for url in urls:
    try:
        print('processing new page')
        result = pool.apply_async(parse_page, (driver, url,)).get()
        G_LOCK.acquire()
        # Change global variables
        G_LOCK.release()
    except Exception as e:
        log_warning(str(e))
pool.close()
pool.join()
# Here I want to make sure ALL threads have finished working before running the below code
Why is my implementation wrong? Note that I'm using the same driver instance.
I tried to print the time next to "processing new page" and I see:
[10:36:02] processing new page
[10:36:09] processing new page
[10:36:15] processing new page
[10:36:22] processing new page
[10:36:39] processing new page
Which means something is wrong, as I would expect a 1 second difference and nothing more, since all I'm doing is changing global variables.
1 Answer
I just created a simple example to showcase how I would solve it. You need to add your own code, of course.
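A minimal sketch of that idea, reusing chrome_options, parse_page, log_warning and a urls list from the question, and giving each worker thread its own driver, since a single WebDriver instance isn't meant to be driven from several threads at once (concurrent.futures is used here for the pool):

import threading
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

G_LOCK = threading.Lock()
results = []

def worker(url):
    # Each thread gets its own driver instead of sharing one.
    driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
    try:
        result = parse_page(driver, url)  # your own parsing function
        with G_LOCK:
            results.append(result)  # change global state only while holding the lock
    except Exception as e:
        log_warning(str(e))
    finally:
        driver.quit()

with ThreadPoolExecutor(max_workers=10) as pool:
    pool.map(worker, urls)

# The with-block only exits once every submitted task has finished.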
Result:
If you want, you could use the webdriver as a context manager to avoid having to close it, like this:
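In recent Selenium versions the driver object works as a context manager and calls quit() when the with-block exits, so the worker from the sketch above could look like this:

def worker(url):
    # quit() is called automatically when the with-block exits, even on errors.
    with webdriver.Chrome(options=chrome_options, service=Service('./chromedriver')) as driver:
        try:
            result = parse_page(driver, url)
            with G_LOCK:
                results.append(result)
        except Exception as e:
            log_warning(str(e))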
Example using the multiprocessing.pool library:
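A sketch with the same per-thread driver, this time built on multiprocessing.pool.ThreadPool; unlike the loop in the question, .get() is not called right after apply_async, so the tasks are not forced to run one after another:

import threading
from multiprocessing.pool import ThreadPool as Pool

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

G_LOCK = threading.Lock()
results = []

def worker(url):
    with webdriver.Chrome(options=chrome_options, service=Service('./chromedriver')) as driver:
        try:
            result = parse_page(driver, url)
            with G_LOCK:
                results.append(result)  # only the shared-state update is locked
        except Exception as e:
            log_warning(str(e))

pool = Pool(10)
for url in urls:
    pool.apply_async(worker, (url,))  # no .get() here, so submissions don't block
pool.close()
pool.join()
# All worker threads have finished at this point.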
Result: