Playwright - Python parallel scraping of a URL list
I have a list of URLs. While iterating over the list, each URL is opened in a new page via context.new_page().goto(url). I want to open multiple pages and scrape them in parallel.
Note: each new page opens a new tab at the specified URL.
I tried the following code snippet using joblib.
from joblib import Parallel, delayed

def fetch_info(profile, _page):
    name_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > div.flex.items-center.mt2 > h2"
    gender_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p.mt1.mb0.line-height-4"
    age_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p:nth-child(3)"
    image_path = "#big-photo > img"

    name = _page.locator(name_path).inner_html()
    gender = _page.locator(gender_path).inner_html()
    age = _page.locator(age_path).inner_html()
    image = _page.locator(image_path).get_attribute("src")

    def clean_data(text):
        return text.strip().replace("\n", "")

    name, gender, age, image = clean_data(name), clean_data(gender), clean_data(age), clean_data(image)
    gender = gender.split("/")[0].strip()
    age = age.split("•")[0].strip()

    user = {
        "name": name,
        "gender": gender,
        "age": age,
        "image_url": image,
    }
    return user

results = Parallel(n_jobs=4)(delayed(fetch_info)(profile, context.new_page()) for profile in user_profiles)
print(results)
Below is the error I get:
greenlet.error: cannot switch to a different thread
1 Answer
I have been attempting to resolve the same issue. From my research, Playwright was not designed to work this way in Python, but its synchronous API can be run in a multi-threaded manner rather than a truly parallel one. So it will let you open and "scrape" multiple sites at once; technically it will not be "parallel", but it can be "multi-threaded". I leveraged this code and it allowed me to open multiple pages at once, which met my testing needs. Potentially this will work for what you are attempting to accomplish. Friendly reminder: you must create one Playwright object per thread for it to work. For more understanding of multi-threaded vs parallel, you can refer to this site here.