Playwright (Python): scraping a list of URLs in parallel

Posted on 2025-02-04 04:40:37

I have a list of URLs. While iterating over the list, I open each URL in a new page using context.new_page().goto(url). I want to open multiple pages and scrape them in parallel.
Note: each new page opens a new tab with the specified URL.

I tried the following code snippet using joblib.

from joblib import Parallel, delayed

def fetch_info(profile, _page):
    # CSS selectors for the fields to extract from a profile page
    name_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > div.flex.items-center.mt2 > h2"
    gender_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p.mt1.mb0.line-height-4"
    age_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p:nth-child(3)"
    image_path = "#big-photo > img"

    name = _page.locator(name_path).inner_html()
    gender = _page.locator(gender_path).inner_html()
    age = _page.locator(age_path).inner_html()
    image = _page.locator(image_path).get_attribute("src")

    def clean_data(text):
        # Trim surrounding whitespace and drop newlines
        return text.strip().replace("\n", "")

    name, gender, age, image = clean_data(name), clean_data(gender), clean_data(age), clean_data(image)
    gender = gender.split("/")[0].strip()
    age = age.split('•')[0].strip()
    user = {
        "name": name,
        "gender": gender,
        "age": age,
        "image_url": image
    }
    return user

# context and user_profiles are defined earlier in the script (not shown)
results = Parallel(n_jobs=4)(delayed(fetch_info)(profile, context.new_page()) for profile in user_profiles)
print(results)

Below is the error I get:

greenlet.error: cannot switch to a different thread

Comments (1)

很快妥协 2025-02-11 04:40:37

I have been trying to resolve the same issue. From my research, Playwright was not designed to work this way in Python, but its synchronous API can be run in a multi-threaded manner rather than a parallel one. That lets you open and "scrape" multiple sites at once, although technically it is "multi-threaded" rather than "parallel". I leveraged this code and it allowed me to open multiple pages at once, per my testing needs. Potentially this will work for what you are attempting to accomplish. Friendly reminder: you must create a Playwright object per thread for this to work. For more on multi-threaded vs. parallel, you can refer to this site here.
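
The "one Playwright object per thread" reminder could be sketched roughly as follows. This is not the code referred to above, only a minimal illustration under some assumptions: the helper names split_round_robin, scrape_chunk and scrape_all are made up for this example, Chromium is an arbitrary browser choice, and only the image selector from the question is used. Each worker thread starts its own sync_playwright() instance and handles its own share of the URLs, so no page or context is ever shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_data(text):
    """Trim whitespace and drop newlines; tolerate a missing attribute (None)."""
    return (text or "").strip().replace("\n", "")


def split_round_robin(items, n):
    """Distribute items over n lists: item i goes to list i % n."""
    return [items[i::n] for i in range(n)]


def scrape_chunk(urls):
    """Scrape a chunk of URLs with a Playwright instance owned by this thread."""
    # Imported here so that importing this module does not require Playwright.
    from playwright.sync_api import sync_playwright

    results = []
    with sync_playwright() as p:          # one Playwright object per thread
        browser = p.chromium.launch()     # assumption: Chromium, default options
        page = browser.new_page()
        for url in urls:
            page.goto(url)
            # Selector taken from the question; adjust for the real page.
            image = page.locator("#big-photo > img").get_attribute("src")
            results.append({"url": url, "image_url": clean_data(image)})
        browser.close()
    return results


def scrape_all(urls, workers=4):
    """Split urls into at most `workers` chunks and scrape each in its own thread."""
    chunks = [chunk for chunk in split_round_robin(urls, workers) if chunk]
    with ThreadPoolExecutor(max_workers=len(chunks) or 1) as pool:
        nested = list(pool.map(scrape_chunk, chunks))
    # Flatten the per-thread result lists into one list.
    return [row for chunk in nested for row in chunk]
```

Calling scrape_all(user_profiles) with the list of profile URLs from the question would return one dictionary per URL. Results come back grouped by chunk, not in the original URL order, which is why each dictionary carries its url.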
