Playwright (Python): scraping a list of URLs in parallel

Posted on 2025-02-04 04:40:37

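I have a list of URLs. While iterating over the list, each URL is opened in a new page using context.new_page().goto(url). I want to open multiple pages and scrape them in parallel. Note: each new page opens a new tab with the specified URL.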
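For reference, the one-at-a-time loop described above could be sketched roughly as follows. This is only an illustration: `scrape_sequentially` and the `extract` callback are hypothetical names that do not appear in the post, and `context` is assumed to be a Playwright `BrowserContext` created with the sync API.

```python
from typing import Any, Callable, Iterable, List


def scrape_sequentially(
    context: Any,
    urls: Iterable[str],
    extract: Callable[[Any], Any],
) -> List[Any]:
    """Open each URL in its own tab, one after another, and collect results.

    `context` is assumed to be a sync-API BrowserContext; `extract` is a
    hypothetical callback that reads whatever is needed from the page.
    """
    results = []
    for url in urls:
        page = context.new_page()  # each new page is a new tab
        try:
            page.goto(url)
            results.append(extract(page))
        finally:
            page.close()  # close the tab even if extraction fails
    return results
```

Everything here runs in the thread that created `context`, which is why this version works but does not overlap the page loads.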

I tried the following code snippet using joblib.

from joblib import Parallel, delayed

def fetch_info(profile, _page):
    # CSS selectors for the fields on a profile page
    name_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > div.flex.items-center.mt2 > h2"
    gender_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p.mt1.mb0.line-height-4"
    age_path = "body > div.profile.sm-col-10.lg-col-9.mt1.content-max-width.mx-auto.clearfix > div.flex.flex-wrap.relative.mb2 > div.profile-text.ps1 > p:nth-child(3)"
    image_path = "#big-photo > img"

    # Navigate to the profile URL, as described above
    _page.goto(profile)

    name = _page.locator(name_path).inner_html()
    gender = _page.locator(gender_path).inner_html()
    age = _page.locator(age_path).inner_html()
    image = _page.locator(image_path).get_attribute("src")

    def clean_data(text):
        # Trim surrounding whitespace and drop embedded newlines
        return text.strip().replace("\n", "")

    name, gender, age, image = clean_data(name), clean_data(gender), clean_data(age), clean_data(image)
    gender = gender.split("/")[0].strip()  # keep the part before "/"
    age = age.split('•')[0].strip()        # keep the part before "•"
    user = {
        "name": name,
        "gender": gender,
        "age": age,
        "image_url": image
    }
    return user

# `context` (the Playwright browser context) and `user_profiles` (the URL list) are defined elsewhere.
# Pass `profile` to the worker; the original passed an undefined name `i`.
results = Parallel(n_jobs=4)(delayed(fetch_info)(profile, context.new_page()) for profile in user_profiles)
print(results)

Below is the error I get:

greenlet.error: cannot switch to a different thread
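
The error most likely arises because Playwright's sync API is built on greenlets and is not thread-safe: pages created in the main thread cannot be driven from joblib's worker threads. One commonly suggested alternative is Playwright's async API, where several tabs share a single thread and event loop. The following is a minimal sketch under that assumption, not a definitive fix. `gather_limited` and `scrape_profiles` are hypothetical names, the `div.profile-text h2` selector is a shortened stand-in for the full selector in the question, and Chromium is an arbitrary browser choice.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_limited(
    urls: Iterable[str],
    worker: Callable[[str], Awaitable[T]],
    limit: int = 4,
) -> List[T]:
    """Run worker(url) for every URL with at most `limit` running at once.

    Results are returned in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> T:
        async with semaphore:  # wait here while `limit` workers are busy
            return await worker(url)

    return list(await asyncio.gather(*(guarded(url) for url in urls)))


async def scrape_profiles(urls: List[str], limit: int = 4) -> List[dict]:
    # Imported lazily so gather_limited can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()

        async def fetch_info(url: str) -> dict:
            page = await context.new_page()  # one tab per URL
            try:
                await page.goto(url)
                # Shortened, assumed selectors; substitute the real ones.
                name = await page.locator("div.profile-text h2").inner_html()
                image = await page.locator("#big-photo > img").get_attribute("src")
                return {"name": name.strip(), "image_url": image}
            finally:
                await page.close()

        try:
            return await gather_limited(urls, fetch_info, limit)
        finally:
            await browser.close()
```

With the list from the question, this could be called as `results = asyncio.run(scrape_profiles(user_profiles))`. The semaphore plays the role of `n_jobs=4`: it caps how many tabs are open at once while everything stays on one thread, so no greenlet switch across threads is needed.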
