BeautifulSoup with multiple URLs and different templates

Posted 2025-02-06 02:55:28


I want to scrape multiple URLs with 2 different HTML templates. I can scrape each HTML by itself without issue, but I ran into a problem when trying to combine the two scrapers. Below is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'
page_url_lst = {'url': [page_url1, page_url2], 'template': [1,2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author':e.previous.get_text(strip=True)[:-1],
                'title':e.get_text(strip=True),
                'journal':e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        for a in soup_2.find_all('span',{'class':'fac_citation'}):
            data.append({
                'author':a.find('b').get_text(),
                'title':a.find('i').get_text(strip=True),
                'journal':a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })

The logic here is: if the column 'template' has a value of 1, extract the data using the first template; otherwise, extract the data using the second template. However, this code returns this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
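(The error comes from the `if page_url_df['template'] == 1` line. A minimal reproduction, independent of the scraping code, of what pandas does there:)

```python
import pandas as pd

df = pd.DataFrame({"template": [1, 2]})

# Comparing a whole column to a scalar yields a boolean Series,
# one value per row, not a single True/False:
mask = df["template"] == 1
print(mask.tolist())  # [True, False]

# Using that Series directly in `if` is ambiguous (all rows? any row?),
# so pandas raises ValueError rather than guess.
caught = None
try:
    if mask:
        pass
except ValueError as err:
    caught = str(err)
print(caught)
```

This is why the check has to happen per row (as both answers below do), or be collapsed with `.any()`/`.all()` when a whole-column condition really is intended.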

Thank you in advance!

Comments (2)

懒的傷心 2025-02-13 02:55:33


If I understand you correctly, you want to create a new DataFrame based on page_url_df:

import requests
import pandas as pd
from bs4 import BeautifulSoup


page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
page_url2 = (
    "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
)
page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
page_url_df = pd.DataFrame(page_url_lst)


def get_template_1(url):
    data = []
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for e in soup.select("#tabs-publications em"):
        data.append(
            {
                "author": e.previous.get_text(strip=True)[:-1],
                "title": e.get_text(strip=True),
                "journal": e.next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


def get_template_2(url):
    data = []
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for a in soup.find_all("span", {"class": "fac_citation"}):
        data.append(
            {
                "author": a.find("b").get_text(),
                "title": a.find("i").get_text(strip=True),
                "journal": a.find("i").next_sibling.get_text(strip=True),
                "source": url,
            }
        )
    return data


all_data = []
for _, row in page_url_df.iterrows():
    print("Getting", row["url"])
    if row["template"] == 1:
        all_data.extend(get_template_1(row["url"]))
    elif row["template"] == 2:
        all_data.extend(get_template_2(row["url"]))


df_out = pd.DataFrame(all_data)

# print sample data
print(df_out.head().to_markdown())

Prints:

|    | author | title | journal | source |
|---:|:-------|:------|:--------|:-------|
| 0 | Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill | Inflammation: A Proposed Intermediary Between Maternal Stress and Offspring Neuropsychiatric Risk. [PMID30314641] | Biological psychiatry 85(2): 97-106, Jan 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 1 | Sierra Isabel, Anguera Montserrat C | Enjoy the silence: X-chromosome inactivation diversity in somatic cells. [PMID31108425] | Current opinion in genetics & development 55: 26-31, May 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 2 | Syrett Camille M, Anguera Montserrat C | When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity. [PMID31125996] | Journal of leukocyte biology May 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 3 | Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohamed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia Jorge | The long noncoding RNA regulates CD8 T cells in response to viral infection. [PMID31138702] | Proceedings of the National Academy of Sciences of the United States of America 116(24): 11916-11925, Jun 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
| 4 | Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C | Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248] | JCI insight 4(7), Apr 2019. | https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory |
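As a variation on the if/elif dispatch above, the template number can also index a dict of parser functions, which scales better if more templates are added later. A sketch with hypothetical stub parsers (`parse_template_1`/`parse_template_2` stand in for the real scraping functions, and the example URLs are placeholders):

```python
import pandas as pd

# Stub parsers: each takes a URL and returns a list of row dicts,
# mirroring the signature of the real get_template_* functions.
def parse_template_1(url):
    return [{"source": url, "template": 1}]

def parse_template_2(url):
    return [{"source": url, "template": 2}]

# Map template ids to parser functions instead of chained if/elif.
PARSERS = {1: parse_template_1, 2: parse_template_2}

page_url_df = pd.DataFrame(
    {"url": ["https://example.com/a", "https://example.com/b"],
     "template": [1, 2]}
)

all_data = []
for row in page_url_df.itertuples():
    # Each row is a namedtuple, so row.template is a plain int here,
    # and comparing or indexing with it is unambiguous.
    all_data.extend(PARSERS[row.template](row.url))

print(all_data)
```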
流年里的时光 2025-02-13 02:55:32


You need an iterable in an outer loop. One way is to build a list of tuples from your existing DataFrame columns and loop over that. You can then have your conditional logic, simplified, inside the loop.

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
page_url2 = "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []

with requests.Session() as s:
    for template, url in zip(
        page_url_df["template"].to_list(), page_url_df["url"].to_list()
    ):
        r = s.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        if template == 1:
            for e in soup.select("#tabs-publications em"):
                data.append(
                    {
                        "author": e.previous.get_text(strip=True)[:-1],
                        "title": e.get_text(strip=True),
                        "journal": e.next_sibling.get_text(strip=True),
                        "source": url,
                    }
                )
        else:
            for a in soup.find_all("span", {"class": "fac_citation"}):
                data.append(
                    {
                        "author": a.find("b").get_text(),
                        "title": a.find("i").get_text(strip=True),
                        "journal": a.find("i").next_sibling.get_text(strip=True),
                        "source": url,
                    }
                )
print(data)
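If a DataFrame is wanted at the end instead of the raw list, the list of dicts converts directly, just as in the other answer. A sketch with dummy rows standing in for real scraped citations:

```python
import pandas as pd

# `data` as produced by the scraping loop: a flat list of dicts, one per
# citation. Dummy values here in place of the real scraped fields.
data = [
    {"author": "A. Author", "title": "First title",
     "journal": "Journal 1", "source": "url1"},
    {"author": "B. Author", "title": "Second title",
     "journal": "Journal 2", "source": "url2"},
]

# Each dict becomes one row; keys become the column names.
df_out = pd.DataFrame(data)
print(df_out.shape)          # (2, 4)
print(list(df_out.columns))  # ['author', 'title', 'journal', 'source']
```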