BeautifulSoup with multiple URLs that use different templates
I want to scrape multiple URLs that use two different HTML templates. I can scrape each page by itself without issue, but I ran into a problem when trying to combine the two scrapers. Below is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'
page_url_lst = {'url': [page_url1, page_url2], 'template': [1,2]}
page_url_df = pd.DataFrame(page_url_lst)
data = []
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author':e.previous.get_text(strip=True)[:-1],
                'title':e.get_text(strip=True),
                'journal':e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        for a in soup_2.find_all('span',{'class':'fac_citation'}):
            data.append({
                'author':a.find('b').get_text(),
                'title':a.find('i').get_text(strip=True),
                'journal':a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })
The logic is: if the 'template' column has a value of 1, extract the data using the first template; otherwise, extract it using the second template. However, this code raises the following error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thank you in advance!
Comments (2)
If I understand you correctly, you want to create a new dataframe based on page_url_df.
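The code that accompanied this answer is not shown here. A minimal sketch of the idea, assuming one parser function per template and a per-row lookup on the 'template' value, might look like the following. The names parse_template_1, parse_template_2, PARSERS, scrape and the fetch parameter are hypothetical, and 'html.parser' is swapped in for the question's 'lxml' so that no extra parser has to be installed. The comparison in the question fails because page_url_df['template'] == 1 produces a whole boolean Series, which `if` cannot reduce to a single True or False; looking at one row's value at a time avoids that.

```python
import pandas as pd
from bs4 import BeautifulSoup
from bs4.element import Tag


def _text(node):
    """Stripped text of a neighbouring node, or '' when there is none."""
    if node is None:
        return ""
    if isinstance(node, Tag):
        return node.get_text(strip=True)
    return str(node).strip()  # plain text node


def parse_template_1(html, url):
    """Selectors taken from the question's first scraper."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for em in soup.select("#tabs-publications em"):
        rows.append({
            # text just before <em>, minus its trailing punctuation mark
            "author": _text(em.previous_element)[:-1],
            "title": em.get_text(strip=True),
            "journal": _text(em.next_sibling),
            "source": url,
        })
    return rows


def parse_template_2(html, url):
    """Selectors taken from the question's second scraper."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for span in soup.find_all("span", {"class": "fac_citation"}):
        title_tag = span.find("i")
        rows.append({
            "author": span.find("b").get_text(),
            "title": title_tag.get_text(strip=True),
            "journal": _text(title_tag.next_sibling),
            "source": url,
        })
    return rows


# Hypothetical mapping from the 'template' column value to a parser.
PARSERS = {1: parse_template_1, 2: parse_template_2}


def scrape(page_url_df, fetch):
    """Build a new dataframe from page_url_df, one row at a time.

    fetch is an assumed callable that takes a URL and returns its HTML text.
    """
    data = []
    for row in page_url_df.itertuples(index=False):
        parser = PARSERS[int(row.template)]  # single value, not a Series
        data.extend(parser(fetch(row.url), row.url))
    return pd.DataFrame(data, columns=["author", "title", "journal", "source"])
```

With live pages, fetch could be something like `lambda url: requests.get(url).text`; scrape(page_url_df, fetch) then returns a dataframe with the columns author, title, journal and source.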
You need an iterable in an outer loop. One way is to generate a list of tuples from your existing dataframe columns and loop over that. You can then keep your conditional logic, in simplified form, inside the loop.
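The approach described above can be sketched as follows. This is only an illustration under assumptions: scrape_pairs, fetch, parse_first and parse_second are hypothetical names, and the two parser callables stand in for the question's two extraction loops. Because each template value taken from a tuple is a single number, template == 1 is an ordinary boolean and the "truth value of a Series is ambiguous" error does not arise.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Parser = Callable[[str, str], List[Record]]


def scrape_pairs(
    pairs: Iterable[Tuple[str, int]],
    fetch: Callable[[str], str],
    parse_first: Parser,
    parse_second: Parser,
) -> List[Record]:
    """Loop over (url, template) tuples and pick a parser for each one."""
    data: List[Record] = []
    for url, template in pairs:
        html = fetch(url)  # e.g. requests.get(url).text in the question
        if template == 1:  # scalar comparison, so a plain True/False
            data.extend(parse_first(html, url))
        else:
            data.extend(parse_second(html, url))
    return data


# With the question's dataframe, the tuple list would be built as:
#   pairs = list(zip(page_url_df['url'], page_url_df['template']))
```

Each parser receives the page's HTML and its URL and returns a list of dicts, so the combined list can be passed to pd.DataFrame afterwards if a dataframe is wanted.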