How do I ignore the infobox when scraping titles from Wikipedia anchor tags?
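I am trying to scrape the first 20 links on a Wikipedia page, but I want to ignore the infobox on the right side. It is inside a 'table' tag. Here is what I have so far; any help would be greatly appreciated.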
import requests
from bs4 import BeautifulSoup  # missing from the original snippet

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")
all_links = {}
count = 0
# Title prefixes that mark non-article links
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]
content = soup.find('div', {'class': 'mw-parser-output'})
for link in content.find_all("a"):
    # "count < 20" stops after 20 links; "count <= 20" would print 21
    if count < 20:
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break
Comments (1)
Select your elements more specifically. One approach could be to use CSS selectors with :not() as a pseudo-class.
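The answer's own example did not survive on this page, so the following is only a minimal sketch of what the :not() approach might look like, not the answerer's actual code. It assumes the infobox is a table that sits directly under the div.mw-parser-output container; the helper name first_titles and the SAMPLE_HTML snippet are hypothetical and exist only to make the sketch runnable offline.

```python
from bs4 import BeautifulSoup

# Prefixes of non-article link titles, taken from the question's IGNORE list.
IGNORE = ("Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " ")


def first_titles(html, limit=20):
    """Return up to `limit` link titles, skipping links inside top-level tables.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Walk only the direct children of the content div that are NOT tables
    # (assumption: the infobox is such a table), then collect titled anchors.
    selector = "div.mw-parser-output > :not(table) a[title]"
    titles = []
    for link in soup.select(selector):
        if len(titles) >= limit:
            break
        title = link["title"]
        # str.startswith accepts a tuple of prefixes.
        if title and not title.startswith(IGNORE):
            titles.append(title)
    return titles


# Tiny stand-in for a Wikipedia page: one infobox table, one paragraph.
SAMPLE_HTML = """
<div class="mw-parser-output">
  <table class="infobox">
    <tr><td><a href="/wiki/Cardiff" title="Cardiff">Cardiff</a></td></tr>
  </table>
  <p>
    <a href="/wiki/Country" title="Country">country</a>
    <a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>
    <a href="/wiki/Help:IPA/English" title="Help:IPA/English">IPA</a>
  </p>
</div>
"""

print(first_titles(SAMPLE_HTML))
```

For the live page, the same function could be called as first_titles(requests.get("https://en.wikipedia.org/wiki/Wales").text). Note that this selector skips every table that is a direct child of the content div, not just the infobox, and it would not exclude a table nested deeper inside another element.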