How do I ignore the infobox when scraping titles from Wikipedia anchor tags?

Posted on 2025-01-29 03:46:04 · 718 characters · 3 views · 0 comments

I am trying to scrape the first 20 links on a Wikipedia page, but I want to ignore the infobox on the right side, which is a 'table' element. Here is what I have so far; any help would be greatly appreciated.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")

all_links = {}
count = 0

IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
               "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]

content = soup.find('div', {'class':'mw-parser-output'})

for link in content.find_all("a"):
    if count <= 20:
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break

Comments (1)

她比我温柔 2025-02-05 03:46:05

One approach is to select your elements more specifically, for example with a CSS selector that uses :not() as a pseudo-class:

soup.select('div.mw-parser-output a:not(.infobox  a)')
Example
import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")

count = 0

# Title prefixes that mark non-article links
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]

# Select every link in the article body except those nested inside the infobox
for link in soup.select('div.mw-parser-output a:not(.infobox a)'):
    if count < 20:  # "count <= 20" would let a 21st title through
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break
Output
Wales (disambiguation)
Welsh language
About this sound
Cymru.ogg
Countries of the United Kingdom
United Kingdom
England
Wales–England border
Severn Estuary
Bristol Channel
Irish Sea
Snowdon
Temperateness
Maritime climate
Cardiff
Welsh people
Celtic Britons
Roman withdrawal from Britain
Celtic nations
Llywelyn ap Gruffudd
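
To see the selector work without depending on the live page, here is a minimal offline sketch. The SAMPLE_HTML snippet and the first_titles helper are hypothetical, made up for illustration; only the selector and the IGNORE prefixes come from the code above. It assumes a BeautifulSoup version whose select() accepts a complex selector inside :not(), as the answer's code does.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the Wikipedia article.
SAMPLE_HTML = """
<div class="mw-parser-output">
  <table class="infobox">
    <tr><td><a href="/wiki/Cardiff" title="Cardiff">Cardiff</a></td></tr>
  </table>
  <p>
    <a href="/wiki/Welsh_language" title="Welsh language">Welsh</a>
    <a href="/wiki/Help:IPA" title="Help:IPA/English">IPA</a>
    <a href="#cite_note-1">[1]</a>
    <a href="/wiki/United_Kingdom" title="United Kingdom">UK</a>
    <a href="/wiki/England" title="England">England</a>
  </p>
</div>
"""

# Same prefixes as above; a tuple lets str.startswith test them all at once.
IGNORE = ("Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " ")


def first_titles(html, limit):
    """Return up to `limit` link titles from the article body, skipping the infobox."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.select("div.mw-parser-output a:not(.infobox a)"):
        if len(titles) >= limit:
            break
        title = link.get("title", "")
        # Skip links without a title and links into non-article namespaces.
        if title and not title.startswith(IGNORE):
            titles.append(title)
    return titles


print(first_titles(SAMPLE_HTML, 2))  # ['Welsh language', 'United Kingdom']
```

The "Cardiff" link never shows up because it sits inside the element with class infobox, and collecting titles in a list makes the limit explicit instead of relying on a separate counter.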