How can I ignore the infobox when scraping link titles from a Wikipedia page?

Posted on 2025-01-29 03:46:04

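I am trying to scrape the first 20 links on a Wikipedia page, but I want to ignore the infobox on the right side. It has a 'table' tag. Here is what I have so far; any help would be greatly appreciated.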

import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")

all_links = {}
count = 0

IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
               "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]

content = soup.find('div', {'class':'mw-parser-output'})

for link in content.find_all("a"):
    if count <= 20:
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break



她比我温柔 2025-02-05 03:46:05

One approach could be to select your elements more specifically, e.g. with CSS selectors and `:not()` as a pseudo-class:

soup.select('div.mw-parser-output a:not(.infobox a)')

Example

import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")

all_links = {}
count = 0

IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
               "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]

for link in soup.select('div.mw-parser-output a:not(.infobox a)'):
    # Stop once 20 titles have been printed.
    if count >= 20:
        break
    url = link.get("title", "")
    # Skip links without a title and titles in an ignored namespace.
    if url and not any(url.startswith(x) for x in IGNORE):
        count += 1
        print(url)

Output

Wales (disambiguation)
Welsh language
About this sound
Cymru.ogg
Countries of the United Kingdom
United Kingdom
England
Wales–England border
Severn Estuary
Bristol Channel
Irish Sea
Snowdon
Temperateness
Maritime climate
Cardiff
Welsh people
Celtic Britons
Roman withdrawal from Britain
Celtic nations
Llywelyn ap Gruffudd
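
To check what the `:not(.infobox a)` selector actually excludes without depending on the live page, here is a minimal offline sketch. The HTML fragment, the shortened `IGNORE` tuple, and the helper name `first_titles` are all made up for illustration; only the selector itself comes from the answer above. It assumes a Beautiful Soup version that ships with Soup Sieve, which is what provides `select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a Wikipedia article body.
HTML = """
<div class="mw-parser-output">
  <table class="infobox">
    <tr><td><a href="/wiki/Cardiff" title="Cardiff">Cardiff</a></td></tr>
  </table>
  <p>
    <a href="/wiki/Welsh_language" title="Welsh language">Welsh</a>
    <a href="/wiki/Help:IPA" title="Help:IPA/English">IPA</a>
    <a href="#cite_note-1">[1]</a>
    <a href="/wiki/England" title="England">England</a>
  </p>
</div>
"""

# Shortened prefix list for the demo; str.startswith accepts a tuple.
IGNORE = ("Wikipedia:", "Help:", "File:")


def first_titles(html, limit):
    """Return up to `limit` link titles found outside any .infobox element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.select("div.mw-parser-output a:not(.infobox a)"):
        if len(titles) >= limit:
            break
        title = link.get("title", "")
        # Skip links without a title and titles in an ignored namespace.
        if title and not title.startswith(IGNORE):
            titles.append(title)
    return titles


print(first_titles(HTML, 20))  # ['Welsh language', 'England']
```

The "Cardiff" link is dropped because it sits inside the element with class `infobox`, the citation anchor is dropped because it has no `title` attribute, and the `Help:` link is dropped by the prefix filter.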

