Scraping a subsection of a Wikipedia page

Posted 2025-02-09 23:55:39

I am trying to scrape the links in one subsection of a Wikipedia page using Python. For example, on this page:
https://en.wikipedia.org/wiki/Lists_of_video_games I want to get all the links under the "By genre" section only.

I have tried BeautifulSoup, but I get too much information. I need a way to limit the results to that section.

It would be even better if I could also get the subsection titles, for example all the links under "Action", all the links under "Sports", and so on.

Any help or guidance would be appreciated.

Thanks,

Comments (1)

夏至、离别 2025-02-16 23:55:39

Hopefully the following example gives the output you want. It lists the subsection anchors under "By genre", taken from the page's table of contents.

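import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://en.wikipedia.org/wiki/Lists_of_video_games'
res = requests.get(url)
res.raise_for_status()  # stop early if the request failed

soup = BeautifulSoup(res.content, 'lxml')

# Entries nested under the "By genre" item in the table of contents.
# tocsection-41 was that item's position in the TOC when this was written,
# so the number changes whenever sections are added or removed.
links = soup.select('.toclevel-1.tocsection-41 > ul > li > a')

for link in links:
    # Each href is a fragment such as "#Action"; resolve it against the page URL
    print(urljoin(url, link.get('href')))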

Output:

https://en.wikipedia.org/wiki/Lists_of_video_games#Action
https://en.wikipedia.org/wiki/Lists_of_video_games#Casual_and_puzzle
https://en.wikipedia.org/wiki/Lists_of_video_games#Role-playing
https://en.wikipedia.org/wiki/Lists_of_video_games#Simulation
https://en.wikipedia.org/wiki/Lists_of_video_games#Sports
https://en.wikipedia.org/wiki/Lists_of_video_games#Strategy
https://en.wikipedia.org/wiki/Lists_of_video_games#Other
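
The output above only covers the subsection anchors, while the question also asks for the article links inside each subsection. One possible way to group them is sketched below. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. It assumes the section title is the full text of an h2 heading, the subsections are h3 headings, and article links start with /wiki/. The names SectionLinkParser and links_by_subsection, and the sample HTML with its /wiki/Example_... paths, are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SectionLinkParser(HTMLParser):
    """Collect /wiki/ links under one h2 section, grouped by h3 title."""

    def __init__(self, section_title, base_url):
        super().__init__()
        self.section_title = section_title
        self.base_url = base_url
        self.links = {}           # subsection title -> list of absolute URLs
        self._heading = None      # 'h2' or 'h3' while inside a heading tag
        self._text = []           # text pieces of the heading being read
        self._in_section = False  # True while inside the wanted h2 section
        self._subsection = None   # title of the current h3, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._heading, self._text = tag, []
        elif tag == "a" and self._in_section and self._heading is None:
            href = dict(attrs).get("href") or ""
            # Assumption: article links start with /wiki/; this skips
            # edit links and fragments. Links before the first h3 are ignored.
            if href.startswith("/wiki/") and self._subsection is not None:
                self.links[self._subsection].append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._heading is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._heading:
            return
        title = "".join(self._text).strip()
        if tag == "h2":
            # Every h2 opens a new top-level section; only one is wanted.
            self._in_section = (title == self.section_title)
            self._subsection = None
        elif self._in_section:
            self._subsection = title
            self.links.setdefault(title, [])
        self._heading = None


def links_by_subsection(html, section_title, base_url="https://en.wikipedia.org"):
    parser = SectionLinkParser(section_title, base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<h2>By platform</h2>
<ul><li><a href="/wiki/Example_platform_list">Platform</a></li></ul>
<h2>By genre</h2>
<h3>Action</h3>
<ul><li><a href="/wiki/Example_action_list">Action list</a></li>
<li><a href="/w/index.php?action=edit">edit</a></li></ul>
<h3>Sports</h3>
<ul><li><a href="/wiki/Example_sports_list">Sports list</a></li></ul>
<h2>See also</h2>
<ul><li><a href="/wiki/Example_other">Other</a></li></ul>
"""

result = links_by_subsection(sample_html, "By genre")
for title, urls in result.items():
    print(title, urls)
```

To try this on the live page, the text of the requests.get(url) response could be passed in place of sample_html. Wikipedia's markup changes over time (older pages put an "[edit]" link inside the heading itself, for instance), so the heading match may need adjusting.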