BeautifulSoup - How to exclude an element that falls under a certain tag?

Posted on 2025-02-08 16:19:42

I'm new to web scraping. I want to exclude an img element that falls under a 'p' tag. Here is my code:

from bs4 import BeautifulSoup as bs
import requests

url = 'https://chhouk-krohom.com/%E1%9E%91%E1%9E%B8%E1%9E%83%E1%9E%93%E1%9E%B7%E1%9E%80%E1%9E%B6%E1%9E%99%E1%9F%A1%E1%9F%A4/'

response = requests.get(url)
soup = bs(response.content, 'html.parser')

contents = soup.find_all(['h1', 'p'])
for content in contents:
    print(content)

content = soup.prettify()
with open('sutta.html', 'wt', encoding='utf-8') as file:
    file.write(str(content))


I want to get all the text from the 'h1' and 'p' elements (there is only one 'h1', but many 'p'). The problem is with 'p': for some reason, an image falls under one of the 'p' tags (path: p.a.img). Since I want to output the file as HTML, the image (which is for the go-top button) gets in the way. Is there a way to exclude that img in this case? Thanks in advance.


眉黛浅 2025-02-15 16:19:42

There are different ways to reach your goal. The simplest is to use .get_text(), because it returns only the human-readable text, not the HTML:

for content in contents:
    print(content.get_text())
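
To see what .get_text() does with a paragraph that only wraps an image, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the scraped page (the real markup may differ), and filtering out empty strings is an extra step that is not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real markup may differ.
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p class="go-top"><a href="#top"><img src="top.png" alt=""></a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text nodes, so the <a> and <img> tags vanish.
texts = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "p"])]

# The image-only paragraph yields an empty string; drop it.
non_empty = [t for t in texts if t]
print(non_empty)
```

Note that this discards all markup, including the 'h1' and 'p' tags themselves, so it suits plain-text output rather than an HTML file.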

Another approach is to .decompose() the <img>: select all the <img> elements concerned, iterate over the ResultSet, and decompose each one. After that you can operate on the cleaned soup:

for e in soup.select('.go-top img'):
    e.decompose()

contents = soup.find_all(['h1', 'p'])
for content in contents:
    print(content)
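
Since the goal in the question is an HTML file, the decompose approach can be combined with serializing only the 'h1' and 'p' elements. The following is a sketch under assumptions: the HTML string is a hypothetical stand-in for the real page, the go-top class is assumed to sit on the paragraph itself, and skipping paragraphs left empty after the removal is an optional extra step.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real markup may differ.
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p class="go-top"><a href="#top"><img src="top.png" alt=""></a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove the unwanted images from the tree, as in the answer above.
for img in soup.select(".go-top img"):
    img.decompose()

# Keep only h1/p elements that still contain text, so the emptied
# go-top paragraph is not written out either (optional extra step).
kept = [tag for tag in soup.find_all(["h1", "p"]) if tag.get_text(strip=True)]

# Serialize just those elements instead of the whole document.
output_html = "\n".join(str(tag) for tag in kept)
print(output_html)
```

The resulting string can then be written to 'sutta.html' with the same open(..., encoding='utf-8') call used in the question, in place of soup.prettify(), which serializes the entire page.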


About the author: 穿透光
