BeautifulSoup - How to exclude an element nested under a certain tag?

Posted on 2025-02-08 16:19:42 · 778 characters · 3 views · 0 comments

I'm new to web scraping. I want to exclude an img element that is nested under a 'p' tag. Here is my code:

from bs4 import BeautifulSoup as bs
import requests

url = 'https://chhouk-krohom.com/%E1%9E%91%E1%9E%B8%E1%9E%83%E1%9E%93%E1%9E%B7%E1%9E%80%E1%9E%B6%E1%9E%99%E1%9F%A1%E1%9F%A4/'

response = requests.get(url)
soup = bs(response.content, 'html.parser')

contents = soup.find_all(['h1', 'p'])
for content in contents:
    print(content)

content = soup.prettify()
with open('sutta.html', 'wt', encoding='utf-8') as file:
    file.write(str(content))

I want to get all the text from 'h1' and 'p' (there is only one 'h1', but many 'p'). The problem is with 'p': for some reason, an image falls under one of the 'p' tags (path: p.a.img). Since I want to output the file as HTML, the image (which belongs to the go-top button) gets in the way. Is there a way to exclude that img in this case? Thanks in advance.

Comments (1)

眉黛浅 2025-02-15 16:19:42

There are different ways to reach your goal. The simplest is to use .get_text(), because it returns only the human-readable text, not the HTML:

for content in contents:
    print(content.get_text())
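
To try the .get_text() approach without hitting the network, here is a minimal sketch. The inline HTML is a hypothetical stand-in for the scraped page (the real markup may differ), and the filtering of empty strings is an addition, not part of the answer above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real markup may differ.
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p><a class="go-top" href="#top"><img src="arrow.png" alt="top"></a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text nodes, so the <img> never shows up.
texts = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "p"])]

# Drop elements that contained only an image and therefore no text.
texts = [t for t in texts if t]

print(texts)
```

Note that this yields plain text only, so it suits cases where the HTML structure of the output does not matter.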

Another approach is to .decompose() the <img>: select all the <img> elements concerned, iterate over the ResultSet, and decompose each one. After that you can operate on the cleaned soup:

for e in soup.select('.go-top img'):
    e.decompose()

contents = soup.find_all(['h1', 'p'])
for content in contents:
    print(content)
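
The .decompose() approach can be sketched end to end as follows. The inline HTML is again a hypothetical stand-in, and it assumes the go-top link carries a class named go-top, as the selector in the answer implies. Unlike the code in the question, which writes soup.prettify() for the whole page, this sketch serializes only the selected elements:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real markup may differ.
html = """
<h1>Title</h1>
<p>First paragraph.</p>
<p><a class="go-top" href="#top"><img src="arrow.png" alt="top"></a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <img> found inside an element with class "go-top".
for img in soup.select(".go-top img"):
    img.decompose()

# Serialize only the <h1> and <p> elements, not the whole document.
contents = soup.find_all(["h1", "p"])
output = "\n".join(str(tag) for tag in contents)

print(output)
```

The resulting string can be written to sutta.html with the same open(...) call used in the question. The now-empty <a> stays in the output; selecting '.go-top' itself and decomposing it would probably remove the link as well, and a broader selector such as 'p img' would strip every image under any paragraph.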