BeautifulSoup - How to exclude an element that falls under a certain tag?
I'm new to web scraping. I want to exclude an img element that falls under a 'p' tag. Here is my code:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://chhouk-krohom.com/%E1%9E%91%E1%9E%B8%E1%9E%83%E1%9E%93%E1%9E%B7%E1%9E%80%E1%9E%B6%E1%9E%99%E1%9F%A1%E1%9F%A4/'
response = requests.get(url)
soup = bs(response.content, 'html.parser')
contents = soup.find_all(['h1', 'p'])
for content in contents:
    print(content)
content = soup.prettify()
with open('sutta.html', 'wt', encoding='utf-8') as file:
    file.write(str(content))
I want to get all the text from the 'h1' and 'p' tags (there is only one 'h1', but many 'p'). The problem is with 'p': for some reason, an image falls under one of the 'p' tags (path: p.a.img). Since I want to output the file as HTML, the image (which belongs to the go-top button) gets in the way. Is there a way to exclude that img in this case? Thanks in advance.
Comments (1)
There are different ways to reach your goal. The simplest is to use .get_text(), because it returns only the human-readable text and not the HTML. Another approach is to .decompose() the <img> elements: select all the <img> tags concerned, iterate over the ResultSet, and decompose each one. After that you can operate on the cleaned soup.
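The two approaches described above could be sketched roughly as follows. This is only an illustration, not the answerer's original code: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real markup may differ), and the helper names text_only and html_without_images are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<h1>Title</h1>
<p>First paragraph.</p>
<p><a href="#top"><img src="go-top.png" alt="top"></a></p>
<p>Second paragraph.</p>
"""


def text_only(html):
    """Approach 1: keep only the human-readable text of each h1/p tag."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "p"])]
    # Tags that held nothing but an image yield an empty string; drop them.
    return [text for text in texts if text]


def html_without_images(html):
    """Approach 2: remove every <img> from the tree, then keep the h1/p markup."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        img.decompose()  # deletes the tag from the soup in place
    return [str(tag) for tag in soup.find_all(["h1", "p"])]


if __name__ == "__main__":
    print(text_only(SAMPLE_HTML))
    print("\n".join(html_without_images(SAMPLE_HTML)))
```

With the real page, the same functions could presumably be called on response.content from the question, and the strings returned by html_without_images joined and written to sutta.html in place of soup.prettify(). Note that the second approach leaves the now-empty <a> wrapper inside its 'p' tag; only the image itself is removed.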