BeautifulSoup: strip specified attributes, but preserve the tag and its contents
I'm trying to "defrontpagify" the HTML of a website generated by MS FrontPage, and I'm writing a BeautifulSoup script to do it.
However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains it. The code snippet:
    REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                         'dir','face','size','color','style','class','width','height','hspace',
                         'border','valign','align','background','bgcolor','text','link','vlink',
                         'alink','cellpadding','cellspacing']
    # remove all attributes in REMOVE_ATTRIBUTES from all tags,
    # but preserve the tag and its content.
    for attribute in REMOVE_ATTRIBUTES:
        for tag in soup.findAll(attribute=True):
            del(tag[attribute])
It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll('style'=True)), it works.
Does anyone know what the problem is here?
PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.
Comments (5)
Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags that have an attribute named `attribute`, because the variable is not expanded. This is why hard-coding the attribute name worked [0], and why the loop runs without error but matches no tags.
To fix the problem, pass the attribute you are looking for as a `dict`, for example `soup.find_all(attrs={attribute: True})`.
Hth someone in the future,
dtk
[0]: Although it needs to be `find_all(style=True)` in your example, without the quotes, because of `SyntaxError: keyword can't be an expression`.
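
The dict-based lookup described in this answer can be sketched as follows. This is a minimal illustration and not the answerer's original snippet, which did not survive on this page. The sample HTML, the shortened attribute list, and the `result` variable are assumptions made for the demo; `find_all(attrs=...)` is the standard Beautiful Soup 4 call.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical list for the demo; the question uses a longer one.
REMOVE_ATTRIBUTES = ["style", "width", "bgcolor"]

# Hypothetical sample markup.
html = '<p style="color:red" id="keep">Hello <b width="3">world</b></p>'
soup = BeautifulSoup(html, "html.parser")

for attribute in REMOVE_ATTRIBUTES:
    # attrs={attribute: True} uses the *value* of the variable as the
    # attribute name, unlike the keyword form find_all(attribute=True).
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

result = str(soup)
print(result)
```

Attributes that are not in the list, such as `id` here, are left alone, and the tags and their text stay in place.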
I use this one:
or
Thanks to https://stackoverflow.com/a/22497855/1907997
I use this method to remove a list of attributes, very compact:
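
The commenter's snippet is missing from this page, so what follows is only a guess at what a compact version might look like, not their code. It assumes a hypothetical helper named `strip_attributes` and made-up sample markup. It relies on `Tag.attrs` being a plain dictionary, so `pop(name, None)` silently skips attributes that a tag does not have.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical list for the demo.
REMOVE_ATTRIBUTES = ["style", "align", "bgcolor"]


def strip_attributes(soup, names):
    """Remove every attribute in `names` from every tag, in place."""
    for tag in soup.find_all(True):  # True matches every tag
        for name in names:
            tag.attrs.pop(name, None)  # no KeyError if the attribute is absent
    return soup


# Hypothetical sample markup.
soup = BeautifulSoup(
    '<div align="left" bgcolor="#fff"><a href="/x" style="a:b">link</a></div>',
    "html.parser",
)
compact_result = str(strip_attributes(soup, REMOVE_ATTRIBUTES))
print(compact_result)
```

Because the loop visits each tag once instead of searching the tree once per attribute, it also avoids the keyword-argument pitfall from the question.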
The line `for tag in soup.findAll(attribute=True):` does not find any tags. There might be a way to use `findAll`, I'm not sure. However, this works (as of beautifulsoup 4.8.1):
This is previous code that may have worked with an older version of beautifulsoup:
Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.
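
The code blocks from this answer were lost when the page was extracted. As a stand-in, here is a hedged Python 3 sketch of the same idea: it first shows that the keyword form matches nothing, then rebuilds each tag's attribute dictionary in a single pass. The sample markup and the names `matched` and `cleaned` are assumptions for illustration, and the loop may differ from what the answer originally posted.

```python
from bs4 import BeautifulSoup

# A set gives fast membership tests; shortened, hypothetical list for the demo.
REMOVE_ATTRIBUTES = {"style", "face", "size"}

# Hypothetical sample markup.
html = '<font face="Arial" size="2"><span style="x:y" title="t">text</span></font>'
soup = BeautifulSoup(html, "html.parser")

attribute = "style"
# This searches for an attribute literally named "attribute", so nothing matches.
matched = soup.find_all(attribute=True)

# Visit every tag once and keep only the attributes that are not listed.
for tag in soup.find_all(True):
    tag.attrs = {
        key: value
        for key, value in tag.attrs.items()
        if key not in REMOVE_ATTRIBUTES
    }

cleaned = str(soup)
print(cleaned)
```

The dict comprehension is close to the map/filter style the question asks about: each tag's attributes are filtered against the removal set rather than deleted one at a time.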
Here's a Python 2 version of unutbu's answer: