BeautifulSoup: strip specified attributes, but preserve the tag and its contents

Posted 2024-12-29 14:48:40 | 901 characters | 0 views | 0 comments

I'm trying to 'defrontpagify' the html of a MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or a list of attributes) from every tag in the document that contains it. The code snippet:

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone see the problem here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.

Comments (5)

蓝海似她心 2025-01-05 14:48:41

Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword itself is the name of the attribute. So your code is searching for tags that have an attribute literally named attribute, because the variable is not expanded.

This is why

  1. hard-coding your attribute name worked[0]
  2. the code does not fail. The search just doesn't match any tags

To fix the problem, pass the attribute you are looking for as a dict:

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

Hope this helps someone in the future,
dtk

[0]: Although in your example it needs to be find_all(style=True), without the quotes; with the quotes Python raises SyntaxError: keyword can't be an expression
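
To make the difference concrete, here is a minimal sketch comparing the two call styles. It assumes the bs4 package and Python's built-in "html.parser"; the sample HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: only the first <p> carries an align attribute.
html = '<p id="a" align="center" style="color:red">one</p><p id="b">two</p>'
soup = BeautifulSoup(html, "html.parser")

attribute = "align"

# Keyword form: searches for an HTML attribute literally named "attribute".
literal_matches = soup.find_all(attribute=True)

# Dict form: the variable's value ("align") is used as the attribute name.
dict_matches = soup.find_all(attrs={attribute: True})

print(len(literal_matches))                 # 0
print([tag["id"] for tag in dict_matches])  # ['a']
```

The dict form is probably also the safer choice for a list like REMOVE_ATTRIBUTES, since a name such as text can collide with find_all's own parameters when passed as a keyword.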

夜未央樱花落 2025-01-05 14:48:41

I use this one:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://stackoverflow.com/a/22497855/1907997
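
A small variation on the same idea, as a sketch: in bs4 the attrs mapping behaves like a regular dict, so pop with a default should let you skip the membership check. The sample markup below is made up, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
soup = BeautifulSoup('<div align="left" id="box">text</div>', "html.parser")
div = soup.div

# pop() with a default never raises KeyError, so no "in" check is needed.
div.attrs.pop("align", None)
div.attrs.pop("align", None)  # attribute already gone: silently does nothing

print(div)  # <div id="box">text</div>
```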

泪之魂 2025-01-05 14:48:41

I use this method to remove a list of attributes; it is very compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
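
If you would rather not build a throwaway list just for its side effects, the same result can be had in a single pass over the tags. This is only a sketch: strip_attributes is a hypothetical helper name, the sample table is made up, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup


def strip_attributes(soup, names):
    """Delete every attribute listed in names from every tag, in place."""
    unwanted = set(names)
    for tag in soup.find_all(True):  # True matches every tag
        # intersection() returns a new set, so deleting while looping is safe.
        for name in unwanted.intersection(tag.attrs):
            del tag.attrs[name]
    return soup


# Made-up sample markup for illustration.
html = '<table border="1" width="100%" id="t"><tr><td align="left" class="c">x</td></tr></table>'
soup = strip_attributes(BeautifulSoup(html, "html.parser"), ["border", "width", "align"])

print(soup)  # <table id="t"><tr><td class="c">x</td></tr></table>
```

This walks the document once instead of once per attribute name, which may matter on large FrontPage exports.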


高跟鞋的旋律 2025-01-05 14:48:40

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll, I'm not sure.

However, this works (as of beautifulsoup 4.8.1):

import bs4
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = bs4.BeautifulSoup(doc, 'html.parser')  # name the parser explicitly so bs4 does not have to guess
for tag in soup.descendants:
    if isinstance(tag, bs4.element.Tag):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in REMOVE_ATTRIBUTES}
print(soup.prettify())

This is previous code that may have worked with an older version of beautifulsoup:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.

晚雾 2025-01-05 14:48:40

Here's a Python 2 version of unutbu's answer:

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = BeautifulSoup.BeautifulSoup(doc)

for tag in soup.recursiveChildGenerator():
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}