How do I remove all child tags inside an XML tag with lxml in Python, while keeping their text merged into the parent?

Posted on 2024-11-18 20:16:04

How can one tell etree.strip_tags() to strip all possible tags from a given tag element?

Do I have to map them myself, like:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Perhaps a more elegant approach I don't know of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired Output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

or even better:

This is some text with multiple tags and sometimes they are nested.

Comments (2)

淡忘如思 2024-11-25 20:16:04

You can use the lxml.html.clean module (since lxml 5.2 it is provided by the separate lxml_html_clean package):

import lxml.html, lxml.html.clean


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
吖咩 2024-11-25 20:16:04

This answer is a bit late, but a simpler solution than the one in ars's original answer seems worth recording.

Short Answer

Use the "*" argument when you call strip_tags() to specify all tags to be stripped.

Long Answer

Given your XML string, we can create an lxml Element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect that instance like so:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Inspection shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

This is your desired output. Note that strip_tags() modifies the lxml Element instance in place! To make it even better (as you asked :-)), just read the text attribute:

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'