在lxml中,如何删除标签但保留所有内容?

发布于 2024-10-12 10:39:44 字数 1159 浏览 6 评论 0原文

问题是这样的:我有一个像这样的 XML 片段:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

对于结果,我想删除所有 - 和 - 标签,但是保留它们的(文本)内容和子节点。此外,-元素应保持不变。结果应该是这样的

<fragment>text1 inner<d>1</d> text2 <b>inner2</b> text3</fragment>

暂时,我将恢复到一个非常肮脏的技巧:我将 etree.tostring 片段,通过正则表达式删除有问题的标签,并用 etree.fromstring 结果替换原始片段(不是真正的代码,但应该是这样的):

from lxml import etree
fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
fstring = etree.tostring(fragment)
fstring = fstring.replace("<a>","")
fstring = fstring.replace("</a>","")
fstring = fstring.replace("<c>","")
fstring = fstring.replace("</c>","")
fragment = etree.fromstring(fstring)

我知道我可能可以使用 xslt 来实现这一点,并且我知道 lxml 可以使用 xslt,但是必须有更多的 lxml 本机方法吗?

仅供参考:我尝试使用 lxml 的 element.replace 到达那里,但由于我想在之前有元素节点的位置插入文本,所以我认为我无法做到这一点。

The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

For the result, I want to remove all <a>- and <c>-Tags, but retain their (text)-contents, and childnodes just as they are. Also, the <b>-Element should be left untouched. The result should then look thus

<fragment>text1 inner<d>1</d> text2 <b>inner2</b> text3</fragment>

For the time being, I'll revert to a very dirty trick: I'll etree.tostring the fragment, remove the offending tags via regex, and replace the original fragment with the etree.fromstring result of this (not the real code, but should go something like this):

from lxml import etree
fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
fstring = etree.tostring(fragment)
fstring = fstring.replace("<a>","")
fstring = fstring.replace("</a>","")
fstring = fstring.replace("<c>","")
fstring = fstring.replace("</c>","")
fragment = etree.fromstring(fstring)

I know that I can probably use xslt to achieve this, and I know that lxml can make use of xslt, but there has to be a more lxml native approach?

For reference: I've tried getting there with lxml's element.replace, but since I want to insert text where there was an element node before, I don't think I can do that.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

旧梦荧光笔 2024-10-19 10:39:44

试试这个:http://lxml.de/api/lxml.etree-module。 html#strip_tags

>>> etree.strip_tags(fragment,'a','c')
>>> etree.tostring(fragment)
'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'

Try this: http://lxml.de/api/lxml.etree-module.html#strip_tags

>>> etree.strip_tags(fragment,'a','c')
>>> etree.tostring(fragment)
'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'
你列表最软的妹 2024-10-19 10:39:44

使用lxml的Cleaner功能从html内容中删除标签。
下面是一个执行您想要的操作的示例。对于 HTML 文档,Cleaner 是比使用 strip_elements 更好的通用解决方案,因为在这种情况下,您想要删除的不仅仅是标签;您还想删除其他标签上的 onclick=function() 属性等内容。

import lxml
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.remove_tags = ['p']
remove_tags:

要删除的标签列表。只有标签将被删除,它们的内容将被提取到父标签中。

Use Cleaner function of lxml to remove tags from html content.
Below is an example to do what you want. For an HTML document, Cleaner is a better general solution to the problem than using strip_elements, because in cases like this you want to strip out more than just the tag; you also want to get rid of things like onclick=function() attributes on other tags.

import lxml
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.remove_tags = ['p']
remove_tags:

A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文