In lxml, how do I remove a tag but retain all of its contents?

Posted 2024-10-12 10:39:44, 1159 characters, 6 views, 0 comments

The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

For the result, I want to remove all <a> and <c> tags but retain their text content and child nodes just as they are. The <b> element should be left untouched. The result should then look like this:

<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>

For the time being, I'll resort to a very dirty trick: I'll serialize the fragment with etree.tostring, remove the offending tags by string replacement, and replace the original fragment with the etree.fromstring result (not the real code, but it should go something like this):

from lxml import etree

fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
# encoding="unicode" returns str instead of bytes, so str.replace works on Python 3
fstring = etree.tostring(fragment, encoding="unicode")
# Crude hack: delete the opening and closing <a> and <c> tags from the serialized text
for tag in ("<a>", "</a>", "<c>", "</c>"):
    fstring = fstring.replace(tag, "")
fragment = etree.fromstring(fstring)

I know that I can probably use XSLT to achieve this, and I know that lxml supports XSLT, but surely there has to be a more lxml-native approach?

For reference: I've tried getting there with lxml's element.replace, but since I want to insert text where there was an element node before, I don't think I can do that.

旧梦荧光笔 2024-10-19 10:39:44

Try etree.strip_tags, which removes the named elements but merges their text and children into the parent: http://lxml.de/api/lxml.etree-module.html#strip_tags

>>> etree.strip_tags(fragment,'a','c')
>>> etree.tostring(fragment, encoding='unicode')
'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'
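
As a self-contained sketch of the same call: the script below uses the fragment from the question, with a hypothetical <d> child added inside <a> (an assumption for illustration, not part of the original input) to check that descendants of a stripped element are kept, as the question asks.

```python
from lxml import etree

# Fragment from the question, plus a hypothetical <d> child inside <a>
# to show that descendants of a stripped element survive.
fragment = etree.fromstring(
    "<fragment>text1 <a>inner<d>1</d> </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)

# strip_tags works in place and returns None: the <a> and <c> elements go away,
# while their text, tail and children are merged into the parent.
etree.strip_tags(fragment, "a", "c")

# encoding="unicode" yields a str rather than bytes on Python 3
result = etree.tostring(fragment, encoding="unicode")
print(result)
# <fragment>text1 inner<d>1</d> text2 <b>inner2</b> text3</fragment>
```

Note that <b> is left alone because it is not in the list of tag names, and that strip_tags also applies to matching elements deeper in the subtree, not only to direct children.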

你列表最软的妹 2024-10-19 10:39:44

Use lxml's Cleaner class to remove tags from HTML content.
Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in cases like this you usually want to strip out more than just the tag; you also want to get rid of things like onclick=function() attributes on other tags.

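# On lxml 5.2 and later, this import requires the separate lxml_html_clean package
from lxml.html.clean import Cleaner

cleaner = Cleaner()
# Tags listed here are dropped, but their content is kept
cleaner.remove_tags = ['p']

From the documentation for remove_tags:

A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.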
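
To see what "pulled up into the parent tag" means without depending on the cleaner package, here is a rough sketch that uses HtmlElement.drop_tag() from lxml.html instead of Cleaner. This is a swapped-in technique, not the Cleaner API itself: it only unwraps the tag and does no sanitizing (onclick attributes and the like stay in place). The sample markup is made up for illustration.

```python
import lxml.html

# Hypothetical sample markup, not taken from the original post
root = lxml.html.fromstring("<div>before <p>Hello <b>world</b></p> after</div>")

# Collect the matches first so the tree is not modified while iterating over it
for el in list(root.iter("p")):
    # drop_tag() removes the element itself but merges its text,
    # children and tail into the parent
    el.drop_tag()

result = lxml.html.tostring(root, encoding="unicode")
print(result)
```

After the loop, the <p> wrapper is gone while "Hello", the <b> child and the trailing text remain inside the <div>.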

About the author: 坏尐絯
