How can I remove all child tags inside an XML tag with lxml in Python, but keep their text merged into the parent?
How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
Comments (2)
You can use the lxml.html.clean module:
This answer is a bit late, but I guess a simpler solution than the one provided by the initial answer by ars might be handy for safekeeping's sake.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
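The code that belonged here appears to have been lost from the page. A minimal sketch of this step, assuming the input string from the question; the variable name `element` is my own choice, not necessarily the one used in the original answer:

```python
from lxml import etree

# Input string taken from the question.
parent_tag = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)

# etree.fromstring() parses the string and returns the root Element.
element = etree.fromstring(parent_tag)
```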
You can inspect that instance like so:
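One way to do that inspection, sketched here with `etree.tostring()`; the original answer's exact inspection code is not preserved, so treat this as an assumption about what it looked like:

```python
from lxml import etree

parent_tag = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)
element = etree.fromstring(parent_tag)

# encoding="unicode" makes tostring() return a str instead of bytes.
serialized = etree.tostring(element, encoding="unicode")
print(serialized)

# The direct children are still present at this point.
child_tags = [child.tag for child in element]
print(child_tags)
```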
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
Inspection shows that all child tags are gone:
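The strip_tags() call and the follow-up inspection can be sketched together as below. This is a reconstruction under the same assumptions as before (the variable names are mine), not the answer's original listing:

```python
from lxml import etree

parent_tag = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)
element = etree.fromstring(parent_tag)

# "*" matches every element tag, so all descendants are stripped.
# Their text and tail content is merged into the parent, which itself stays.
etree.strip_tags(element, "*")

stripped = etree.tostring(element, encoding="unicode")
print(stripped)
```

Note that strip_tags() works in place and returns nothing, so `element` itself is what changes.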
Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:
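A sketch of that last step, again with assumed variable names. The second half uses `itertext()`, which is not part of the original answer; it is shown only as a possible alternative when the element should not be modified:

```python
from lxml import etree

parent_tag = (
    "<parent>This is some <i>text</i> with multiple "
    "<some_tag>tags</some_tag> and sometimes they "
    "<tt>are<bold> nested</bold></tt>.</parent>"
)

# Approach from the answer: strip in place, then read .text.
element = etree.fromstring(parent_tag)
etree.strip_tags(element, "*")
plain_text = element.text
print(plain_text)

# Alternative (my addition, not from the answer): itertext() walks all
# text and tail content in document order and leaves the tree untouched.
untouched = etree.fromstring(parent_tag)
joined_text = "".join(untouched.itertext())
print(joined_text)
```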