How do I get all strings from all nested tags of an XML tag using Python's lxml.etree library?
I have an XML file in which the following can occur:
...
<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>
...
Edit: Let's assume the tags can be nested more than one level deep, meaning
<a><b><c>...</c>...</b>...</a>
I came up with this using Python's lxml.etree library.
from lxml import etree

context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("end",))
for event, element in context:
    tag = element.tag
    if tag == "a":
        print(element.text)  # is empty :/
        mystring = element.xpath("string()")
        ...
But somehow it goes wrong.
What I want is the whole string
"This is some text about some issue I have, parsing xml"
But I only get an empty string. Any suggestions? Thanks!
Comments (1)
This question has been asked many times.
You can use the text_content() method of lxml.html elements.
REF: Filter out HTML tags and resolve entities in python
Or use the lxml.etree.strip_tags() function.
REF: In lxml, how do I remove a tag but retain all contents?
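To make the two suggestions concrete, here is a minimal sketch applied to the sample markup from the question. It assumes the fragment is parsed from an in-memory string rather than through iterparse, and the variable names (XML, html_root, xml_root, text_from_html, text_from_strip) are made up for illustration.

```python
from lxml import etree
import lxml.html

# Sample markup taken from the question.
XML = "<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>"

# Option 1: parse with lxml.html and call text_content() on the element.
# Note that this runs the fragment through the HTML parser, not the XML parser.
html_root = lxml.html.fromstring(XML)
text_from_html = html_root.text_content()

# Option 2: parse as XML and strip the nested tags in place.
# strip_tags() removes the named elements but merges their text and tail
# into the parent, so the text ends up in xml_root.text.
xml_root = etree.fromstring(XML)
etree.strip_tags(xml_root, "b", "c")
text_from_strip = xml_root.text

print(text_from_html)
print(text_from_strip)
```

Both variables should end up holding the full sentence the question asks for. Keep in mind that strip_tags() modifies the tree, so if the nested elements are still needed afterwards, work on a copy (for example via copy.deepcopy) instead.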