Python lxml iterparse with nested elements
I would like to retrieve the content of a specific element in an XML file. However, that element contains other XML elements, which breaks the extraction of the text inside the parent tag. An example:
from io import BytesIO
from lxml import etree

xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''
# iterparse reads bytes, so the string is encoded first
context = etree.iterparse(BytesIO(xml.encode('utf-8')), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)
which results in:
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None
However, the text 'A protective uniform for use ..', for example, is missing. It seems that every 'claim-text' element that contains other elements is skipped. How should I change the parsing of the XML in order to fetch all claims?
Thanks
I've just solved it with an 'ordinary' SAX parser approach:
class SimpleXMLHandler(object):
    def __init__(self):
        self.buffer = ''
        self.claim = 0  # nesting depth of open claim-text elements

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''  # a new top-level claim starts: reset the buffer
            self.claim += 1

    def data(self, data):
        if self.claim > 0:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            self.claim -= 1
            if self.claim == 0:
                print(self.buffer)  # whole claim, nested parts included

    def close(self):
        pass
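For readers wondering how such a handler gets called: the following is a minimal sketch, assuming the handler is meant to be used through lxml's parser target interface (etree.XMLParser(target=...)). The class name ClaimCollector and the shortened sample XML are hypothetical, made up for illustration; the class collects claims into a list instead of printing them so the result can be inspected.

```python
from lxml import etree


class ClaimCollector(object):
    """Hypothetical parser target: gathers the text of each top-level claim-text."""

    def __init__(self):
        self.depth = 0      # how many claim-text elements are currently open
        self.buffer = []    # text pieces of the claim being read
        self.claims = []    # finished top-level claims

    def start(self, tag, attrib):
        if tag == 'claim-text':
            if self.depth == 0:
                self.buffer = []
            self.depth += 1

    def data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.buffer))

    def close(self):
        # Whatever close() returns is what the parse call hands back.
        return self.claims


# Shortened sample document (an assumption, not the original data).
xml = (b"<test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. panels</claim-text></claim-text></test>")

parser = etree.XMLParser(target=ClaimCollector())
claims = etree.fromstring(xml, parser)
print(claims)
```

With a target parser no tree is built, so memory use stays low on large files, at the cost of tracking the nesting state by hand.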
You could use an XPath to find and concatenate all the text nodes directly under each <claim-text> node.
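The code from this answer is not preserved here, so the following is only a sketch of the idea, not the answerer's exact code. It assumes the XPath expression text(), which selects the text nodes that are direct children of the context element, and it uses a shortened, made-up sample document.

```python
from lxml import etree

# Shortened sample document (an assumption, not the original data).
xml = (b"<test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. panels</claim-text></claim-text></test>")

root = etree.fromstring(xml)

results = []
for claim in root.xpath('//claim-text'):
    # text() yields only the text nodes sitting directly under this element,
    # including text that follows a child element (its "tail").
    direct_text = ''.join(claim.xpath('text()')).strip()
    results.append(direct_text)
    print(direct_text)
```

Note that the "2" inside <b> is not a direct text node of the outer claim, so it is left out; claim.xpath('string()') or ''.join(claim.itertext()) would instead return the text of all descendants. The same call also works on the elements yielded by iterparse, since each element is complete by the time its 'end' event fires.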