lxml iterparse in Python can't handle namespaces
>>> from lxml import etree
>>> import StringIO
>>> data = StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
>>> docs = etree.iterparse(data, tag='a')
>>> a, b = docs.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
  File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration
This works fine until I add the namespace to the root node. Any ideas as to what I can do as a workaround, or what the correct way of doing this is?
I need to be event-driven because the files are very large.
Comments (2)
When there is a namespace attached, the tag isn't 'a', it's '{http://some.random.schema}a'. Pass that fully qualified name as the tag argument to iterparse, in Python 3 and Python 2 alike; each matching element is then reported with an 'end' event.
As @mihail-shcheglov pointed out, a wildcard '*' can also be used in place of the namespace, which works for any or no namespace. See the lxml.etree docs for more.
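A minimal Python 3 sketch of the answer above, reusing the sample document from the question. The helper name texts_for_tag is hypothetical, introduced only for illustration, and the '{*}a' spelling of the wildcard assumes lxml 3.0 or later.

```python
from io import BytesIO

from lxml import etree

NS = "http://some.random.schema"
XML = b'<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>'


def texts_for_tag(xml_bytes, tag):
    """Collect the text of every element that iterparse reports for `tag`.

    Hypothetical helper: iterparse yields (event, element) pairs, and with
    the default events only 'end' events are produced.
    """
    return [elem.text for _event, elem in etree.iterparse(BytesIO(xml_bytes), tag=tag)]


# Fully qualified name: namespace URI in braces, then the local name.
qualified = texts_for_tag(XML, "{%s}a" % NS)

# Wildcard namespace (assumes lxml >= 3.0): matches 'a' in any namespace or none.
wildcard = texts_for_tag(XML, "{*}a")

# Bare local name: only matches elements that have no namespace at all.
bare = texts_for_tag(XML, "a")

print(qualified)
print(wildcard)
print(bare)
```

With the bare tag 'a' the iterator is exhausted without yielding anything, which is why the first call to next() in the question raised StopIteration.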
Why not use a regular expression?
1) Using lxml is slower than using a regex. Result: 0.000150298431784 / 2.40253998762e-05 is 6.25, so lxml is 6.25 times slower than a regex.
2) A regex has no problem when there is a namespace.
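The answer's original code did not survive, so the following is only a sketch of the kind of comparison it describes, not the author's benchmark. The pattern, the function names, and the repeat count are assumptions. The regex also assumes that every <a> element is written without a prefix or attributes, and it needs the whole document in memory, unlike iterparse.

```python
import re
import timeit
from io import BytesIO

from lxml import etree

XML = b'<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>'

# Assumed pattern: matches only plain <a>...</a> elements, no attributes or prefix.
A_TEXT = re.compile(rb"<a>(.*?)</a>", re.DOTALL)


def with_regex(xml_bytes):
    """Extract the text of each <a> element by pattern matching on the raw bytes."""
    return [match.decode() for match in A_TEXT.findall(xml_bytes)]


def with_lxml(xml_bytes):
    """Extract the same text with iterparse, ignoring the namespace via '{*}a'."""
    return [elem.text for _event, elem in etree.iterparse(BytesIO(xml_bytes), tag="{*}a")]


regex_result = with_regex(XML)
lxml_result = with_lxml(XML)

# Time 1000 runs of each; absolute numbers depend on the machine and lxml version.
regex_time = timeit.timeit(lambda: with_regex(XML), number=1000)
lxml_time = timeit.timeit(lambda: with_lxml(XML), number=1000)

print(regex_result, lxml_result)
print("lxml / regex time ratio: %.2f" % (lxml_time / regex_time))
```

The regex never looks at the xmlns declaration on the root, so adding or removing the namespace does not change what it finds; the measured ratio will vary with document size and environment.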