python 中的 lxml iterparse 无法处理命名空间

发布于 2024-11-28 17:30:30 字数 656 浏览 2 评论 0原文

from lxml import etree
import StringIO

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
docs = etree.iterparse(data,tag='a')
a,b = docs.next()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
  File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration

工作正常,直到我将命名空间添加到根节点。关于我可以做些什么来解决这个问题,或者正确的方法,有什么想法吗? 由于文件非常大,我需要事件驱动。

from lxml import etree
import StringIO

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
docs = etree.iterparse(data,tag='a')
a,b = docs.next()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
  File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration

Works fine untill I add the namespace to the root node. Any ideas as to what I can do as a work around, or the correct way of doing this?
I need to be event driven due to very large files.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

梦纸 2024-12-05 17:30:30

当附加命名空间时,标签不是 a,而是 {http://some.random.schema}a。试试这个(Python 3):

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print(f'{event}: {elem}')

或者,在Python 2中:

from lxml import etree
from StringIO import StringIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = StringIO(xml)
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print event, elem

这会打印类似的内容:

end: <Element {http://some.random.schema}a at 0x10941e730>
end: <Element {http://some.random.schema}a at 0x10941e8c0>
end: <Element {http://some.random.schema}a at 0x10941e960>

正如@mihail-shcheglov指出的那样,也可以使用通配符*,它适用于任何或没有命名空间:

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{*}a')
for event, elem in docs:
    print(f'{event}: {elem}')

请参阅lxml.etree 文档 了解更多信息。

When there is a namespace attached, the tag isn't a, it's {http://some.random.schema}a. Try this (Python 3):

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print(f'{event}: {elem}')

or, in Python 2:

from lxml import etree
from StringIO import StringIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = StringIO(xml)
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print event, elem

This prints something like:

end: <Element {http://some.random.schema}a at 0x10941e730>
end: <Element {http://some.random.schema}a at 0x10941e8c0>
end: <Element {http://some.random.schema}a at 0x10941e960>

As @mihail-shcheglov pointed out, a wildcard * can also be used, which works for any or no namespace:

from lxml import etree
from io import BytesIO

xml = '''\
<root xmlns="http://some.random.schema">
  <a>One</a>
  <a>Two</a>
  <a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{*}a')
for event, elem in docs:
    print(f'{event}: {elem}')

See lxml.etree docs for more.

于我来说 2024-12-05 17:30:30

为什么不使用正则表达式呢?

1)

使用 lxml 比使用正则表达式慢。

from time import clock
import StringIO



from lxml import etree

times1 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    docs = etree.iterparse(data,tag='a')
    tf = clock()
    times1.append(tf-te)
print min(times1)

print [etree.tostring(y) for x,y in docs]




import re

regx = re.compile('<a>[\s\S]*?</a>')

times2 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    li = regx.findall(data.read())
    tf = clock()
    times2.append(tf-te)
print min(times2)

print li

结果

0.000150298431784
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
2.40253998762e-05
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']

0.000150298431784 / 2.40253998762e-05 为 6.25
lxml 比 regex 慢 6.25 倍

2)

如果命名空间:

import StringIO
import re

regx = re.compile('<a>[\s\S]*?</a>')

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
print regx.findall(data.read())

结果没有问题

['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']

Why not with a regular expression ?

1)

Using lxml is slower than using a regex.

from time import clock
import StringIO



from lxml import etree

times1 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    docs = etree.iterparse(data,tag='a')
    tf = clock()
    times1.append(tf-te)
print min(times1)

print [etree.tostring(y) for x,y in docs]




import re

regx = re.compile('<a>[\s\S]*?</a>')

times2 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    li = regx.findall(data.read())
    tf = clock()
    times2.append(tf-te)
print min(times2)

print li

result

0.000150298431784
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
2.40253998762e-05
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']

0.000150298431784 / 2.40253998762e-05 is 6.25
lxml is 6.25 times slower than a regex

.

2)

No problem if namespace:

import StringIO
import re

regx = re.compile('<a>[\s\S]*?</a>')

data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
print regx.findall(data.read())

result

['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文