How do I use XML namespaces with find/findall in lxml?
I'm trying to parse content in an OpenOffice ODS spreadsheet. The ods format is essentially just a zipfile with a number of documents. The content of the spreadsheet is stored in 'content.xml'.
import zipfile
from lxml import etree
zf = zipfile.ZipFile('spreadsheet.ods')
root = etree.parse(zf.open('content.xml'))
The content of the spreadsheet is in a table element:
table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')
We can also go straight for the rows:
rows = root.findall('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table-row')
The individual elements know about the namespaces:
>>> table.nsmap['table']
'urn:oasis:names:tc:opendocument:xmlns:table:1.0'
How do I use the namespaces directly in find/findall?
The obvious solution does not work.
Trying to get the rows from the table:
>>> root.findall('.//table:table')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1792, in lxml.etree._ElementTree.findall (src/lxml/lxml.etree.c:41770)
  File "lxml.etree.pyx", line 1297, in lxml.etree._Element.findall (src/lxml/lxml.etree.c:37027)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 225, in findall
    return list(iterfind(elem, path))
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 200, in iterfind
    selector = _build_path_iterator(path)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: ':'
Comments (4)
If root.nsmap contains the table namespace prefix, you can use it to resolve the prefix.
findall(path) accepts the {namespace}name syntax, not namespace:name. Therefore path should be preprocessed with the namespace dictionary into the {namespace}name form before it is passed to findall().
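As a minimal sketch of that preprocessing step, the helper below rewrites each prefix:name step into {namespace}name using a prefix-to-URI dictionary. The name `expand_prefixes` is hypothetical (it is not part of lxml), and the sample document is made up to stand in for content.xml. Where `xpath()` is acceptable, passing the same dictionary as `namespaces=` avoids the rewrite entirely. Recent lxml versions also accept a `namespaces` argument in find/findall, which the version in the question's traceback apparently did not.

```python
import re
from lxml import etree

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Made-up stand-in for content.xml: one table with two rows.
SAMPLE = (
    '<office:document-content xmlns:office="{0}" xmlns:table="{1}">'
    '<table:table><table:table-row/><table:table-row/></table:table>'
    '</office:document-content>'
).format(OFFICE_NS, TABLE_NS)

# Matches prefix:name steps such as table:table-row.
# Assumption: the path uses only the prefix form, never {uri}name already.
_QNAME = re.compile(r'([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)')


def expand_prefixes(path, nsmap):
    """Hypothetical helper: rewrite prefix:name as {uri}name."""
    def repl(match):
        prefix, name = match.groups()
        return '{%s}%s' % (nsmap[prefix], name)
    return _QNAME.sub(repl, path)


root = etree.fromstring(SAMPLE)

# Drop the default namespace (None key), if any; it has no prefix to expand.
nsmap = {k: v for k, v in root.nsmap.items() if k is not None}

path = expand_prefixes('.//table:table-row', nsmap)
rows = root.findall(path)

# xpath() resolves prefixes itself when given the mapping.
rows_via_xpath = root.xpath('.//table:table-row', namespaces=nsmap)
```

An unknown prefix raises `KeyError` in this sketch, which seems preferable to silently searching for the wrong name.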
Maybe the first thing to notice is that namespaces are defined at Element level, not Document level.
Most often though, all namespaces are declared in the document's root element (office:document-content here), which saves us parsing it all to collect inner xmlns scopes.
An element's nsmap then includes a None prefix for the default namespace (not always present).
If, as ChrisR mentioned, the default namespace is not supported, you can use a dict comprehension to filter it out in a more compact expression.
The syntax differs slightly between xpath and ElementPath.
With the filtered map, you can get all rows of the first table through either API (tested with lxml 3.4.2).
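The following is a sketch of that approach, not the answer's original code. It assumes a made-up sample document (with an invented default namespace, `urn:example:default`) in place of content.xml, filters the `None` entry out of the root's nsmap with a dict comprehension, and then fetches the first table's rows once with `xpath()` and once with ElementPath (`find`/`findall`).

```python
from lxml import etree

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Made-up stand-in for content.xml. The default namespace is invented
# so that the root's nsmap contains a None key.
SAMPLE = (
    '<office:document-content xmlns="urn:example:default" '
    'xmlns:office="{0}" xmlns:table="{1}">'
    '<table:table><table:table-row/><table:table-row/></table:table>'
    '<table:table><table:table-row/></table:table>'
    '</office:document-content>'
).format(OFFICE_NS, TABLE_NS)

root = etree.fromstring(SAMPLE)

# The dict comprehension drops the None (default namespace) entry,
# which xpath() does not accept.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

# XPath flavour: xpath() returns a list, so take the first table.
first_table = root.xpath('.//table:table', namespaces=ns)[0]
rows_xpath = first_table.xpath('table:table-row', namespaces=ns)

# ElementPath flavour: find() returns the first match directly.
rows_elementpath = root.find('.//table:table', namespaces=ns).findall(
    'table:table-row', namespaces=ns)
```

Both variants take the same prefix map; the visible difference is that `xpath()` always returns a list, while `find()` returns a single element or `None`.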
Here's a way to get all the namespaces in the XML document (and supposing there's no prefix conflict).
I use this when parsing XML documents where I know only the prefixes in advance, not the namespace URLs.
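One way this could look, as a sketch rather than the answer's original code: lxml's `xpath()` supports the XPath namespace axis and returns each namespace node as a (prefix, URI) tuple, so `//namespace::*` can be folded into a dictionary. The sample document and its `urn:example:*` URIs are made up, and the namespaces are deliberately declared on inner elements, where the root's own nsmap would not see them.

```python
from lxml import etree

# Made-up document: the namespaces are declared below the root element.
SAMPLE = (
    '<root>'
    '<a:item xmlns:a="urn:example:a">'
    '<b:sub xmlns:b="urn:example:b"/>'
    '</a:item>'
    '</root>'
)

doc = etree.fromstring(SAMPLE)

# Collect every namespace in scope anywhere in the document.
# Assumption: no prefix is bound to two different URIs (a later
# binding would overwrite an earlier one here).
nsmap = {}
for prefix, uri in doc.xpath('//namespace::*'):
    if prefix:  # skip the default namespace, which has no prefix
        nsmap[prefix] = uri

# The collected map can then be used with prefixed paths.
subs = doc.xpath('//b:sub', namespaces=nsmap)
```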
Etree won't find namespaced elements if there are no xmlns definitions in the XML file, for instance when elements carry a prefix that the file never declares.
Sometimes that is the data you are given. So, what can you do when there is no namespace?
My solution: add one.
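A minimal sketch of "adding one", under stated assumptions: the fragment, the wrapper element name `ROOT`, and the URI `urn:example:foo` are all made up for illustration. The idea is to wrap the undeclared-prefix fragment in an element that declares the prefix before parsing, then search with the {namespace}name form.

```python
from lxml import etree

# Made-up fragment that uses the prefix "ns" without declaring it.
xml_doc = '<ns:root><ns:child/></ns:root>'

# Wrap the fragment in an element that binds the prefix to a URI.
# Assumption: the fragment has no XML declaration (<?xml ...?>),
# which would not be allowed in the middle of the wrapped string.
wrapped = '<ROOT xmlns:ns="urn:example:foo">%s</ROOT>' % xml_doc

tree = etree.fromstring(wrapped)

# The elements are now in the urn:example:foo namespace.
found = tree.find('.//{urn:example:foo}root')
```

The URI is arbitrary here; what matters is that the same URI is used in the wrapper and in the search path.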