How do I use XML namespaces with find/findall in lxml?

Posted on 2024-10-03 05:43:12

I'm trying to parse the content of an OpenOffice ODS spreadsheet. The ODS format is essentially just a zip file containing a number of documents. The content of the spreadsheet is stored in 'content.xml'.

import zipfile
from lxml import etree

zf = zipfile.ZipFile('spreadsheet.ods')
root = etree.parse(zf.open('content.xml'))

The content of the spreadsheet is in a table element:

table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')

We can also go straight for the rows:

rows = root.findall('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table-row')

The individual elements know about the namespaces:

>>> table.nsmap['table']
'urn:oasis:names:tc:opendocument:xmlns:table:1.0'

How do I use the namespaces directly in find/findall?

The obvious solution does not work.

Trying to look up the table by its prefixed name:

>>> root.findall('.//table:table')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1792, in lxml.etree._ElementTree.findall (src/lxml/lxml.etree.c:41770)
  File "lxml.etree.pyx", line 1297, in lxml.etree._Element.findall (src/lxml/lxml.etree.c:37027)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 225, in findall
    return list(iterfind(elem, path))
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 200, in iterfind
    selector = _build_path_iterator(path)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: ':'

爱格式化 2024-10-10 05:43:12

If root.nsmap contains the table namespace prefix, you can use xpath():

root.xpath('.//table:table', namespaces=root.nsmap)

findall(path) accepts the {namespace}name syntax rather than prefix:name. The path should therefore be preprocessed with a namespace dictionary into the {namespace}name form before it is passed to findall().
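
The preprocessing step described above could be sketched roughly as follows. This is only an illustration and not part of lxml: expand_prefixes is a hypothetical helper, its regular expression assumes simple paths that are not already in {namespace}name form and have no colons inside quoted predicate values, and the urn:example:table URI and the inline document are made up for the demo.

```python
import re
from lxml import etree

# Matches prefix:name steps such as "table:table-row" (simplified, hypothetical pattern).
_QNAME = re.compile(r'([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)')

def expand_prefixes(path, nsmap):
    """Rewrite each prefix:name in path as {uri}name, looking prefixes up in nsmap."""
    def to_clark(match):
        prefix, name = match.groups()
        return '{%s}%s' % (nsmap[prefix], name)
    return _QNAME.sub(to_clark, path)

# Made-up document standing in for content.xml.
doc = etree.fromstring(
    '<t:table xmlns:t="urn:example:table"><t:table-row/><t:table-row/></t:table>'
)
path = expand_prefixes('.//t:table-row', doc.nsmap)
rows = doc.findall(path)
print(path, len(rows))
```

An unknown prefix raises KeyError here, which seems preferable to silently returning no matches.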

栀子花开つ 2024-10-10 05:43:12

The first thing to notice is that namespaces are declared at the element level, not the document level.

Most often, though, all namespaces are declared on the document's root element (office:document-content here), which saves us from walking the whole tree to collect inner xmlns scopes.

An element's nsmap then includes:

the default namespace, under the None key (only if one is declared)
all namespaces inherited from its ancestors, unless overridden.

If, as ChrisR mentioned, the default namespace is not supported, you can filter it out compactly with a dict comprehension.

The call syntax differs slightly between xpath() and the ElementPath methods (find/findall): xpath() takes the map as the namespaces keyword argument, while find() and findall() also accept it as the second positional argument.

So here's the code you could use to get all the rows of your first table
(tested with lxml 3.4.2):

import zipfile
from lxml import etree

# Open and parse the document
zf = zipfile.ZipFile('spreadsheet.ods')
tree = etree.parse(zf.open('content.xml'))

# Get the root element
root = tree.getroot()

# get its namespace map, excluding the default namespace (None key);
# items() works on both Python 2 and 3, unlike iteritems()
nsmap = {k: v for k, v in root.nsmap.items() if k}

# use defined prefixes to access elements
table = tree.find('.//table:table', nsmap)
rows = table.findall('table:table-row', nsmap)

# or, if xpath is needed:
table = tree.xpath('//table:table', namespaces=nsmap)[0]
rows = table.xpath('table:table-row', namespaces=nsmap)
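
To see the nsmap behaviour described above without an ODS file at hand, here is a small sketch on an inline document. The element names and the urn:example:* URIs are invented for illustration; only the lxml calls themselves (fromstring, nsmap, find) come from the library.

```python
from lxml import etree

# Made-up document: a default namespace and a prefixed one on the root,
# plus one extra prefix declared only on an inner element.
xml = (
    b'<doc xmlns="urn:example:default" xmlns:t="urn:example:table">'
    b'<t:table><t:row xmlns:x="urn:example:extra"/></t:table>'
    b'</doc>'
)
root = etree.fromstring(xml)

# The root's map holds the default namespace under the None key.
root_map = root.nsmap

# An inner element inherits its ancestors' bindings and adds its own.
row = root.find('.//{urn:example:table}row')
row_map = row.nsmap

# Drop the None key before using the map with prefixed paths.
prefixed = {k: v for k, v in root_map.items() if k}
table = root.find('.//t:table', prefixed)
print(sorted(prefixed), table.tag)
```

Note that the x prefix appears only in row_map, so a map taken from the root alone would miss it.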

心清如水 2024-10-10 05:43:12

Here's a way to get all the namespaces declared anywhere in an XML document (assuming no prefix is bound to two different URIs).

I use this when parsing XML documents where I do not know the namespace URLs in advance, only the prefixes.

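from lxml import etree

doc = etree.XML(XML_string)  # XML_string holds the document source

# Collect every prefix-to-URI binding declared anywhere in the document.
nsmap = {}
for prefix, uri in doc.xpath('//namespace::*'):
    if prefix:  # Skip the None (default) namespace, neither needed nor supported.
        nsmap[prefix] = uri
# 'prefix:element' is a placeholder for the qualified name you are after.
doc.xpath('//prefix:element', namespaces=nsmap)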
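
As a rough, self-contained illustration of why this can be useful, the sketch below runs the same idea on an inline document in which one prefix is declared only on an inner element, so the root's nsmap alone would not contain it. The element names and urn:example:* URIs are made up for the demo.

```python
from lxml import etree

# Made-up document: prefix "b" is declared on an inner element only.
xml = (
    '<doc xmlns:a="urn:example:a">'
    '<a:item><b:note xmlns:b="urn:example:b"/></a:item>'
    '</doc>'
)
doc = etree.XML(xml)

# lxml returns namespace nodes as (prefix, uri) tuples; skip the default one.
nsmap = {prefix: uri for prefix, uri in doc.xpath('//namespace::*') if prefix}

# The collected map resolves the inner prefix as well.
notes = doc.xpath('//b:note', namespaces=nsmap)
print(len(notes), 'b' in doc.nsmap)
```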

若水般的淡然安静女子 2024-10-10 05:43:12

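lxml's etree won't find namespaced elements if the XML contains no xmlns declarations for their prefixes (depending on the lxml version, parsing such input may even fail with an XMLSyntaxError about the undefined prefix). For instance: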

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'

tree = etree.fromstring(xml_doc)

# finds nothing:
tree.find('.//ns:root', {'ns': 'foo'})
tree.find('.//{foo}root', {'ns': 'foo'})
tree.find('.//ns:root')

Sometimes that is the data you are given. So what can you do when no namespace is declared?

My solution: add one by wrapping the fragment in an element that declares it.

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
xml_doc_with_ns = '<ROOT xmlns:ns="foo">%s</ROOT>' % xml_doc

tree = etree.fromstring(xml_doc_with_ns)

# finds what you're looking for:
tree.find('.//{foo}root')
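
The wrapping trick can be combined with a prefix map so that the original prefixed names keep working. The following is a minimal sketch: the urn:example:foo URI is an arbitrary placeholder chosen for the demo (any URI would do), and the ROOT wrapper name simply follows the snippet above.

```python
from lxml import etree

fragment = '<ns:root><ns:child></ns:child></ns:root>'

# Wrap the fragment in an element that binds the missing prefix.
# The URI is an arbitrary placeholder, not something the data defines.
wrapped = '<ROOT xmlns:ns="urn:example:foo">%s</ROOT>' % fragment
tree = etree.fromstring(wrapped)

# Reuse the same binding as a prefix map for find().
ns = {'ns': 'urn:example:foo'}
root_el = tree.find('ns:root', ns)
child_el = tree.find('.//ns:child', ns)
print(root_el.tag, child_el.tag)
```

Keep in mind that the parsed tree now has ROOT as its top-level element, so paths are relative to the wrapper.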
