Extracting a list of items from XML in Python

Posted 2024-11-05 23:34:37 · 841 characters · 2 views · 0 comments

In python, what is the best way to extract the list of items from the following xml?

<iq xmlns="jabber:client" to="__anonymous__admin@localhost/8978528613056092673206" 
 from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="[email protected]" name="pgatt (1)"/>
        <item jid="[email protected]" name="pgatt (1)"/>
    </query>
</iq>

I usually use lxml with xpath, but it's not working in this case. I think my problems are due to namespaces. I'm not set on lxml and am open to using any library.

I would like a solution that is robust enough to fail if the general structure of the xml changes.

Comments (3)

少女净妖师 2024-11-12 23:34:37

I'm not sure about lxml but you can use an expression like //*[local-name()="item"] to pull out the item elements regardless of their namespace.
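
A minimal sketch of the same namespace-agnostic idea, using the standard library rather than the `local-name()` XPath function: `xml.etree.ElementTree` does not implement `local-name()`, but on Python 3.8+ its `{*}` wildcard matches a tag in any namespace. The jid and name values below are hypothetical placeholders, not the values from the question.

```python
import xml.etree.ElementTree as ET

# Placeholder jid and name values (hypothetical); the structure mirrors the question's XML.
XML = """<iq xmlns="jabber:client" from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="Room A (1)"/>
        <item jid="room-b@conference.localhost" name="Room B (2)"/>
    </query>
</iq>"""

root = ET.fromstring(XML)

# "{*}item" matches <item> in any namespace (wildcard support needs Python 3.8+);
# ".//" searches all descendants of the root element.
items = root.findall(".//{*}item")

# Collect (jid, name) pairs from the matched elements.
pairs = [(item.get("jid"), item.get("name")) for item in items]
for jid, name in pairs:
    print(jid, name)
```

Matching on the local name alone is convenient, but it will also pick up any `item` element from an unrelated namespace, so it is the less strict of the approaches discussed here.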

You might also want to take a look at Amara for XML processing.

>>> import amara.bindery
>>> doc = amara.bindery.parse(
...     '''<iq xmlns="jabber:client" 
...          to="__anonymous__admin@localhost/8978528613056092673206"
...          from="conference.localhost" id="disco" type="result">
...          <query xmlns="http://jabber.org/protocol/disco#items">
...            <item jid="[email protected]" name="pgatt (1)"/>
...            <item jid="[email protected]" name="pgatt (1)"/>
...          </query>
...        </iq>''')
>>> for item in doc.iq.query.item:
...   print item.jid, item.name
...
[email protected] pgatt (1)
[email protected] pgatt (1)
>>>

Once I discovered Amara, I would never consider processing XML any other way.

蓝天白云 2024-11-12 23:34:37

I answered a similar question earlier about how to parse and search through xml data.

Full text searching XML data with Python: best practices, pros & cons

You'll want to look at the xml2json function.
The function expects a minidom object. This is how I got my xml, not sure how you do it.

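from urllib.request import urlopen  # Python 3; Python 2 used urllib.urlopen
from xml.dom import minidom

# url is the address of the XML document; xml2json comes from the linked answer
x = minidom.parse(urlopen(url))
result = xml2json(x)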

Or if you use a string and not a url:

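x = minidom.parseString(xml_string)  # xml_string holds the XML document as text
result = xml2json(x)  # avoid naming the variable json, which shadows the json module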

The xml2json function will then return a dictionary with all values found in the xml. You may have to try it out and print the output to see what the layout looks like.
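
The linked `xml2json` function is not reproduced here, so the following is not that function. As a rough stand-in, assuming only the `item` attributes are needed, this sketch shows how a minidom document can be turned into a list of dictionaries directly with `getElementsByTagNameNS`. The jid and name values are hypothetical placeholders.

```python
from xml.dom import minidom

# Namespace of the <query> and <item> elements in the question's XML.
DISCO_ITEMS_NS = "http://jabber.org/protocol/disco#items"

# Placeholder jid and name values (hypothetical).
XML = """<iq xmlns="jabber:client" from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="Room A (1)"/>
        <item jid="room-b@conference.localhost" name="Room B (2)"/>
    </query>
</iq>"""

dom = minidom.parseString(XML)

# Look up <item> elements by namespace URI and local name, anywhere in the document.
item_nodes = dom.getElementsByTagNameNS(DISCO_ITEMS_NS, "item")

# Build one plain dictionary per item from its attributes.
items = [
    {"jid": node.getAttribute("jid"), "name": node.getAttribute("name")}
    for node in item_nodes
]
print(items)
```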

枯叶蝶 2024-11-12 23:34:37

I've missed the boat, but here's how you do it while caring about namespaces.

You can either spell them all out in the query, or make yourself a namespace map which you pass to the xpath query.

from lxml import etree

data = """<iq xmlns="jabber:client" to="__anonymous__admin@localhost/8978528613056092673206"
 from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="[email protected]" name="pgatt (1)"/>
        <item jid="[email protected]" name="pgatt (1)"/>
    </query>
</iq>"""

nsmap = {
  'jc': "jabber:client",
  'di':"http://jabber.org/protocol/disco#items"
}

doc = etree.XML(data)

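for item in doc.xpath('//jc:iq/di:query/di:item', namespaces=nsmap):
  # encoding="unicode" makes tostring return str instead of bytes on Python 3
  print(etree.tostring(item, encoding="unicode").strip())
  print("Name: %s\nJabberID: %s\n" % (item.attrib.get('name'), item.attrib.get('jid')))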

Produces:

<item xmlns="http://jabber.org/protocol/disco#items" jid="[email protected]" name="pgatt (1)"/>
Name: pgatt (1)
JabberID: [email protected]

<item xmlns="http://jabber.org/protocol/disco#items" jid="[email protected]" name="pgatt (1)"/>
Name: pgatt (1)
JabberID: [email protected]
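
The other option mentioned above, spelling the namespaces out in the query, can be sketched with the standard library's `xml.etree.ElementTree`, which accepts fully qualified `{namespace}tag` names in its paths. This is an illustrative sketch, not part of the original answer: the helper name `extract_items`, the constants, and the jid and name values are all hypothetical. It raises an error when the expected structure is missing, which is one way to make a change in the document's layout visible.

```python
import xml.etree.ElementTree as ET

# Placeholder jid and name values (hypothetical).
XML = """<iq xmlns="jabber:client" from="conference.localhost" id="disco" type="result">
    <query xmlns="http://jabber.org/protocol/disco#items">
        <item jid="room-a@conference.localhost" name="Room A (1)"/>
        <item jid="room-b@conference.localhost" name="Room B (2)"/>
    </query>
</iq>"""

# Namespaces written out in "{uri}" form, ready to prefix onto tag names.
JC = "{jabber:client}"
DI = "{http://jabber.org/protocol/disco#items}"


def extract_items(xml_text):
    """Return (jid, name) pairs, raising ValueError if the layout is unexpected."""
    root = ET.fromstring(xml_text)
    if root.tag != JC + "iq":
        raise ValueError("unexpected root element: " + root.tag)
    # Only a direct <query> child in the disco#items namespace is accepted.
    query = root.find(DI + "query")
    if query is None:
        raise ValueError("no disco#items <query> child found")
    return [(item.get("jid"), item.get("name")) for item in query.findall(DI + "item")]


pairs = extract_items(XML)
print(pairs)
```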