Efficient way of XML parsing in ElementTree(1.3.0) Python

Posted on 2024-12-06 04:00:32

I am trying to parse huge XML files (20MB-3GB). The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).

Here is a small part of a sample file. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
        <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
        <componentList count="4">
            <source order="1">
                <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
            </source>
            <analyzer order="2">
                <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
            </analyzer>
            <analyzer order="3">
                <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
            </analyzer>
            <detector order="4">
                <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
            </detector>
        </componentList>
    </instrumentConfiguration>

Small but complete file is here

So far, I have been using findall for every element of interest.

import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
# Find every instrumentConfiguration element anywhere in the tree
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrument configuration
    print(ins.attrib["id"])

How can I access all children/grandchildren of each instrumentConfiguration element in s?

s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')

Example of what I want

InstrumentConfiguration
-----------------------
Id:QTOF
Parameter1: Q-Tof ultima
source:nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector

Is there an efficient way of parsing element/subelement/subelement hierarchies when a namespace exists? Or do I have to use find/findall every time to access a particular element in a tree with a namespace? This is just a small example; I have to parse more complex element hierarchies.

Any suggestions?

Edit

I didn't get a correct answer, so I had to edit once more!
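One way to avoid repeating the full namespace URI in every call is to pass a prefix map to findall and then walk the descendants with iter(). The following is a minimal Python 3 sketch of that idea, not a definitive solution: the sample is a trimmed copy of the XML above, and the prefix "ms" and the helper name describe are arbitrary choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above, inlined so the sketch is self-contained.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="2">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <detector order="2">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""

# "ms" is an arbitrary prefix chosen for this sketch; only the URI matters.
NSMAP = {"ms": "http://psi.hupo.org/ms/mzml"}


def describe(root):
    """Return (id, [cvParam names]) for each instrumentConfiguration under root."""
    result = []
    for conf in root.findall(".//ms:instrumentConfiguration", NSMAP):
        # iter() visits conf and all of its descendants in document order
        names = [el.get("name") for el in conf.iter("{%s}cvParam" % NSMAP["ms"])]
        result.append((conf.get("id"), names))
    return result


root = ET.fromstring(SAMPLE)
configs = describe(root)
print(configs)
```

This loads the whole tree into memory, so it suits the small files; for the multi-gigabyte ones an incremental parser is the more likely fit.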

灯角 2024-12-13 04:00:32

Here's a script that parses one million <instrumentConfiguration/> elements (967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.

The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.

#!/usr/bin/env python
# Python 2 code: itertools.imap/izip do not exist in Python 3.
import sys
from itertools import imap, islice, izip
from operator  import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    # Stream 'start' events and keep only the element of each (event, elem) pair
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            # Consume the following elements in (component, cvParam) pairs
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2] # tag name without namespace
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

# Assumption: the path to the mzML file is passed as the first argument
print_values(parsexml(sys.argv[1]))

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
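If the order of the children cannot be relied on, the extraction step could look elements up by tag instead of by position. The sketch below is one hedged way to do that in Python 3; the helper names local_name and extract_config are made up for this illustration, and the sample deliberately places <componentList/> before <cvParam/>. It needs a fully parsed element, so in a streaming parser it would be called on an 'end' event rather than a 'start' event.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# Hypothetical sample: children are in a different order than in the question.
SAMPLE = """<instrumentConfiguration xmlns="http://psi.hupo.org/ms/mzml" id="QTOF">
  <componentList count="2">
    <source order="1"><cvParam name="nanoelectrospray"/></source>
    <analyzer order="2"><cvParam name="quadrupole"/></analyzer>
  </componentList>
  <cvParam name="Q-Tof ultima"/>
</instrumentConfiguration>"""


def local_name(tag):
    """Strip the '{namespace}' part from a qualified tag name."""
    return tag.rpartition("}")[2]


def extract_config(conf):
    """Collect (key, value) pairs from one complete instrumentConfiguration."""
    values = [("Id", conf.get("id"))]
    # Direct cvParam children, wherever they sit among the other children
    for i, param in enumerate(conf.findall(NS + "cvParam"), start=1):
        values.append(("Parameter%d" % i, param.get("name")))
    component_list = conf.find(NS + "componentList")
    if component_list is not None:
        for component in component_list:
            for param in component.findall(NS + "cvParam"):
                values.append((local_name(component.tag), param.get("name")))
    return values


values = extract_config(ET.fromstring(SAMPLE))
for key, value in values:
    print("%s: %s" % (key, value))
```

The lookups cost more than consuming events in a fixed order, so this trades some of the speed measured above for tolerance of layout changes.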

On performance

ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.

If you replace root.clear() with elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500MB).

放飞的风筝 2024-12-13 04:00:32

If this is still a current issue, you might try pymzML, a Python interface to mzML files. Website:
http://pymzml.github.com/

叹倦 2024-12-13 04:00:32

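In this case, I would use findall to get all the instrumentList elements. Then, on those results, just access the data as if instrumentList and instrument were arrays; you can get all their elements without having to search for each one.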
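Elements in ElementTree behave like sequences of their children, so indexing, len() and iteration all work on them. The following Python 3 sketch illustrates that on the instrumentConfiguration sample from the question (the instrumentList and instrument elements mentioned above are not in that sample, so it is used as a stand-in); the positions 0 and 1 are assumptions about this particular layout.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# Trimmed copy of the question's sample, inlined so the sketch runs on its own.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam name="Q-Tof ultima"/>
    <componentList count="2">
      <source order="1"><cvParam name="nanoelectrospray"/></source>
      <detector order="2"><cvParam name="microchannel plate detector"/></detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""

root = ET.fromstring(SAMPLE)
conf = root.findall(NS + "instrumentConfiguration")[0]

first_param = conf[0]      # first child: the cvParam (assumed position)
component_list = conf[1]   # second child: the componentList (assumed position)

# Iterating an element yields its direct children in document order
pairs = [(c.tag.rpartition("}")[2], c[0].get("name")) for c in component_list]
print(len(conf), first_param.get("name"), pairs)
```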

赢得她心 2024-12-13 04:00:32

If your files are huge, take a look at the iterparse() function. Be sure to read this article by ElementTree's author, especially the part about "incremental parsing".
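As a rough illustration of incremental parsing with iterparse(), the Python 3 sketch below yields each matching element as soon as it is complete and then clears the root so that already processed data can be freed. The generator name iter_elements and the second id ("LTQ") are hypothetical, and the input is an in-memory buffer only to keep the example self-contained; a file path works the same way.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each complete element named `tag`, freeing parsed data afterwards."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the consumer asks for the next item: drop what was parsed
            root.clear()


# Hypothetical two-element document; "LTQ" is a made-up id.
DATA = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF"/>
  <instrumentConfiguration id="LTQ"/>
</mzML>"""
TAG = "{http://psi.hupo.org/ms/mzml}instrumentConfiguration"

ids = [elem.get("id") for elem in iter_elements(io.BytesIO(DATA), TAG)]
print(ids)
```

Each yielded element has to be used before the next one is requested, because root.clear() empties it when the generator resumes.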

裂开嘴轻声笑有多痛 2024-12-13 04:00:32

I know that this is old, but I ran into this issue while doing XML parsing, where my XML files were really large.

J.F. Sebastian's answer is indeed correct, but the following issue came up.

What I noticed is that the values in elem.text (if you have values inside XML elements rather than in attributes) are sometimes not read correctly (sometimes None is returned) if you iterate over the 'start' events. I had to iterate over the 'end' events like this:

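context = iter(etree.iterparse(filename, events=('start', 'end')))
event, root = next(context) # the first 'start' event carries the root element
for event, elem in context:
    if event == 'end':
        pass # elem.text is fully parsed at this point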

If someone wants to get the text inside an XML tag (and not an attribute), they should probably iterate over the 'end' events rather than 'start'.

However, if all the values are in attributes, then the code in J.F. Sebastian's answer is correct as it stands.

XML example for my case:

<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
</country>
<country>
    <name>Panama</name>
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
</country>
</data>
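The ElementTree documentation notes that at a 'start' event only the attributes are guaranteed to be available, while text and tail are undefined until the element ends. A minimal Python 3 sketch of reading element text on 'end' events follows; it uses a shortened copy of the XML above, and the function name country_names is made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the example data above.
DATA = b"""<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
</country>
</data>"""


def country_names(source):
    """Return (name, rank) for each country, read once the element is complete."""
    found = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            # On 'end', all children and their text have been parsed
            found.append((elem.findtext("name"), int(elem.findtext("rank"))))
            elem.clear()  # release the children that are no longer needed
    return found


names = country_names(io.BytesIO(DATA))
print(names)
```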
