ElementTree (1.3.0), Python: an efficient way of parsing XML
I am trying to parse huge XML files (20 MB to 3 GB). The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).
Below is a small part of one of my files. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
A small but complete file is here.
So far, I have been using findall for every element of interest.
    import xml.etree.ElementTree as ET

    tree = ET.parse('plgs_example.mzML')
    root = tree.getroot()
    NS = "{http://psi.hupo.org/ms/mzml}"

    # Find every instrumentConfiguration element anywhere in the tree
    s = tree.findall('.//' + NS + 'instrumentConfiguration')
    for ins in s:
        # Print the id attribute of each instrument configuration
        print(ins.attrib["id"])
How can I access all children/grandchildren of the instrumentConfiguration element(s) in s?
    s = tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
Example of what I want:
    InstrumentConfiguration
    -----------------------
    Id: QTOF
    Parameter1: Q-Tof ultima
    source: nanoelectrospray
    analyzer: quadrupole
    analyzer: time-of-flight
    detector: microchannel plate detector
Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in a tree with a namespace? This is just a small example; I have to parse a more complex element hierarchy.
Any suggestions?
Edit
I didn't get the correct answer, so I had to edit the question once more.
Comments (5)
Here's a script that parses one million <instrumentConfiguration/> elements (a 967 MB file) in 40 seconds (on my machine) without consuming a large amount of memory.
The throughput is 24 MB/s. The cElementTree page (2005) reports 47 MB/s.
Note: the code is fragile. It assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
On performance
ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.
If you replace root.clear() by elem.clear(), the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as for cElementTree, but it consumes 20 (compared with the root.clear() variant) / 2 (compared with the elem.clear() variant) times as much memory (500 MB).
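A minimal sketch of the root.clear() pattern discussed in the iterparse answer above. This is not that answer's original script: the function name `iter_instrument_configs` and the trimmed-down `SAMPLE` document are made up for illustration, and the only assumption about the data is that each cvParam carries a name attribute, as in the question's sample.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# Hypothetical, shortened stand-in for a real mzML file.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="2">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <detector order="2">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def iter_instrument_configs(source):
    """Yield (id, [cvParam names]) per instrumentConfiguration, incrementally."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "instrumentConfiguration":
            # The element is complete here, so all descendants are available.
            names = [cv.get("name") for cv in elem.iter(NS + "cvParam")]
            yield elem.get("id"), names
            # Drop everything parsed so far so memory does not grow with file size.
            root.clear()


configs = list(iter_instrument_configs(io.BytesIO(SAMPLE)))
for config_id, names in configs:
    print(config_id, names)
```

Swapping `root.clear()` for `elem.clear()` gives the other variant mentioned above: the processed element is emptied, but the emptied elements stay attached to the root.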
If this is still a current issue, you might try pymzML, a Python interface to mzML files. Website:
http://pymzml.github.com/
In this case I would use findall to find all the instrumentList elements. Then, on those results, just access the data as if instrumentList and instrument were arrays: you get all the elements and don't have to search for each of them.
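A short sketch of that idea, applied to the instrumentConfiguration sample from the question rather than to instrumentList. Elements behave like sequences of their children, so they can be indexed and iterated directly. The helpers `local` and `describe` are hypothetical names, and the code assumes the first two children are cvParam and componentList and that each component's first child is a cvParam.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# The question's sample, closed off so that it is a complete document.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def local(tag):
    """Strip the '{namespace}' prefix from a tag name."""
    return tag.split("}", 1)[-1]


def describe(config):
    """Build report lines for one instrumentConfiguration by indexing its children."""
    first_param, component_list = config[0], config[1]  # assumed child order
    lines = ["Id: " + config.get("id"),
             "Parameter1: " + first_param.get("name")]
    for component in component_list:  # iterate children like a list
        lines.append("%s: %s" % (local(component.tag), component[0].get("name")))
    return lines


root = ET.fromstring(SAMPLE)
report = [describe(c) for c in root.findall(NS + "instrumentConfiguration")]
for lines in report:
    print("\n".join(lines))
```

Only one findall call carries the namespace; everything below it is reached by position, which is fast but breaks if the child order differs between files.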
If your files are huge, have a look at the iterparse() function. Be sure to read this article by elementtree's author, especially the part about "incremental parsing".
I know that this is old, but I ran into this issue while doing XML parsing, where my XML files were really large.
J.F. Sebastian's answer is indeed correct, but the following issue came up.
What I noticed is that the values in elem.text (if you have values inside XML tags and not as attributes) are sometimes not read correctly (sometimes None is returned) if you iterate through the 'start' events. I had to iterate through the 'end' events like this.
If someone wants to get the text inside an XML tag (and not an attribute), they should probably iterate through the 'end' events and not 'start'.
However, if all the values are in attributes, then the code in J.F. Sebastian's answer is more correct.
XML example for my case:
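A minimal sketch of reading element text on 'end' events, using a made-up document (not the commenter's actual XML). The names `DOC` and `texts_on_end` are hypothetical. On a 'start' event only the opening tag is guaranteed to have been parsed, so elem.text may not be set yet; on 'end' the element is complete.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document whose values live in element text, not in attributes.
DOC = (b"<samples>"
       b"<sample><name>QTOF run 1</name></sample>"
       b"<sample><name>QTOF run 2</name></sample>"
       b"</samples>")


def texts_on_end(source, tag):
    """Collect the text of every element named `tag`, reading it on 'end' events."""
    texts = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            texts.append(elem.text)  # complete, because the closing tag was seen
            elem.clear()             # free the element once its text is stored
    return texts


names = texts_on_end(io.BytesIO(DOC), "name")
print(names)
```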