What is the most efficient way to extract information from a large number of XML files in Python?
I have a directory full of XML files (~10^3 to 10^4 of them) from which I need to extract the contents of several fields.
I've tested different XML parsers, and since I don't need to validate the contents (which is expensive), I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files one by one and extract the data.
- Is there a more efficient way? (simple text matching doesn't work)
- Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
- Any caveats?
Thanks!
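For reference, here is a minimal sketch of the expat approach described in the question; the field names and the directory path are hypothetical. Note that an expat parser cannot be reused once it has finished a document, so creating one per file is the normal pattern anyway:

```python
import glob
import xml.parsers.expat

WANTED = {"title", "author"}  # hypothetical field names

def extract_fields(path):
    """Collect the text content of the wanted elements from one file."""
    results = {}
    current = None
    buf = []

    def start(name, attrs):
        nonlocal current
        if name in WANTED:
            current = name
            del buf[:]

    def end(name):
        nonlocal current
        if name == current:
            results[name] = "".join(buf)
            current = None

    def chars(data):
        if current is not None:
            buf.append(data)

    # A fresh parser per file: expat parsers may not be reused after
    # the final parse of a document has been performed.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    with open(path, "rb") as f:
        parser.ParseFile(f)
    return results

for path in glob.glob("data/*.xml"):  # hypothetical directory
    print(path, extract_fields(path))
```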
Usually, I would suggest using ElementTree's iterparse, or for extra speed, its counterpart from lxml. Also try to use the multiprocessing module (built into Python 2.6) to parallelize; a sketch of both follows below.

The important thing about iterparse is that you get the element (sub-)structures as they are parsed. In this case, event will always be the string "end", but you can also initialize the parser to tell you about new elements as soon as they open. You have no guarantee that all child elements will have been parsed at that point, but the attributes are already there, if those are all you are interested in.

Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.

If the files are large (are they?), there is a common idiom to keep memory usage constant, just as in a streaming parser.
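A minimal sketch of that idiom, assuming the interesting elements (here a hypothetical record tag with a name child and an id attribute) are direct children of the root. Grabbing the root from the first "start" event lets us prune already-processed elements, which is what keeps memory flat:

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same API, faster

def extract(path, tag="record"):
    # iterparse yields (event, element) pairs as the document streams by.
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # root element arrives with the first "start"
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # On "end", the element and all its children are complete.
            yield elem.get("id"), elem.findtext("name")
            # Prune everything parsed so far; this is the idiom that
            # keeps memory usage constant on large files.
            root.clear()

for rec_id, name in extract("data/big.xml"):  # hypothetical file
    print(rec_id, name)
```

Since each file is independent, the work also parallelizes naturally, e.g. by mapping a per-file function over the list of paths with multiprocessing.Pool().map.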
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML; depending on your XML, this could actually work.

But the most important thing is this: instead of thinking through several options, just implement them and time them on a small subset. That will take roughly the same amount of time, and it will give you real numbers to drive you forward.
EDIT:
If you know that the XML files are generated by the same algorithm every time, it might be more efficient to do no XML parsing at all. E.g., if you know that the data is in lines 3, 4, and 5, you might read through the file line by line and then use regular expressions, as in the sketch below.

Of course, that approach fails if the files are not machine-generated, originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
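A sketch of that line-by-line approach, under the stated assumption about the file layout (here, hypothetically, a <title> element on a single line near the top of each file):

```python
import re

# Hypothetical layout: every generated file carries <title>...</title>
# complete on one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def grab_title(path):
    """Return the first <title> text found, or None if absent."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = TITLE_RE.search(line)
            if m:
                return m.group(1)  # stop early; no need to read the rest
    return None
```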
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will be created during parsing, so a single parser object does not really count for much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
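A minimal xml.sax sketch of the same field extraction, with hypothetical element names; SAX invokes the handler callbacks as it streams through the document, so no DOM is ever built:

```python
import xml.sax

class FieldHandler(xml.sax.ContentHandler):
    """Collect the text of a few wanted elements (hypothetical names)."""

    WANTED = {"title", "author"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None
        self._buf = []

    def startElement(self, name, attrs):
        if name in self.WANTED:
            self._current = name
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._current:
            self.fields[name] = "".join(self._buf)
            self._current = None

handler = FieldHandler()
xml.sax.parse("data/example.xml", handler)  # hypothetical path
print(handler.fields)
```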