What is the most efficient way to extract information from a large number of XML files in Python?
I have a directory full of XML files (~10^3 to 10^4 of them) from which I need to extract the contents of several fields.
I've tested different XML parsers, and since I don't need to validate the contents (which is expensive), I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files one by one and extract the data.
- Is there a more efficient way? (simple text matching doesn't work)
- Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
- Any caveats?
Thanks!
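For reference, here is a minimal sketch of the expat approach described in the question; the field names and the directory path are hypothetical. Note that an expat parser cannot be reused once it has finished a document, so creating one per file is the normal pattern anyway:

```python
import glob
import xml.parsers.expat

WANTED = {"title", "author"}  # hypothetical field names

def extract_fields(path):
    """Collect the text content of the wanted elements from one file."""
    results = {}
    current = None
    buf = []

    def start(name, attrs):
        nonlocal current
        if name in WANTED:
            current = name
            del buf[:]

    def end(name):
        nonlocal current
        if name == current:
            results[name] = "".join(buf)
            current = None

    def chars(data):
        if current is not None:
            buf.append(data)

    # A fresh parser per file: expat parsers may not be reused after
    # the final parse of a document has been performed.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    with open(path, "rb") as f:
        parser.ParseFile(f)
    return results

for path in glob.glob("data/*.xml"):  # hypothetical directory
    print(path, extract_fields(path))
```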
Usually, I would suggest using ElementTree's iterparse, or for extra speed, its counterpart from lxml. Also try to use the multiprocessing module (built into Python 2.6) to parallelize; a sketch of both follows below.

The important thing about iterparse is that you get the element (sub-)structures as they are parsed. In this case, event will always be the string "end", but you can also initialize the parser to tell you about new elements as soon as they open. You have no guarantee that all child elements will have been parsed at that point, but the attributes are already there, if those are all you are interested in.

Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.

If the files are large (are they?), there is a common idiom to keep memory usage constant, just as in a streaming parser.
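A minimal sketch of that idiom, assuming the interesting elements (here a hypothetical record tag with a name child and an id attribute) are direct children of the root. Grabbing the root from the first "start" event lets us prune already-processed elements, which is what keeps memory flat:

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same API, faster

def extract(path, tag="record"):
    # iterparse yields (event, element) pairs as the document streams by.
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)  # root element arrives with the first "start"
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # On "end", the element and all its children are complete.
            yield elem.get("id"), elem.findtext("name")
            # Prune everything parsed so far; this is the idiom that
            # keeps memory usage constant on large files.
            root.clear()

for rec_id, name in extract("data/big.xml"):  # hypothetical file
    print(rec_id, name)
```

Since each file is independent, the work also parallelizes naturally, e.g. by mapping a per-file function over the list of paths with multiprocessing.Pool().map.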
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML; depending on your XML, this could actually work.

But the most important thing is this: instead of thinking through several options, just implement them and time them on a small subset. That will take roughly the same amount of time, and it will give you real numbers to drive you forward.
EDIT:
If you know that the XML files are generated by the same algorithm every time, it might be more efficient to do no XML parsing at all. E.g., if you know that the data is in lines 3, 4, and 5, you might read through the file line by line and then use regular expressions, as in the sketch below.

Of course, that approach fails if the files are not machine-generated, originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
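A sketch of that line-by-line approach, under the stated assumption about the file layout (here, hypothetically, a <title> element on a single line near the top of each file):

```python
import re

# Hypothetical layout: every generated file carries <title>...</title>
# complete on one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def grab_title(path):
    """Return the first <title> text found, or None if absent."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = TITLE_RE.search(line)
            if m:
                return m.group(1)  # stop early; no need to read the rest
    return None
```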
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will be created during parsing, so a single parser object does not really count for much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
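A minimal xml.sax sketch of the same field extraction, with hypothetical element names; SAX invokes the handler callbacks as it streams through the document, so no DOM is ever built:

```python
import xml.sax

class FieldHandler(xml.sax.ContentHandler):
    """Collect the text of a few wanted elements (hypothetical names)."""

    WANTED = {"title", "author"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None
        self._buf = []

    def startElement(self, name, attrs):
        if name in self.WANTED:
            self._current = name
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._current:
            self.fields[name] = "".join(self._buf)
            self._current = None

handler = FieldHandler()
xml.sax.parse("data/example.xml", handler)  # hypothetical path
print(handler.fields)
```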