Splitting a large XML file into smaller chunks
I have a large Wikipedia dump that I want to cut into separate files (one file per article). I wrote a VB app to do it for me, but it was quite slow and crashed after a few hours of cutting. I'm currently splitting the file into smaller 50 MB chunks using another app, but that's taking a long time (20-30 minutes per chunk). If I do this, I should then be able to cut each of those chunks up individually.
Does anyone have any suggestions for a faster way to cut this file up?
The easiest way to do this in C# is with an XmlReader. You can use the XmlReader alone for the fastest implementation, or combine it with the new LINQ XNode classes for a good mix of performance and ease of use. See this MSDN article for an example: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx.
You should be able to modify the example so that it only holds the nodes for one document in memory at a time, then writes them back out as a file. It should perform well and work for very large files.
I'm assuming that you're using a DOM parser. For potentially large files you should always use a SAX parser: DOM parsers read the entire file into memory, while SAX parsers read as little as possible at a time and therefore operate far more efficiently. This tutorial describes how to write a C# SAX parser; VB should be very similar.
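To make the DOM-vs-SAX distinction concrete, here is a minimal sketch of the push-style SAX approach in Java (the tutorial above is C#, but the callback structure is the same). It assumes an <article> tag wraps each article, and it just collects each article's text rather than writing files, so only one article is buffered at a time; the class and method names are my own illustration, not from the thread.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class SaxSplitter extends DefaultHandler {
    private StringBuilder current;                     // buffer for the article being read
    private final List<String> articles = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("article".equals(qName)) current = new StringBuilder();  // new article begins
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);      // text inside an article
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("article".equals(qName)) {
            articles.add(current.toString());          // article finished; release the buffer
            current = null;
        }
    }

    // Parse the source and return one string per <article> element.
    public static List<String> collect(Reader in) throws Exception {
        SaxSplitter handler = new SaxSplitter();
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
        return handler.articles;
    }
}
```

Because the parser pushes events to the handler instead of building a tree, memory use stays proportional to one article, not the whole dump.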
If I were doing this in Java, I'd use javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter.
In some sort of pseudocode, let's assume an <article> tag delimits each Wikipedia article, that you don't need to worry about nested <article> tags, and that you have an openNewWriter() function that opens a new XMLEventWriter writing to a new file with a suitable name for this article. Then my code would look something like this:
Now all you need to do is find the streaming XML classes in .NET. I think they're System.Xml.XmlReader and System.Xml.XmlWriter, but my expertise isn't in .NET, and I can't tell from the documentation whether they work quite the same way as the Java version I just gave you.
(My purpose here is more to show you how to approach the problem than to tell you the names of the classes you need.)
You should try vtd-xml for that; we have had people tell us how well it works for splitting large XML files... http://www.codeproject.com/KB/XML/xml_processing_future.aspx
We were also told that DOM takes forever.