Splitting a large XML file into smaller chunks

Posted 2024-10-14 20:22:20, 192 words, 5 views, 0 comments

I have a large Wikipedia dump that I want to cut into separate files (one file per article). I wrote a VB app to do it for me, but it was quite slow and crashed after a few hours of cutting. I'm currently splitting the file into smaller 50 MB chunks using another app, but that's taking a long time (20-30 minutes per chunk). Once that is done, I should be able to cut up each of these chunks individually.

Does anyone have any suggestions of a way to cut this file up quicker?

Comments (4)

毁梦 2024-10-21 20:22:20

The easiest way to do this with C# is with an XmlReader. You can stay with the XmlReader alone for the fastest implementation or combine with the new LINQ XNode classes for a decent combination of performance and ease of use. See this MSDN article for an example: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx.

You should be able to modify the example to only hold the node for one document in memory at a time and then write it back out as a file. It should perform well and work for very large files.

扮仙女 2024-10-21 20:22:20

I'm assuming that you're using a DOM parser. For potentially large files you should always use a SAX parser instead: DOM parsers read the entire file into memory, while SAX parsers read as little as possible at a time and therefore operate much more efficiently. This tutorial describes how to write a C# SAX parser; VB should be very similar.
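To make the DOM versus SAX difference concrete, here is a minimal sketch in Java (the language of the code elsewhere in this thread) that counts elements with a SAX handler. The class name `SaxArticleCounter`, the method `countElements`, and the `<article>` tag are assumptions for illustration only. The parser calls the handler once per tag as it reads, so no tree is ever built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts elements by name while streaming the input.
public class SaxArticleCounter {

    /** Counts elements called elementName without building a document tree. */
    public static int countElements(InputStream in, String elementName) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName holds the tag name.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for a dump; a real run would pass a FileInputStream.
        String xml = "<dump><article>a</article><article>b</article></dump>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "article"));
    }
}
```

Memory use here depends on the size of the largest single callback payload, not on the size of the whole file, which is why this style tends to cope with multi-gigabyte dumps where a DOM load does not.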

紫瑟鸿黎 2024-10-21 20:22:20

If I were doing this in Java, I'd use javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter.

In some sort of pseudocode, let's assume an <article> tag delimits each wikipedia article, that you don't need to worry about nested <article> tags, and you have an openNewWriter() function to open a new XMLEventWriter that writes to a new file with a suitable name for this article.

Then my code would look something like this:

XMLEventReader r = // an XMLEventReader for the original wikipedia dump

XMLEventWriter w = null;

boolean isInsideArticle = false;

while (r.hasNext()){
  XMLEvent e = r.nextEvent();

  if (e.isStartElement() &&
        e.asStartElement().getName().getLocalPart().equals("article")){
     w = openNewWriter();
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     w.add(e);
     isInsideArticle = true;
  } else if (e.isEndElement() &&
           e.asEndElement().getName().getLocalPart().equals("article")) {
     w.add(e);
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     isInsideArticle = false;
     w.close();
  } else if (isInsideArticle) {
     w.add(e);
  } else {
     // this tag gets dropped on the floor because it's not inside any article
  }
}

Now all you need to do is find the streaming XML classes in .NET. I think they're System.Xml.XmlReader and System.Xml.XmlWriter, but my expertise isn't in .NET, and I can't tell from the documentation whether they'll work quite the same way as the Java version I just gave you.

(My purpose here is more to show you how to approach the problem than to tell you the names of the classes you need.)
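For anyone who wants to run the approach above, here is one way the pseudocode could be filled in. It is a sketch under the same assumptions (each article sits in an `<article>` element and those elements are never nested). The class name `ArticleSplitter`, the `split` method, and the `article-N.xml` file naming are made up for illustration and stand in for `openNewWriter()`.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical splitter: one output file per elementName subtree.
public class ArticleSplitter {

    /** Streams the input and writes each elementName subtree to outDir/article-N.xml. */
    public static int split(InputStream in, Path outDir, String elementName) throws Exception {
        Files.createDirectories(outDir);
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventWriter writer = null;   // non-null only while inside an article
        OutputStream out = null;
        int count = 0;
        try {
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (writer == null && e.isStartElement()
                        && e.asStartElement().getName().getLocalPart().equals(elementName)) {
                    // Start of an article: open the next output file.
                    count++;
                    out = Files.newOutputStream(outDir.resolve("article-" + count + ".xml"));
                    writer = outputFactory.createXMLEventWriter(out, "UTF-8");
                    writer.add(e);
                } else if (writer != null && e.isEndElement()
                        && e.asEndElement().getName().getLocalPart().equals(elementName)) {
                    // End of the article: finish and close this file.
                    writer.add(e);
                    writer.flush();
                    writer.close();   // does not close the underlying stream
                    out.close();
                    writer = null;
                } else if (writer != null) {
                    writer.add(e);    // content inside the current article
                }
                // Events outside any article are dropped.
            }
        } finally {
            reader.close();
            if (out != null) {
                out.close();          // harmless if already closed
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ArticleSplitter <dump.xml> <output-dir>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(split(in, Paths.get(args[1]), "article"));
        }
    }
}
```

Only one article's events are held at a time, so memory stays roughly constant regardless of the dump size. The output files contain just the article element, with no XML declaration or surrounding wrapper; adding those would mean synthesizing extra events as the comments in the pseudocode suggest.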

梦在深巷 2024-10-21 20:22:20

You should try vtd-xml for that; we have had people tell us how well it works for splitting large XML files: http://www.codeproject.com/KB/XML/xml_processing_future.aspx
We were also told that DOM takes forever.
