将大型 XML 文件分割成小块

发布于 2024-10-14 20:22:20 字数 192 浏览 5 评论 0原文

我有一个大型维基百科转储，我想将其切成不同的文件（每篇文章 1 个文件）。我写了一个 VB 应用程序来帮我做这件事，但它非常慢，并且在几个小时的剪辑后就崩溃了。我目前使用另一个应用程序将文件分割成更小的 50mb 块，但这需要很长时间（每个块 20-30 分钟）。如果我这样做，我应该能够将其中的每一个单独切割。

有人有什么建议可以更快地剪切这个文件吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

毁梦 2024-10-21 20:22:20

使用 C# 执行此操作的最简单方法是使用 XmlReader。您可以单独使用 XmlReader 以获得最快的实现速度，也可以与新的 LINQ XNode 类结合使用以获得性能和易用性的良好组合。有关示例，请参阅此 MSDN 文章： http:// /msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx。

您应该能够修改该示例，使其一次仅在内存中保存一个文档的节点，然后将其作为文件写回。它应该表现良好并且适用于非常大的文件。

回复收藏 0 原文

扮仙女 2024-10-21 20:22:20

我假设您正在使用 DOM 解析器。对于可能很大的文件，您应该始终使用 SAX 解析器。 DOM 解析器将整个文件读入内存，SAX 解析器一次读取尽可能少的内容，因此操作效率更高。本教程介绍如何编写 C# SAX 解析器， VB应该很相似。

回复收藏 0 原文

紫瑟鸿黎 2024-10-21 20:22:20

如果我在 Java 中执行此操作，我会使用 javax.xml.stream.XMLEventReader 和 javax.xml.stream.XMLEventWriter。

在某种伪代码中，我们假设

标记分隔每个维基百科文章，您无需担心嵌套的

标记，并且您有一个 openNewWriter() 函数来打开一个新的 XMLEventWriter，它会写入一个具有适合本文名称的新文件。

那么我的代码将如下所示：

XMLEventReader r = // an XMLEventReader for the original wikipedia dump

XMLEventWriter w = null;

bool isInsideArticle = false;

while (r.hasNext()){
  XMLEvent e = r.nextEvent();

  if (e.isStartElement() &&
        e.asStartElement().getName().getLocalPart().equals("article")){
     w = openNewWriter();
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     w.add(e);
     isInsideArticle = true;
  } else if (e.isEndElement() &&
           e.asEndElement().getName().getLocalPart().equals("article")) {
     w.add(e);
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     isInsideArticle = false;
     w.close();
  } else if (isInsideArticle) {
     w.add(e);
  } else {
     // this tag gets dropped on the floor because it's not inside any article
  }
}

现在您需要做的就是在 .NET 中找到流式 XML 类。我认为它们是 system.xml.XMLReader 和system.xml.XMLWriter，但我的专业知识是' t 在 .NET 中，我无法从文档中判断它们的工作方式是否与我刚刚提供给您的 Java 版本完全相同。

（我在这里的目的更多的是向您展示如何解决问题，而不是告诉您所需的类的名称。）

If I were doing this in Java, I'd use javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter.

In some sort of pseudocode, let's assume an <article> tag delimits each wikipedia article, that you don't need to worry about nested <article> tags, and you have an openNewWriter() function to open a new XMLEventWriter that writes to a new file with a suitable name for this article.

Then my code would look like something like this:

XMLEventReader r = // an XMLEventReader for the original wikipedia dump

XMLEventWriter w = null;

bool isInsideArticle = false;

while (r.hasNext()){
  XMLEvent e = r.nextEvent();

  if (e.isStartElement() &&
        e.asStartElement().getName().getLocalPart().equals("article")){
     w = openNewWriter();
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     w.add(e);
     isInsideArticle = true;
  } else if (e.isEndElement() &&
           e.asEndElement().getName().getLocalPart().equals("article")) {
     w.add(e);
     // write the stuff that belongs outside the <article> tag
     // by synthesizing XMLEvents and using w.add() to add them
     isInsideArticle = false;
     w.close();
  } else if (isInsideArticle) {
     w.add(e);
  } else {
     // this tag gets dropped on the floor because it's not inside any article
  }
}

Now all you need to do is find the streaming XML classes in .NET. I think they're system.xml.XMLReader and system.xml.XMLWriter, but my expertise isn't in .NET, and I can't tell from the documentation whether they'll work quite same way as the Java version I just gave you.

(My purpose here is more to show you how to approach the problem than to tell you the names of the classes you need.)

回复收藏 0 原文