A lightweight XML parser that is efficient for large files?

Posted 2024-07-24 00:01:55, 209 characters, 11 views, 0 comments

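I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

Is there any good lightweight SAX parser for C++ that is comparable to TinyXML in footprint? The structure of the XML is very simple; no advanced features such as namespaces and DTDs are needed. Just elements, attributes and CDATA.

I know about Xerces, but its sheer size of over 50 MB gives me the shivers.

Thanks!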

绝情姑娘 2024-07-31 00:01:56

If your XML structures are very simple, you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml.
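
To give a feel for how little a scanner needs for such a restricted XML subset, here is a minimal hand-written sketch in C++17. It is not derived from the W3C flex/bison sources; the names `TokenKind`, `Token` and `scan` are made up for illustration. It also works on an in-memory buffer for brevity, whereas a scanner meant for huge files would read the input in chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Token kinds for a deliberately small XML subset (hypothetical names).
enum class TokenKind { StartTag, EndTag, Text, CData };

struct Token {
    TokenKind kind;
    std::string text;  // tag body (name plus raw attributes) or character data
};

// Splits the input into tokens. Assumes well-formed XML with no comments,
// processing instructions, or DTDs, and no '>' inside attribute values.
// A self-closing tag such as <a/> comes out as a StartTag with text "a/".
std::vector<Token> scan(std::string_view input) {
    constexpr std::string_view cdataOpen = "<![CDATA[";
    constexpr std::string_view cdataClose = "]]>";
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (input.compare(pos, cdataOpen.size(), cdataOpen) == 0) {
            // CDATA section: everything up to "]]>" is literal character data.
            const std::size_t start = pos + cdataOpen.size();
            const std::size_t end = input.find(cdataClose, start);
            if (end == std::string_view::npos) break;  // unterminated section
            tokens.push_back({TokenKind::CData,
                              std::string(input.substr(start, end - start))});
            pos = end + cdataClose.size();
        } else if (input[pos] == '<') {
            const std::size_t end = input.find('>', pos);
            if (end == std::string_view::npos) break;  // truncated tag
            const bool closing = input[pos + 1] == '/';
            const std::size_t start = pos + (closing ? 2 : 1);
            tokens.push_back({closing ? TokenKind::EndTag : TokenKind::StartTag,
                              std::string(input.substr(start, end - start))});
            pos = end + 1;
        } else {
            // Character data runs until the next '<' or the end of input.
            std::size_t end = input.find('<', pos);
            if (end == std::string_view::npos) end = input.size();
            tokens.push_back({TokenKind::Text,
                              std::string(input.substr(pos, end - pos))});
            pos = end;
        }
    }
    return tokens;
}
```

A flex-generated scanner would express the same three cases as pattern rules; the hand-written version just makes the token boundaries explicit.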

黑白记忆 2024-07-31 00:01:56

firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a writer for huge XML files too. It adds about 250 KB to your executable. When used in memory, it has 1/3 the footprint of TinyXML according to one user's report. When used on a huge file, it only holds a small buffer (around 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project as a single cpp and h file.

The easiest way to try it out is with a script in the free firstobject XML editor such as this:

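ParseHugeXmlFile()
{
  CMarkup xml;
  // read-file mode: only a small buffer of the file is held in memory
  xml.Open( "HugeFile.xml", MDF_READFILE );
  // "//record" finds the next record element anywhere in the document
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    // step into the record to read its child elements
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}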

From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.

梦纸 2024-07-31 00:01:56

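You can try https://github.com/thinlizzy/die-xml . It seems to be very small and easy to use.

It is a recently open-sourced C++0x XML SAX parser, and the author is willing to receive feedback.

It parses an input stream and generates events on callbacks compatible with std::function. The stack machine uses finite automata as a backend, and some events (start tag and text nodes) use iterators to minimize buffering, which makes it pretty lightweight.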
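
To illustrate what "events on callbacks compatible with std::function" can look like, here is a small self-contained sketch. It is not die-xml's actual interface, which I have not reproduced here; `Handlers`, `parse` and `countRecords` are hypothetical names. Unlike the iterator-based approach described above, it also buffers each text run in full before reporting it.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical callback set for illustration; this is not die-xml's interface.
struct Handlers {
    std::function<void(const std::string& name, const std::string& rawAttrs)> onStartTag;
    std::function<void(const std::string& name)> onEndTag;
    std::function<void(const std::string& text)> onText;
};

// Reads the stream one character at a time and fires a callback whenever a
// tag or a run of text is complete, so only that one piece is ever buffered.
// Assumes well-formed XML with no comments, CDATA, or '>' inside attributes.
void parse(std::istream& in, const Handlers& h) {
    std::string buf;
    bool inTag = false;
    char c;
    while (in.get(c)) {
        if (!inTag && c == '<') {
            if (!buf.empty() && h.onText) h.onText(buf);
            buf.clear();
            inTag = true;
        } else if (inTag && c == '>') {
            if (!buf.empty() && buf.front() == '/') {
                if (h.onEndTag) h.onEndTag(buf.substr(1));
            } else {
                // A trailing '/' marks a self-closing tag: report start, then end.
                const bool selfClosing = !buf.empty() && buf.back() == '/';
                if (selfClosing) buf.pop_back();
                const std::size_t space = buf.find(' ');
                const std::string name = buf.substr(0, space);
                const std::string attrs =
                    space == std::string::npos ? std::string() : buf.substr(space + 1);
                if (h.onStartTag) h.onStartTag(name, attrs);
                if (selfClosing && h.onEndTag) h.onEndTag(name);
            }
            buf.clear();
            inTag = false;
        } else {
            buf += c;
        }
    }
}

// Usage: count <record> elements without keeping the document in memory.
int countRecords(std::istream& in) {
    int count = 0;
    Handlers h;
    h.onStartTag = [&count](const std::string& name, const std::string&) {
        if (name == "record") ++count;
    };
    parse(in, h);
    return count;
}
```

Because the handlers are plain std::function objects, any lambda, free function or bound member function can be plugged in, and handlers that are left unset are simply skipped.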

永言不败 2024-07-31 00:01:56

If you want small and fast, I'd look at tools that generate a DTD/Schema-specific parser. These work very well for huge documents.

被翻牌 2024-07-31 00:01:56

I highly recommend pugixml

pugixml is a light-weight C++ XML processing library.

"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."

I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.

pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I have been using it since version 0.8. Now it is at 1.7.

The great bonus of this parser is its XPath 1.0 implementation! For any more complex tree queries, XPath is a godsend!

The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.

It is a small, fast parser. It is a good choice even for an iOS or Android app if you do not mind linking C++ code.

Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html

A few examples (x86):

pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup, and
2.7 times faster than expat or libxml.

On x64, pugixml is the fastest parser I know of.

Also check how much memory your XML parser uses. Some parsers just gobble precious memory!

擦肩而过的背影 2024-07-31 00:01:55

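If you're using C, you can use LibXML from the Gnome project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, you can use libxml++, which is a C++ OO wrapper around LibXML.

The library has been proven again and again, performs well, and can be compiled on almost any platform you can find.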

淡看悲欢离合 2024-07-31 00:01:55

I like Expat
http://expat.sourceforge.net/

It is C based, but there are several C++ wrappers around to help.

感受沵的脚步 2024-07-31 00:01:55

RapidXML is quite a fast XML parser written in C++.

安人多梦 2024-07-31 00:01:55

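http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java xmlpull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60k on Windows.

I think pull parsing is a lot more intuitive than something like SAX. The code mirrors the XML document much more closely, which makes it easy to correlate the two.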
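
As a rough illustration of why pull-style code tends to mirror the document, here is a minimal self-contained sketch. It is not the wsdlpull or xmlpull API; `PullParser`, `Event` and `descriptions` are hypothetical names, and the parser ignores attributes, comments and CDATA to stay short.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

enum class Event { StartTag, EndTag, Text, EndDocument };

// Hypothetical forward-only pull parser; not the wsdlpull or xmlpull API.
// Assumes well-formed XML without attributes, comments, or CDATA.
class PullParser {
public:
    explicit PullParser(std::istream& in) : in_(in) {}

    // Advances to the next event. Afterwards value() holds the tag name
    // (for StartTag/EndTag) or the character data (for Text).
    Event next() {
        const int eof = std::char_traits<char>::eof();
        value_.clear();
        int c = in_.peek();
        if (c == eof) return Event::EndDocument;
        if (c == '<') {
            in_.get();                       // consume '<'
            std::getline(in_, value_, '>');  // read the tag body, consume '>'
            if (!value_.empty() && value_.front() == '/') {
                value_.erase(0, 1);
                return Event::EndTag;
            }
            return Event::StartTag;
        }
        // Text: read up to, but not including, the next '<'.
        while ((c = in_.peek()) != eof && c != '<') {
            value_ += static_cast<char>(in_.get());
        }
        return Event::Text;
    }

    const std::string& value() const { return value_; }

private:
    std::istream& in_;
    std::string value_;
};

// Usage: the loop reads like the document it walks. Collects the text of
// every <description> element, one per line.
std::string descriptions(std::istream& in) {
    PullParser parser(in);
    std::string out;
    for (Event e = parser.next(); e != Event::EndDocument; e = parser.next()) {
        if (e == Event::StartTag && parser.value() == "description" &&
            parser.next() == Event::Text) {
            out += parser.value() + "\n";
        }
    }
    return out;
}
```

The caller decides when to ask for the next event, so the control flow stays in ordinary loops and conditionals instead of being spread across callbacks. Like the parser described here, the sketch only moves forward through the input.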

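The one downside is that it is forward only, meaning that you need to parse the elements as they come. We have a fairly messed-up design for reading our config files, and I need to parse a whole subtree, make some checks, set some defaults, and then parse it again. With this parser, the only real way to handle something like that is to make a copy of the state, parse with the copy, and then continue with the original. It still ends up being a big win in terms of resources compared with our old DOM parser.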
