What XML parser should I use in C++?

Posted 2025-01-08 10:02:45

往事风中埋 2025-01-15 10:02:45

Just like with standard library containers, what library you should use depends on your needs. Here's a convenient flowchart:

[Image: flowchart for choosing a C++ XML library]

So the first question is this: What do you need?

I Need Full XML Compliance

OK, so you need to process XML. Not toy XML, real XML. You need to be able to read and write all of the XML specification, not just the low-lying, easy-to-parse bits. You need Namespaces, DocTypes, entity substitution, the works. The W3C XML Specification, in its entirety.

The next question is: Does your API need to conform to DOM or SAX?

I Need Exact DOM and/or SAX Conformance

OK, so you really need the API to be DOM and/or SAX. It can't just be a SAX-style push parser, or a DOM-style retained parser. It must be the actual DOM or the actual SAX, to the extent that C++ allows.

You have chosen:

Xerces

That's your choice. It's pretty much the only C++ XML parser/writer that has full (or as near as C++ allows) DOM and SAX conformance. It also has XInclude support, XML Schema support, and a plethora of other features.

It has no real dependencies. It uses the Apache license.

I Don't Care About DOM and/or SAX Conformance

You have chosen:

LibXML2

LibXML2 offers a C-style interface (if that really bothers you, go use Xerces), though the interface is at least somewhat object-based and easily wrapped. It provides a lot of features, like XInclude support (with callbacks so that you can tell it where it gets the file from), an XPath 1.0 recognizer, RelaxNG and Schematron support (though the error messages leave a lot to be desired), and so forth.

It does have a dependency on iconv, but it can be configured without that dependency, at the cost of a more limited set of text encodings it can parse.

It uses the MIT license.

I Do Not Need Full XML Compliance

OK, so full XML compliance doesn't matter to you. Your XML documents are either fully under your control or are guaranteed to use the "basic subset" of XML: no namespaces, entities, etc.

So what does matter to you? The next question is: What is the most important thing to you in your XML work?

Maximum XML Parsing Performance

Your application needs to take XML and turn it into C++ data structures as fast as this conversion can possibly happen.

You have chosen:

RapidXML

This XML parser is exactly what it says on the tin: rapid XML. It doesn't even deal with pulling the file into memory; how that happens is up to you. What it does deal with is parsing that into a series of C++ data structures that you can access. And it does this about as fast as it takes to scan the file byte by byte.

Of course, there's no such thing as a free lunch. Like most XML parsers that don't care about the XML specification, RapidXML doesn't touch namespaces, DocTypes, entities (with the exception of character entities and the 5 predefined XML ones), and so forth. So basically nodes, elements, attributes, and such.

Also, it is a DOM-style parser. So it does require that you read all of the text in. However, what it doesn't do is copy any of that text (usually). The way RapidXML gets most of its speed is by referring to strings in-place. This requires more memory management on your part (you must keep that string alive while RapidXML is looking at it).
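The in-place idea can be illustrated without RapidXML itself. The following is only a sketch and is not RapidXML's API: element_text and points_into are hypothetical helpers. element_text hands back a std::string_view that points into the caller's buffer, so nothing is copied, and the buffer has to outlive every view taken from it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of RapidXML): returns the text between
// <tag> and </tag> as a view into `buffer`. No characters are copied.
// Handles only the simplest case: no attributes, no nested use of the tag.
std::string_view element_text(std::string_view buffer, std::string_view tag) {
    const std::string open = "<" + std::string(tag) + ">";
    const std::string close = "</" + std::string(tag) + ">";
    const std::size_t start = buffer.find(open);
    if (start == std::string_view::npos) return {};
    const std::size_t text_begin = start + open.size();
    const std::size_t text_end = buffer.find(close, text_begin);
    if (text_end == std::string_view::npos) return {};
    return buffer.substr(text_begin, text_end - text_begin);
}

// True when `view` lies inside `buffer`, i.e. it is a reference, not a copy.
bool points_into(std::string_view view, std::string_view buffer) {
    return view.data() >= buffer.data() &&
           view.data() + view.size() <= buffer.data() + buffer.size();
}
```

If the string that owns the buffer is destroyed or modified, every view obtained this way dangles. That is the same obligation RapidXML places on the caller.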

RapidXML's DOM is bare-bones. You can get string values for things. You can search for attributes by name. That's about it. There are no convenience functions to turn attributes into other values (numbers, dates, etc). You just get strings.
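Since you only get strings, conversions are up to you. A minimal sketch of one way to do it, assuming C++17 and a hypothetical helper named to_int (nothing here is a RapidXML function):

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: converts an attribute's string value to an int.
// Returns std::nullopt unless the whole text is a valid, in-range integer.
std::optional<int> to_int(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

std::from_chars does not skip whitespace or accept a leading plus sign, so trimming, if wanted, is also the caller's job.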

One other downside with RapidXML is that it is painful for writing XML. It requires you to do a lot of explicit memory allocation of string names in order to build its DOM. It does provide a kind of string buffer, but that still requires a lot of explicit work on your end. It's certainly functional, but it's a pain to use.

It uses the MIT license. It is a header-only library with no dependencies.

There is a RapidXML "GitHub patch" that allows it to also work with namespaces.

I Care About Performance But Not Quite That Much

Yes, performance matters to you. But maybe you need something a bit less bare-bones. Maybe something that can handle more Unicode, or doesn't require so much user-controlled memory management. Performance is still important, but you want something a little less direct.

You have chosen:

PugiXML

Historically, this served as inspiration for RapidXML. But the two projects have diverged, with Pugi offering more features, while RapidXML is focused entirely on speed.

PugiXML offers Unicode conversion support, so if you have some UTF-16 docs around and want to read them as UTF-8, Pugi will provide. It even has an XPath 1.0 implementation, if you need that sort of thing.
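To give a sense of what such a conversion involves, here is a small standalone sketch. It is not PugiXML's code or API; utf16_to_utf8 and append_utf8 are hypothetical names, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Appends one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 code units to UTF-8. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A library that does this for you also has to detect the input encoding and byte order first, which this sketch leaves out.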

But Pugi is still quite fast. Like RapidXML, it has no dependencies and is distributed under the MIT License.

Reading Huge Documents

You need to read documents that are measured in the gigabytes in size. Maybe you're getting them from stdin, being fed by some other process. Or you're reading them from massive files. Or whatever. The point is, what you need is to not have to read the entire file into memory all at once in order to process it.

You have chosen:

LibXML2

Xerces's SAX-style API will work in this capacity, but LibXML2 is here because it's a bit easier to work with. A SAX-style API is a push-API: it starts parsing a stream and just fires off events that you have to catch. You are forced to manage context, state, and so forth. Code that reads a SAX-style API is a lot more spread out than one might hope.

LibXML2's xmlReader object is a pull-API. You ask to go to the next XML node or element; you aren't told. This allows you to store context as you see fit, to handle different entities in a way that's much more readable in code than a bunch of callbacks.
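The difference is easier to see side by side. The sketch below uses a toy tokenizer over a simplified XML subset (tags without attributes, plus text); it is not the LibXML2 or Xerces API, and PullReader, push_parse, titles_pull and titles_push are hypothetical names. Both consumers collect the text of every title element.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

enum class Kind { Start, End, Text, Eof };

struct Token {
    Kind kind;
    std::string_view value;  // tag name or text, a view into the input
};

// Pull style: the caller asks for the next token when it wants one.
class PullReader {
public:
    explicit PullReader(std::string_view input) : rest_(input) {}

    Token next() {
        if (rest_.empty()) return {Kind::Eof, {}};
        if (rest_.front() != '<') {
            const std::size_t lt = rest_.find('<');  // npos: text runs to the end
            const Token token{Kind::Text, rest_.substr(0, lt)};
            rest_.remove_prefix(token.value.size());
            return token;
        }
        const std::size_t gt = rest_.find('>');
        if (gt == std::string_view::npos) {  // unterminated tag: stop
            rest_ = {};
            return {Kind::Eof, {}};
        }
        const bool closing = gt > 1 && rest_[1] == '/';
        const std::size_t name_begin = closing ? 2 : 1;
        const Token token{closing ? Kind::End : Kind::Start,
                          rest_.substr(name_begin, gt - name_begin)};
        rest_.remove_prefix(gt + 1);
        return token;
    }

private:
    std::string_view rest_;
};

// Push style: the parser drives and reports each token through a callback.
void push_parse(std::string_view input,
                const std::function<void(const Token&)>& on_token) {
    PullReader reader(input);
    for (Token t = reader.next(); t.kind != Kind::Eof; t = reader.next()) {
        on_token(t);
    }
}

// Pull consumer: context lives in ordinary control flow.
std::vector<std::string> titles_pull(std::string_view xml) {
    std::vector<std::string> titles;
    PullReader reader(xml);
    for (Token t = reader.next(); t.kind != Kind::Eof; t = reader.next()) {
        if (t.kind == Kind::Start && t.value == "title") {
            const Token text = reader.next();
            if (text.kind == Kind::Text) titles.emplace_back(text.value);
        }
    }
    return titles;
}

// Push consumer: context must be kept in state that survives between callbacks.
std::vector<std::string> titles_push(std::string_view xml) {
    std::vector<std::string> titles;
    bool in_title = false;
    push_parse(xml, [&](const Token& t) {
        if (t.kind == Kind::Start) in_title = (t.value == "title");
        else if (t.kind == Kind::End) in_title = false;
        else if (in_title) titles.emplace_back(t.value);
    });
    return titles;
}
```

With one flag the push version is still manageable. With nested elements and several kinds of records, the flags multiply and the logic spreads across callbacks, which is the complaint made above.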

Alternatives

Expat

Expat is a well-known parser written in C that uses a stream-oriented, callback-based (push) API. It was written by James Clark.

Its current status is active. At the time of writing, the most recent version was 2.2.9, released on 2019-09-25.

LlamaXML

It is an implementation of a StAX-style API. It is a pull-parser, similar to LibXML2's xmlReader parser.

But it hasn't been updated since 2005. So again, Caveat Emptor.

XPath Support

XPath is a system for querying elements within an XML tree. It's a handy way of effectively naming an element or collection of elements by common properties, using a standardized syntax. Many XML libraries offer XPath support.

There are effectively three choices here:

LibXML2: It provides full XPath 1.0 support. Again, it is a C API, so if that bothers you, there are alternatives.
PugiXML: It comes with XPath 1.0 support as well. As above, it's more of a C++ API than LibXML2, so you may be more comfortable with it.
TinyXML: It does not come with XPath support, but there is the TinyXPath library that provides it. TinyXML is undergoing a conversion to version 2.0, which significantly changes the API, so TinyXPath may not work with the new API. Like TinyXML itself, TinyXPath is distributed under the zLib license.

Just Get The Job Done

So, you don't care about XML correctness. Performance isn't an issue for you. Streaming is irrelevant. All you want is something that gets XML into memory and allows you to stick it back onto disk again. What you care about is API.

You want an XML parser that's going to be small, easy to install, trivial to use, and small enough to be irrelevant to your eventual executable's size.

You have chosen:

TinyXML

I put TinyXML in this slot because it is about as braindead simple to use as XML parsers get. Yes, it's slow, but it's simple and obvious. It has a lot of convenience functions for converting attributes and so forth.

Writing XML is no problem in TinyXML. You just new up some objects, attach them together, send the document to a std::ostream, and everyone's happy.

There is also something of an ecosystem built around TinyXML, with a more iterator-friendly API, and even an XPath 1.0 implementation layered on top of it.

TinyXML uses the zLib license, which is more or less the MIT License with a different name.

谎言月老 2025-01-15 10:02:45

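You may want to consider another approach to handling XML called XML data binding, especially if you already have a formal specification of your XML vocabulary, for example, in XML Schema.

XML data binding allows you to use XML without actually doing any XML parsing or serialization. A data binding compiler auto-generates all the low-level code and presents the parsed data as C++ classes that correspond to your application domain. You then work with this data by calling functions and using C++ types (int, double, etc.) instead of comparing strings and parsing text (which is what you do with low-level XML access APIs such as DOM or SAX).

See, for example, an open-source XML data binding implementation that I wrote, CodeSynthesis XSD, and, for a lighter-weight, dependency-free version, CodeSynthesis XSD/e.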
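To make the contrast concrete, here is a hand-written sketch of the kind of class a data binding compiler might produce. It is an illustration only: Person, RawElement and bind_person are hypothetical, the schema (a person with a string name and an int age) is assumed, and real generated code, for example from CodeSynthesis XSD, will look different.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for what a low-level API hands you: names and values as strings.
using RawElement = std::map<std::string, std::string>;

// Hypothetical class of the kind a data binding compiler might generate
// from a schema declaring a person with a string "name" and an int "age".
class Person {
public:
    Person(std::string name, int age) : name_(std::move(name)), age_(age) {}
    const std::string& name() const { return name_; }
    int age() const { return age_; }
    void age(int value) { age_ = value; }

private:
    std::string name_;
    int age_;
};

// The string lookups and the text-to-number conversion happen once, in the
// binding layer, instead of being scattered through application code.
Person bind_person(const RawElement& raw) {
    return Person(raw.at("name"), std::stoi(raw.at("age")));
}
```

Application code then works with p.age() as an int and never touches element names or text conversion directly.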

长亭外，古道边 2025-01-15 10:02:45

At Secured Globe, Inc. we use RapidXML. We tried all the others, but RapidXML seems to be the best choice for us.

Here is an example:

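    // Fragment from inside a function that returns bool. xmlData is assumed
    // to be a malloc'ed, zero-terminated char buffer, and result a wchar_t
    // buffer of at least 100 elements. GetNodeByElementName is not part of
    // RapidXML; it is assumed to be a user-defined lookup helper.
    rapidxml::xml_document<char> doc;
    doc.parse<0>(xmlData);  // parses in place, so xmlData must stay alive
    rapidxml::xml_node<char>* root = doc.first_node();

    rapidxml::xml_node<char>* node_account = nullptr;
    if (GetNodeByElementName(root, "Account", &node_account))
    {
        rapidxml::xml_node<char>* node_default = nullptr;
        if (GetNodeByElementName(node_account, "default", &node_default))
        {
            // Copy the value out before freeing the buffer it points into.
            swprintf(result, 100, L"%hs", node_default->value());
            free(xmlData);
            return true;
        }
    }
    free(xmlData);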

有深☉意 2025-01-15 10:02:45

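Another note about Expat: it's worth a look for embedded systems work. However, the documentation you are likely to find on the web is ancient and wrong. The source code actually has fairly thorough function-level comments, but it takes careful reading to make sense of them.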

长亭外，古道边 2025-01-15 10:02:45

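OK then. I've created a new one, since none of the libraries on the list satisfied my needs.

Benefits:

Pull-parser streaming API: the parser works like an iterator, with no callbacks and no DOM tree, so XML is read straight into your data structures
Exceptions and RTTI can be turned off through compiler options; error handling can be done with std::error_code
Bounded memory usage and large-file support (tested with a 100 MiB XMark file; speed depends on hardware). It includes limited loading of 3D models in COLLADA format
Unicode support, with auto-detection of the input source encoding

Project home page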

长伴 2025-01-15 10:02:45

I'll put mine up as well:

http://www.codeproject.com/Articles/998388/XMLplusplus-version-The-Cplusplus-update-of-my-XML

It has no XML validation features, but it is fast.
