A lightweight XML parser efficient for large files?

Posted on 2024-07-24 00:01:55

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

Is there a good lightweight SAX parser for C++, comparable to TinyXML in footprint?
The XML structure is very simple; no advanced features such as namespaces or DTDs are needed. Just elements, attributes and CDATA.

I know about Xerces, but its sheer size of over 50 MB gives me shivers.

Thanks!

Comments (9)

绝情姑娘 2024-07-31 00:01:56

If your XML structure is very simple, you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml
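For a structure limited to elements, attributes and CDATA, the scanner idea can also be sketched by hand instead of with flex/bison. The following is only an illustrative sketch, not the W3C grammar: the `Scanner`, `Token` and `TokenKind` names are hypothetical, entities are not decoded, and it assumes no whitespace around `=` and no `>` inside attribute values.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token model for this sketch; not taken from any library.
enum class TokenKind { StartTag, EndTag, Text, Eof };

struct Token {
    TokenKind kind;
    std::string value;  // tag name, or character data for Text
    std::vector<std::pair<std::string, std::string>> attrs;
};

class Scanner {
public:
    explicit Scanner(std::istream& in) : in_(in) {}

    // Returns the next token; only the current token is held in memory.
    Token next() {
        if (pendingEnd_) { pendingEnd_ = false; return {TokenKind::EndTag, pendingName_, {}}; }
        char ch;
        while (in_.get(ch)) {
            if (ch != '<') {  // character data up to the next '<'
                std::string text(1, ch);
                while (!atEnd() && in_.peek() != '<') text += static_cast<char>(in_.get());
                if (text.find_first_not_of(" \t\r\n") == std::string::npos) continue;
                return {TokenKind::Text, text, {}};
            }
            const int p = in_.peek();
            if (p == '?') { readUntil("?>"); continue; }  // XML declaration or PI
            if (p == '!') {
                in_.get();
                if (in_.peek() == '[') {  // <![CDATA[ ... ]]>
                    readUntil("[CDATA[");
                    return {TokenKind::Text, readUntil("]]>"), {}};
                }
                readUntil(in_.peek() == '-' ? "-->" : ">");  // comment or DOCTYPE
                continue;
            }
            if (p == '/') { in_.get(); return {TokenKind::EndTag, readUntil(">"), {}}; }
            std::string body = readUntil(">");
            const bool selfClosing = !body.empty() && body.back() == '/';
            if (selfClosing) body.pop_back();
            Token t = parseStartTag(body);
            if (selfClosing) { pendingEnd_ = true; pendingName_ = t.value; }
            return t;
        }
        return {TokenKind::Eof, "", {}};
    }

private:
    bool atEnd() { return in_.peek() == std::char_traits<char>::eof(); }

    // Consumes input through the terminator and returns what preceded it.
    std::string readUntil(const std::string& stop) {
        std::string out;
        char ch;
        while (in_.get(ch)) {
            out += ch;
            const std::size_t n = out.size(), m = stop.size();
            if (n >= m && out.compare(n - m, m, stop) == 0) { out.erase(n - m); break; }
        }
        return out;
    }

    // Splits "name key="value" ..." into a StartTag token.
    static Token parseStartTag(const std::string& body) {
        Token t{TokenKind::StartTag, "", {}};
        std::size_t i = 0;
        auto isSpace = [&](std::size_t k) {
            return std::isspace(static_cast<unsigned char>(body[k])) != 0;
        };
        while (i < body.size() && !isSpace(i)) t.value += body[i++];
        for (;;) {
            while (i < body.size() && isSpace(i)) ++i;
            if (i >= body.size()) break;
            const std::size_t eq = body.find('=', i);
            if (eq == std::string::npos || eq + 1 >= body.size()) break;
            const char quote = body[eq + 1];
            const std::size_t close = body.find(quote, eq + 2);
            if (close == std::string::npos) break;
            t.attrs.emplace_back(body.substr(i, eq - i), body.substr(eq + 2, close - eq - 2));
            i = close + 1;
        }
        return t;
    }

    std::istream& in_;
    bool pendingEnd_ = false;
    std::string pendingName_;
};
```

Because the scanner reads from a `std::istream`, it can be pointed at a `std::ifstream` for a large file and called in a loop until it returns `TokenKind::Eof`; memory use then depends on the largest single token rather than on the file size.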

黑白记忆 2024-07-31 00:01:56

firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a writer for huge XML files too. It adds up to about 250 KB to your executable. When used in memory it has 1/3 the footprint of TinyXML, by one user's report. When used on a huge file it holds only a small buffer (around 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project as a single .cpp and .h file.

The easiest way to try it out is with a script in the free firstobject XML editor such as this:

ParseHugeXmlFile()
{
  CMarkup xml;
  xml.Open( "HugeFile.xml", MDF_READFILE );
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}

From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.

梦纸 2024-07-31 00:01:56

You can try https://github.com/thinlizzy/die-xml. It seems to be very small and easy to use.

It is a recently written open-source C++0x XML SAX parser, and the author welcomes feedback.

It parses an input stream and generates events through callbacks compatible with std::function.

The stack machine uses finite automata as a backend, and some events (start tag and text nodes) use iterators to minimize buffering, which makes it pretty lightweight.
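To illustrate the general shape of a SAX-style parser that reports events through std::function callbacks, here is a minimal sketch. It is not die-xml's API: `SaxHandlers`, `parseSax` and `countElements` are hypothetical names, attributes are skipped, and comments, CDATA, entities and `>` inside attribute values are deliberately not handled.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical callback set for this sketch; unset callbacks are simply skipped.
struct SaxHandlers {
    std::function<void(const std::string& name)> startTag;
    std::function<void(const std::string& name)> endTag;
    std::function<void(const std::string& text)> text;
};

// Push-style driver: reads the stream once and fires callbacks as markup is seen.
// Only the current text run or tag is buffered, never the whole document.
void parseSax(std::istream& in, const SaxHandlers& h) {
    std::string buf;
    char ch;
    while (in.get(ch)) {
        if (ch != '<') { buf += ch; continue; }
        if (!buf.empty()) {
            // Report text unless it is whitespace only.
            if (h.text && buf.find_first_not_of(" \t\r\n") != std::string::npos) h.text(buf);
            buf.clear();
        }
        std::string tag;
        std::getline(in, tag, '>');  // everything up to the closing '>'
        if (tag.empty() || tag[0] == '?' || tag[0] == '!') continue;  // prolog, DOCTYPE
        if (tag[0] == '/') {
            if (h.endTag) h.endTag(tag.substr(1));
            continue;
        }
        const bool selfClosing = tag.back() == '/';
        if (selfClosing) tag.pop_back();
        // The element name ends at the first whitespace; attributes are ignored here.
        const std::string name = tag.substr(0, tag.find_first_of(" \t\r\n"));
        if (h.startTag) h.startTag(name);
        if (selfClosing && h.endTag) h.endTag(name);
    }
}

// Example use: count elements with a given name without storing the document.
int countElements(std::istream& in, const std::string& wanted) {
    int count = 0;
    SaxHandlers h;
    h.startTag = [&](const std::string& name) { if (name == wanted) ++count; };
    parseSax(in, h);
    return count;
}
```

The handlers can be lambdas that capture local state, which is the main convenience of std::function-compatible callbacks over a fixed handler base class.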

永言不败 2024-07-31 00:01:56

I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.

被翻牌 2024-07-31 00:01:56

I highly recommend pugixml

pugixml is a light-weight C++ XML processing library.

"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."

I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.

pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I have been using it since version 0.8; it is now at 1.7.

The great bonus of this parser is its XPath 1.0 implementation! For more complex tree queries, XPath is a godsend!

The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.

It is a small, fast parser. It is a good choice even for an iOS or Android app, if you do not mind linking C++ code.

Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html

A few examples for x86:

pugixml is more than 38 times faster than TinyXML,
4.1 times faster than CMarkup, and
2.7 times faster than expat or libxml.

For x64, pugixml is the fastest parser I know of.

Also check how much memory your XML parser uses. Some parsers just gobble precious memory!

擦肩而过的背影 2024-07-31 00:01:55

If you are using C, then you can use LibXML from the Gnome project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around LibXML.

The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.

淡看悲欢离合 2024-07-31 00:01:55

I like Expat
http://expat.sourceforge.net/

It is C based but there are several C++ wrappers around to help.

感受沵的脚步 2024-07-31 00:01:55

RapidXML is quite a fast parser for XML written in C++.

安人多梦 2024-07-31 00:01:55

http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java xmlpull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60 KB on Windows.

I think that pull parsing is a lot more intuitive than something like SAX. The code mirrors the XML document much more closely, making it easy to correlate the two.

The one downside is that it is forward-only, meaning that you need to parse the elements as they come. We have a fairly messed-up design for reading our config files, and I need to parse a whole subtree, make some checks, set some defaults, and then parse it again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue with the original. It still ends up being a big win in terms of resources compared with our old DOM parser.
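The copy-the-state technique can be sketched roughly as follows. This is not the wsdlpull or xmlpull API: `PullReader`, `Event` and `subtreeHasChild` are hypothetical names, and the sketch assumes an in-memory document so that copying the reader is just copying an offset. A file-backed reader would presumably have to save and restore a stream position instead. Attributes, comments, CDATA and entities are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical event model for this sketch.
enum class Event { StartTag, EndTag, Text, EndDocument };

class PullReader {
public:
    // The document string must outlive the reader and all of its copies.
    explicit PullReader(const std::string& doc) : doc_(&doc) {}

    // Advance to the next event. The caller drives the loop, which is
    // what makes this "pull" rather than SAX-style "push".
    Event next() {
        const std::string& d = *doc_;
        if (pendingEnd_) { pendingEnd_ = false; return Event::EndTag; }
        while (pos_ < d.size()) {
            if (d[pos_] != '<') {  // character data up to the next '<'
                std::size_t lt = d.find('<', pos_);
                if (lt == std::string::npos) lt = d.size();
                value_ = d.substr(pos_, lt - pos_);
                pos_ = lt;
                if (value_.find_first_not_of(" \t\r\n") == std::string::npos) continue;
                return Event::Text;
            }
            const std::size_t gt = d.find('>', pos_);
            if (gt == std::string::npos) break;  // truncated markup
            std::string tag = d.substr(pos_ + 1, gt - pos_ - 1);
            pos_ = gt + 1;
            if (tag.empty() || tag[0] == '?' || tag[0] == '!') continue;  // prolog, DOCTYPE
            if (tag[0] == '/') { value_ = tag.substr(1); return Event::EndTag; }
            if (tag.back() == '/') { tag.pop_back(); pendingEnd_ = true; }  // <tag/>
            value_ = tag.substr(0, tag.find_first_of(" \t\r\n"));
            return Event::StartTag;
        }
        pos_ = d.size();
        return Event::EndDocument;
    }

    const std::string& value() const { return value_; }  // tag name or text

private:
    const std::string* doc_;
    std::size_t pos_ = 0;
    bool pendingEnd_ = false;
    std::string value_;
};

// Look ahead inside the element the reader has just entered and report
// whether it has a direct child element with the given name.
// The reader is taken by value, so the caller's position is untouched.
bool subtreeHasChild(PullReader lookahead, const std::string& childName) {
    int depth = 1;  // inside the element whose StartTag was just returned
    while (depth > 0) {
        const Event e = lookahead.next();
        if (e == Event::EndDocument) break;
        if (e == Event::StartTag) {
            if (depth == 1 && lookahead.value() == childName) return true;
            ++depth;
        } else if (e == Event::EndTag) {
            --depth;
        }
    }
    return false;
}
```

After a call such as `subtreeHasChild(reader, "port")`, the original `reader` still sits just after the start tag it had returned, so the subtree can be parsed again from the beginning once the checks and defaults are settled.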
