XML parsing with libxml2 produces duplicates

Posted on 2024-10-01 18:49:18

I'm using libxml2 to parse the following XML string:

<?xml version="1.0"?>
<note>
    <to>
        <name>Tove</name>
        <name>Tovi</name>
    </to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

Formatted as a C-style string:

"<?xml version=\"1.0\"?><note><to><name>Tove</name><name>Tovi</name></to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

This is based on the example from the W3C's site on XML; I only added the nested names in the "to" field.

I have the following recursive code in C++ to parse it into an object tree:

RBCXMLNode * RBCXMLDoc::recursiveProcess(xmlNodePtr node) {
    RBCXMLNode *rNode = new RBCXMLNode();
    xmlNodePtr childIterator = node->xmlChildrenNode;

    const char *chars = (const char *)(node->name);
    string name(chars);
    const char *content = (const char *)xmlNodeGetContent(node);
    rNode->setName(name);
    rNode->setUTF8Data(content);
    cout << "Just parsed " << rNode->name() << ": " << rNode->stringData() << endl;
    while (childIterator != NULL) {
        RBCXMLNode *rNode2 = recursiveProcess(childIterator);
        rNode->addChild(rNode2);
        childIterator = childIterator->next;
    }
    return rNode;
}

So for each node it creates the matching object, sets its name and content, then recurses for its children. Note that each node is only processed once. However, I get the following (nonsensical, to me at least) output:

Just parsed note: ToveToviJaniReminderDon't forget me this weekend!
Just parsed to: ToveTovi
Just parsed name: Tove
Just parsed text: Tove
Just parsed name: Tovi
Just parsed text: Tovi
Just parsed from: Jani
Just parsed text: Jani
Just parsed heading: Reminder
Just parsed text: Reminder
Just parsed body: Don't forget me this weekend!
Just parsed text: Don't forget me this weekend!

Note that each item is being parsed twice: once under its proper name and once under the name "text". Also, the "note" root node is having its data parsed as well, which is undesirable. Note too that this root node is not parsed twice, as the others are.

So I have two questions:

1. How do I avoid parsing the root node's data, so that I get just its name and not its content? This will presumably also happen with more deeply nested nodes.
2. How do I avoid the duplicate parsing on the other nodes? Obviously, I want to keep the properly named versions, while allowing for the (unlikely) possibility that a node actually is named "text". Also, there may be duplicate nodes that are desired, so just checking whether a node has already been parsed is not an option.

Thanks in advance.

你列表最软的妹 2024-10-08 18:49:18

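The main problem I see in your code is that you're calling xmlNodeGetContent() on every node. For an element, this returns all the text between its opening tag and its closing counterpart, including the text of every descendant. That is why "note" reports the concatenated text of the whole document. The entries named "text" are not duplicates either: libxml2 stores the character data inside an element as a separate child text node, and the name of such a node is "text".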

When parsing with libxml2 you get some nodes whose content is complex (child elements, text nodes, or both), so you cannot rely on xmlNodeGetContent() to retrieve an element's own data. You have to write the recursive function differently. For instance, the quickest fix for your function is to print only the node name for nodes that are not text (tested with xmlNodeIsText()), and to print xmlNodeGetContent() only for nodes that are text. This would give you output something like:

Just parsed note
Just parsed to
Just parsed name
Just parsed text: Tove
Just parsed name
Just parsed text: Tovi
...

Note that now you print only the name for elements, and print content only when the node is a text node.

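This also makes sense conceptually: the content of a non-text node can be arbitrarily complex, so there is no single obvious way to print it, and the only simple thing you can print is its label (name). Text nodes, on the other hand, are simple enough that you can print their content directly. Testing the node type rather than comparing the name also keeps a real element that happens to be named "text" distinct from a text node.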
