当前位置：文江博客话题详情

XML xml-parsing d

如何从头开始创建/编写一个简单的 XML 解析器？

发布于 2024-11-13 10:13:55 字数 222 浏览 3 评论 0原文

如何从头开始创建/编写一个简单的 XML 解析器？

我想知道什么是简化的基本英语步骤，而不是代码示例。

一个好的解析器是如何设计的？我知道正则表达式不应该在解析器中使用，但是正则表达式在解析 XML 中的作用有多大？

推荐使用什么数据结构？我应该使用链表来存储和检索节点、属性和值吗？

我想学习如何创建一个 XML 解析器，以便我可以用 D 编程语言编写一个解析器。

收藏 0

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

评论（6）

謌踐踏愛綪 2024-11-20 10:13:55

如果您不知道如何编写解析器，那么您需要阅读一些内容。找到任何一本关于编译器编写的书（许多最好的书都是 30 或 40 年前写的，例如 Aho 和 Ullmann）并学习有关词法分析和语法分析的章节。 XML 本质上没有什么不同，只是词法阶段和语法阶段不像某些语言那样彼此明确隔离。

警告一下，如果您想编写一个完全一致的 XML 解析器，那么您 90% 的精力将花在处理规范中晦涩难懂的角落的边缘情况上，处理诸如大多数 XML 用户甚至不知道的参数实体之类的事情。意识到。

回复收藏 0 原文

冷情妓 2024-11-20 10:13:55

对于基于事件的解析器，用户需要向其传递一些函数（startNode(name,attrs)、endNode(name) 和 someText(txt) 可能通过接口）并在需要时调用它们，当您传递文件时，

解析器将有一个 while 循环，该循环将交替读取直到<和直到>并对参数类型进行正确的转换，

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length !=0){
        //not using a rewritable buffer to take advantage of slicing 
        //but it's a quick conversion to a implementation with a rewritable buffer though
        if(str.length>1)p.someText(str.chomp('<'));


        str = file.readln('>');
        str = str.chomp('>');

        //split str in name and attrs
        auto parts = str.split();
        string name = parts[0];
        string[string] attrs;
        foreach(attribute;parts[1..$]){
            auto splitAtrr = attribute.split("=");
            attrs[splitAtrr[0]] = splitAtrr[1];
        }

        if(str[0] == '/')p.endNode(name);
        else {
            p.startNode(name,attrs);
            if(str[str.length-1]=='/')p.endNode(name);//self closing tag
        }
    }
}

您可以构建DOM 解析器位于基于事件的解析器之上，每个节点所需的基本功能是 getChildren 和 getParent getName 和 getAttributes （在构建时使用设置器;)）

使用上述方法的 dom 解析器的对象：

class DOMEventParser : EventParser{
    DOMNode current = new RootNode();
    overrides void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    overrides void endNode(string name){
        asser(name == current.name);
        current = current.parent;
    }
    overrides void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}

当解析时结束根节点将拥有 DOM 树的根

注意：我没有在其中放置任何验证代码以确保 xml 的正确性

编辑：属性的解析有一个错误，正则表达式比在空格上分割更好

for and event based parser the user need to pass it some functions (startNode(name,attrs), endNode(name) and someText(txt) likely through an interface) and call them when needed as you pass over the file

the parser will have a while loop that will alternate between reading until < and until > and do the proper conversions to the parameter types

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length !=0){
        //not using a rewritable buffer to take advantage of slicing 
        //but it's a quick conversion to a implementation with a rewritable buffer though
        if(str.length>1)p.someText(str.chomp('<'));


        str = file.readln('>');
        str = str.chomp('>');

        //split str in name and attrs
        auto parts = str.split();
        string name = parts[0];
        string[string] attrs;
        foreach(attribute;parts[1..$]){
            auto splitAtrr = attribute.split("=");
            attrs[splitAtrr[0]] = splitAtrr[1];
        }

        if(str[0] == '/')p.endNode(name);
        else {
            p.startNode(name,attrs);
            if(str[str.length-1]=='/')p.endNode(name);//self closing tag
        }
    }
}

you can build a DOM parser on top of a event based parser and the basic functionality you'll need for each node is getChildren and getParent getName and getAttributes (with setters when building ;) )

the object for the dom parser with the above described methods:

class DOMEventParser : EventParser{
    DOMNode current = new RootNode();
    overrides void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    overrides void endNode(string name){
        asser(name == current.name);
        current = current.parent;
    }
    overrides void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}

when the parsing ends the rootnode will have the root of the DOM tree

note: I didn't put any verification code in there to ensure correctness of the xml

edit: the parsing of the attributes has a bug in it, instead of splitting on whitespace a regex is better for that

回复收藏 0 原文

浴红衣 2024-11-20 10:13:55

解析器和节点列表之间存在差异。解析器是获取一堆纯文本 XML 并尝试确定其中有哪些节点的部分。然后有一个用于保存节点的内部结构。在该结构之上的一层中，您可以找到 DOM（文档对象模型）。这是构成 XML 文档的嵌套节点结构。解析器只需要知道通用 DOM 接口即可创建节点。

我不会使用正则表达式作为解析器。我认为最好的事情就是逐个字符地遍历字符串，并检查您得到的内容是否与您应该得到的内容匹配。

但为什么不使用任何现有的 XML 解析器呢？数据编码有多种可能性。许多例外。如果您的解析器不能管理所有这些内容，那么 XML 解析器的称号就不值一提了。

回复收藏 0 原文

抠脚大汉 2024-11-20 10:13:55

解析器必须满足您的输入语言的需求。在您的例子中，是简单的 XML。关于 XML，首先要了解的是，它是上下文无关的并且绝对不含糊，所有内容都包含在两个标记之间，这就是 XML 出名的原因：它易于解析。最后，XML 总是简单地用树结构表示。如前所述，您可以简单地解析 XML 并同时执行代码，或者解析 XML，生成树，然后根据该树执行代码。

D 提供了一种非常有趣的方法来非常轻松地编写 XML 解析器，例如：

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};

A parser must fit the needs of your input language. In your case, simple XML. The first thing to know about XML is that it is context-free and absolutely not ambiguous, everything is wrapped between two tokens, and this is what makes XML famous: it is easy to parse. Finally, XML is always simply represented by a tree structure. As stated, you can simply parse your XML and execute code in the meantime, or parse the XML, generating the tree, and then execute code according to this tree.

D provides a very interesting way to write an XML parser very easily, for example:

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};

回复收藏 0 原文

不醒的梦 2024-11-20 10:13:55

文档中的第一个元素应该是序言。这说明了 xml 版本、编码、文件是否独立以及可能还有其他一些内容。序言以 开头。

序言之后是带有元数据的标签。特殊标签，如注释、文档类型和元素定义应以 开头。处理指令以开头。这里可以有嵌套标签，因为标签可以有和标签dtd 样式 xml 文档 - 请参阅 Wikipedia 了解完整示例。

应该只有一个顶级元素。它是唯一一个前面没有 或的。顶级元素之后可能还有更多的元数据标签；首先处理这些。

对于显式解析：首先识别标签——它们都以 < 开头——然后确定它是什么类型的标签以及它的闭包是什么样的。

找到标签后，您将需要找到其结束标签。首先检查标签是否自闭合；否则，求其闭包。

对于数据结构：我建议使用树结构，其中每个元素都是一个节点，每个节点都有一个子元素的索引/映射列表。

显然，一个完整的解析器需要更多的研究；我希望这足以让您开始。

回复收藏 0 原文

谎言 2024-11-20 10:13:55

由于 D 与 Java 关系相当密切，因此可能会使用 ANTLR 生成 XML 解析器（因为很可能存在 XML < a href="http://en.wikipedia.org/wiki/EBNF" rel="nofollow">EBNF ANTLR 语法已经存在，您可以使用它们），然后将生成的 Java 解析器代码转换为D，可以是一个选择吗？至少这会给你一个起点，然后你可以付出一些努力来尝试专门针对 D 优化代码......

至少 ANTLR 根本不像许多人想象的那么难。在对它一无所知之后，我通过观看 3-4 个 ANTLR 上这套很棒的截屏视频< /a>.

顺便说一句，我发现 ANTLRWorks 使用起来很轻松（与 Eclipse 中使用的插件相反）截屏视频...但截屏视频内容仍然适用）。

只是我的 0.02c。

回复收藏 0 原文

~没有更多了~

关于作者

暂无简介

0 文章

0 评论

23 人气

关注发私信

相关话题

热门标签

操作系统程序设计 IT运维 Linux系统管理 JavaScript 服务器应用 solaris C/C++ PHP Shell BSD Vue.js aix Oracle Python HTML 系统管理 HTML5 CSS 前端

推荐作者

離殇

文章 0 评论 0

小姐丶请自重

文章 0 评论 0

Aik

文章 0 评论 0

国产ˉ祖宗

文章 0 评论 0

猥琐帝

文章 0 评论 0

半仙

文章 0 评论 0

友情链接

我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的隐私政策了解更多相关信息。单击 接受 或继续使用网站，即表示您同意使用 Cookies 和您的相关数据。

原文