如何从头开始创建/编写一个简单的 XML 解析器?

发布于 2024-11-13 10:13:55 字数 222 浏览 3 评论 0原文

如何从头开始创建/编写一个简单的 XML 解析器?

我想知道什么是简化的基本英语步骤,而不是代码示例。

一个好的解析器是如何设计的?我知道正则表达式不应该在解析器中使用,但是正则表达式在解析 XML 中的作用有多大?

推荐使用什么数据结构?我应该使用链表来存储和检索节点、属性和值吗?

我想学习如何创建一个 XML 解析器,以便我可以用 D 编程语言编写一个解析器。

How to create/write a simple XML parser from scratch?

Rather than code samples, I want to know what are the simplified, basic steps in English.

How is a good parser designed? I understand that regex should not be used in a parser, but how much is regex's role in parsing XML?

What is the recommended data structure to use? Should I use linked lists to store and retrieve nodes, attributes, and values?

I want to learn how to create an XML parser so that I can write one in D programming language.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

謌踐踏愛綪 2024-11-20 10:13:55

如果您不知道如何编写解析器,那么您需要阅读一些内容。找到任何一本关于编译器编写的书(许多最好的书都是 30 或 40 年前写的,例如 Aho 和 Ullmann)并学习有关词法分析和语法分析的章节。 XML 本质上没有什么不同,只是词法阶段和语法阶段不像某些语言那样彼此明确隔离。

警告一下,如果您想编写一个完全一致的 XML 解析器,那么您 90% 的精力将花在处理规范中晦涩难懂的角落的边缘情况上,处理诸如大多数 XML 用户甚至不知道的参数实体之类的事情。意识到。

If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler-writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullmann) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.

One word of warning, if you want to write a fully-conformant XML parser then 90% of your effort will be spent getting edge cases right in obscure corners of the spec dealing with things such as parameter entities that most XML users aren't even aware of.

冷情妓 2024-11-20 10:13:55

对于基于事件的解析器,用户需要向其传递一些函数(startNode(name,attrs)endNode(name)someText(txt) 可能通过接口)并在需要时调用它们,当您传递文件时,

解析器将有一个 while 循环,该循环将交替读取直到<和直到>并对参数类型进行正确的转换,

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length !=0){
        //not using a rewritable buffer to take advantage of slicing 
        //but it's a quick conversion to a implementation with a rewritable buffer though
        if(str.length>1)p.someText(str.chomp('<'));


        str = file.readln('>');
        str = str.chomp('>');

        //split str in name and attrs
        auto parts = str.split();
        string name = parts[0];
        string[string] attrs;
        foreach(attribute;parts[1..$]){
            auto splitAtrr = attribute.split("=");
            attrs[splitAtrr[0]] = splitAtrr[1];
        }

        if(str[0] == '/')p.endNode(name);
        else {
            p.startNode(name,attrs);
            if(str[str.length-1]=='/')p.endNode(name);//self closing tag
        }
    }
}

您可以构建DOM 解析器位于基于事件的解析器之上,每个节点所需的基本功能是 getChildren 和 getParent getName 和 getAttributes (在构建时使用设置器;))

使用上述方法的 dom 解析器的对象:

class DOMEventParser : EventParser{
    DOMNode current = new RootNode();
    overrides void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    overrides void endNode(string name){
        asser(name == current.name);
        current = current.parent;
    }
    overrides void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}

当解析时结束根节点将拥有 DOM 树的根

注意:我没有在其中放置任何验证代码以确保 xml 的正确性

编辑:属性的解析有一个错误,正则表达式比在空格上分割更好

for and event based parser the user need to pass it some functions (startNode(name,attrs), endNode(name) and someText(txt) likely through an interface) and call them when needed as you pass over the file

the parser will have a while loop that will alternate between reading until < and until > and do the proper conversions to the parameter types

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length !=0){
        //not using a rewritable buffer to take advantage of slicing 
        //but it's a quick conversion to a implementation with a rewritable buffer though
        if(str.length>1)p.someText(str.chomp('<'));


        str = file.readln('>');
        str = str.chomp('>');

        //split str in name and attrs
        auto parts = str.split();
        string name = parts[0];
        string[string] attrs;
        foreach(attribute;parts[1..$]){
            auto splitAtrr = attribute.split("=");
            attrs[splitAtrr[0]] = splitAtrr[1];
        }

        if(str[0] == '/')p.endNode(name);
        else {
            p.startNode(name,attrs);
            if(str[str.length-1]=='/')p.endNode(name);//self closing tag
        }
    }
}

you can build a DOM parser on top of a event based parser and the basic functionality you'll need for each node is getChildren and getParent getName and getAttributes (with setters when building ;) )

the object for the dom parser with the above described methods:

class DOMEventParser : EventParser{
    DOMNode current = new RootNode();
    overrides void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    overrides void endNode(string name){
        asser(name == current.name);
        current = current.parent;
    }
    overrides void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}

when the parsing ends the rootnode will have the root of the DOM tree

note: I didn't put any verification code in there to ensure correctness of the xml

edit: the parsing of the attributes has a bug in it, instead of splitting on whitespace a regex is better for that

浴红衣 2024-11-20 10:13:55

解析器和节点列表之间存在差异。解析器是获取一堆纯文本 XML 并尝试确定其中有哪些节点的部分。然后有一个用于保存节点的内部结构。在该结构之上的一层中,您可以找到 DOM(文档对象模型)。这是构成 XML 文档的嵌套节点结构。解析器只需要知道通用 DOM 接口即可创建节点。

我不会使用正则表达式作为解析器。我认为最好的事情就是逐个字符地遍历字符串,并检查您得到的内容是否与您应该得到的内容匹配。

但为什么不使用任何现有的 XML 解析器呢?数据编码有多种可能性。许多例外。如果您的解析器不能管理所有这些内容,那么 XML 解析器的称号就不值一提了。

There is a difference between a parser and a nodelist. The parser is the piece that takes a bunch of plain text XML and tries to determine what nodes are in there. Then there is an internal structure you save the nodes in. In a layer over that structure you find the DOM, the Document Object Model. This is a structure of nested nodes that make up your XML document. The parser only needs to know the generic DOM interface to create nodes.

I wouldn't use regex as a parser for this. I think the best thing is just traverse the string char by char and check if what you get matches with what you should get.

But why not use any of the existing XML parsers? There are many possibilities in encoding data. Many exceptions. And if your parsers doesn't manage them all it is hardly worth the title of XML parser.

抠脚大汉 2024-11-20 10:13:55

解析器必须满足您的输入语言的需求。在您的例子中,是简单的 XML。关于 XML,首先要了解的是,它是上下文无关的并且绝对不含糊,所有内容都包含在两个标记之间,这就是 XML 出名的原因:它易于解析。最后,XML 总是简单地用树结构表示。如前所述,您可以简单地解析 XML 并同时执行代码,或者解析 XML,生成树,然后根据该树执行代码。

D 提供了一种非常有趣的方法来非常轻松地编写 XML 解析器,例如:

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};

A parser must fit the needs of your input language. In your case, simple XML. The first thing to know about XML is that it is context-free and absolutely not ambiguous, everything is wrapped between two tokens, and this is what makes XML famous: it is easy to parse. Finally, XML is always simply represented by a tree structure. As stated, you can simply parse your XML and execute code in the meantime, or parse the XML, generating the tree, and then execute code according to this tree.

D provides a very interesting way to write an XML parser very easily, for example:

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};
不醒的梦 2024-11-20 10:13:55

文档中的第一个元素应该是序言。这说明了 xml 版本、编码、文件是否独立以及可能还有其他一些内容。序言以 开头。

序言之后是带有元数据的标签。特殊标签,如注释、文档类型和元素定义应以 开头。处理指令以 开头。这里可以有嵌套标签,因为 标签可以有 标签dtd 样式 xml 文档 - 请参阅 Wikipedia 了解完整示例。

应该只有一个顶级元素。它是唯一一个前面没有 的。顶级元素之后可能还有更多的元数据标签;首先处理这些。

对于显式解析:首先识别标签——它们都以 < 开头——然后确定它是什么类型的标签以及它的闭包是什么样的。

找到标签后,您将需要找到其结束标签。首先检查标签是否自闭合;否则,求其闭包。

对于数据结构:我建议使用树结构,其中每个元素都是一个节点,每个节点都有一个子元素的索引/映射列表。

显然,一个完整的解析器需要更多的研究;我希望这足以让您开始。

The first element in the document should be the prolog. This states the xml version, the encoding, whether the file is standalone, and maybe some other stuff. The prolog opens with <?.

After the prolog, there's tags with metadata. The special tags, like comments, doctypes, and element definitions should start with <!. Processing instructions start with <?. It is possible to have nested tags here, as the <!DOCTYPE tag can have <!ELEMENT and <!ATTLIST tags in a dtd style xml document--see Wikipedia for a thorough example.

There should be exactly one top level element. It's the only one without a <! or a <? preceding it. There may be more metadata tags after the top level element; process those first.

For the explicit parsing: First identify tags--they all start with <--then determine what kind of tag it is and what its closure looks like. <!-- is a comment tag, and cannot have -- anywhere except for its end. <? ends with ?>. <! end with >. To repeat: <!DOCTYPE can have tags nested before its closure, and there may be other nested tags I don't know of.

Once you find a tag, you'll want to find its closing tag. Check if the tag is self closing first; otherwise, find its closure.

For data structures: I would recommend a tree structure, where each element is a node, and each node has an indexed/mapped list of subelements.

Obviously, a full parser will require a lot more research; I hope this is enough to get you started.

谎言 2024-11-20 10:13:55

由于 D 与 Java 关系相当密切,因此可能会使用 ANTLR 生成 XML 解析器(因为很可能存在 XML < a href="http://en.wikipedia.org/wiki/EBNF" rel="nofollow">EBNF ANTLR 语法已经存在,您可以使用它们),然后将生成的 Java 解析器代码转换为D,可以是一个选择吗?至少这会给你一个起点,然后你可以付出一些努力来尝试专门针对 D 优化代码......

至少 ANTLR 根本不像许多人想象的那么难。在对它一无所知之后,我通过观看 3-4 个 ANTLR 上这套很棒的截屏视频< /a>.

顺便说一句,我发现 ANTLRWorks 使用起来很轻松(与 Eclipse 中使用的插件相反)截屏视频...但截屏视频内容仍然适用)。

只是我的 0.02c。

Since D is rather closely related to Java, maybe generating an XML parser with ANTLR (since there are most probably XML EBNF grammars for ANTLR already, you could then use these), and then converting the generated Java parser code to D, could be an option? At least that would give you a starting point, and you could then put some efforts in trying optimizing the code specifically for D ...

At least ANTLR is not at all as hard as many seem to think. I got started after knowing nothing about it, by watching 3-4 of this great set of screencasts on ANTLR.

Btw, I found ANTLRWorks a breeze to work with (as opposed to the Eclipse plugin used in the screencast ... but the screencast content applies anyway).

Just my 0.02c.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文