Parsing org-mode files in JavaScript

Posted 2024-11-15 20:59:39

I have been trying for some time now to write a parser for org-mode in JavaScript. I had no trouble at all parsing the outline (which I did in a few minutes), but parsing the actual content is far more difficult, and I'm having trouble with nested lists, for example.

* This is a heading
  P1 Start a paragraph here but since it is the first indentation level
the paragraph may have a lower indentation on the next line
    or a greater one for that matter.

  + LI1.1 I am beginning a list here
  + LI1.2 Here begins another list item
    which continues here
      and also here
  P2 but is broken here (this line becomes a paragraph
  outside of the first list).
  + LI2.1 P1 Second list item.
    - LI2.1.1 Inner list with a simple item
    - LI2.1.2 P1 and with an item containing several paragraphs.
      Here is the second line in the item, and now

      LI2.1.2 P2 I begin a new paragraph still in the same item. 
        The indentation can be only higher
    LI2.1 P2 but if the indentation is lower, it breaks the item, 
    (and the whole list), and this is a paragraph in the LI2.1
    list item

    - LI 2.2.1 You get the picture
  P3 Just plain text outside of the list.

(In the above example, the PX and LIX.Y labels are only there to show explicitly where new blocks begin; they would not be present in the actual document. P stands for paragraph and LI for list item. In the HTML world, PX would be the beginning of a <p> tag. The numbering is just there to help keep track of the nesting and of changes from one list to another.)

I am wondering about the right strategy for parsing this kind of nested block structure, where whitespace is significant. Clearly I can parse line by line without any backtracking, so it must be quite simple, but for some reason I couldn't manage to do it. I tried to get inspiration from Markdown parsers and similar tools, which are supposed to have the same kind of nesting, but the ones I saw seemed very hacky and full of regexes, and I hoped I could write something cleaner. The org-mode "grammar" is quite large when you come to think about it, so the parser will grow little by little, and I'd like the whole thing to be maintainable and to make it easy to plug in new features.

Can anyone with experience in parsing such things give me some pointers?

青春有你 2024-11-22 20:59:39

I like parsers and compiler theory, so I have written a small parser (by hand) that is able to parse your example snippet into an XML DOM Document object. It should be possible to modify it so that it produces another type of tree structure, such as a custom AST (abstract syntax tree).

I've tried to keep the code easy to read, so that you can see how such a parser works.

Ask me if you need some more explanations, or want me to modify it a little.

With your example snippet as input, the statement result = new OrgModeParser().parse(input); result.xml returned:

<org-mode-document indentLevel="-1">
    <section indentLevel="0">
        <header indentLevel="0">This is a heading</header>
        <paragraph indentLevel="1">P1 Start a paragraph here but since it is the first indentation level the paragraph may have a lower indentation on the next line or a greater one for that matter.</paragraph>
        <list indentLevel="1">
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI1.1 I am beginning a list here</paragraph>
            </list-item>
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI1.2 Here begins another list item which continues here and also here</paragraph>
            </list-item>
        </list>
        <paragraph indentLevel="1">P2 but is broken here (this line becomes a paragraph outside of the first list).</paragraph>
        <list indentLevel="1">
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI2.1 P1 Second list item.</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.1 Inner list with a simple item</paragraph>
                    </list-item>
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.2 P1 and with an item containing several paragraphs. Here is the second line in the item, and now</paragraph>
                        <paragraph indentLevel="3">LI2.1.2 P2 I begin a new paragraph still in the same item.  The indentation can be only higher</paragraph>
                    </list-item>
                </list>
                <paragraph indentLevel="2">LI2.1 P2 but if the indentation is lower, it breaks the item,  (and the whole list), and this is a paragraph in the LI2.1 list item</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.2.1 You get the picture</paragraph>
                    </list-item>
                </list>
            </list-item>
        </list>
        <paragraph indentLevel="1">P3 Just plain text outside of the list.</paragraph>
    </section>
</org-mode-document>
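
The answer mentions that the DOM output could be swapped for a custom AST. As a rough sketch only, a converter from the DOM tree above to plain objects might look like the following. The helper name domToAst and the shape of the resulting objects are assumptions made for illustration; they are not part of the answer's parser. It relies only on standard DOM members (tagName, nodeType, nodeValue, childNodes, getAttribute).

```javascript
// Hypothetical helper (not part of the original parser): turns a DOM element
// into a plain object of the form { type, indentLevel, text, children }.
function domToAst(element) {
    var ast = {
        type: element.tagName,
        indentLevel: Number(element.getAttribute("indentLevel")),
        text: "",
        children: []
    };
    for (var i = 0; i < element.childNodes.length; i++) {
        var child = element.childNodes[i];
        if (child.nodeType === 1) {            // 1 = element node
            ast.children.push(domToAst(child));
        } else if (child.nodeType === 3) {     // 3 = text node
            ast.text += child.nodeValue;
        }
    }
    return ast;
}
```

Under these assumptions, calling domToAst(result.documentElement) on the parser's result would give a nested object whose top-level type is "org-mode-document".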

The code:

/*
 * File: orgmodparser.js
 * Basic usage: var object = new OrgModeParser().parse(input); 
 * Works on: JScript and JScript.Net.
 * - For other JavaScript platforms, just replace or override the .createRoot() method
 */

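// Declared with var to avoid creating an implicit global
var OrgModeParser = function (options) {
    // Any option passed in overrides the matching property (e.g. createRoot, INDENT_WIDTH)
    if (typeof options == "object") {
        for (var i in options) {
            this[i] = options[i];
        }
    }
};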

OrgModeParser.prototype = {

    "INDENT_WIDTH" :    2,  // Two spaces
    "LINE_SEPARATOR" :  "\r\n",

    /*
     * Each line in the input will be matched against this regexp.
     * Only spaces are allowed as indentation characters.
     * The symbols '*', '+' and '-' will be recognized, but only if they are followed by at least one space.
     * Add other symbols in this regexp if you want the parser to recognize them
     */
    "re" :    /^( *)([\+\-\*] +)?(.*)/,

    // This function must return a valid XML DOM document object
    createRoot :    function () {
        var err, progIDs = ["Msxml2.DOMDocument.6.0", "Msxml2.DOMDocument.5.0", "Msxml2.DOMDocument.4.0", "Msxml2.DOMDocument.3.0", "Msxml2.DOMDocument.2.0", "Msxml2.DOMDocument.1.0", "Msxml2.DOMDocument"];
        for (var i = 0; i < progIDs.length; i++) {
            try {
                return new ActiveXObject(progIDs[i]);
            }
            catch (err) {
                // This ProgID is not available; try the next one
            }
        }
        alert("Org-mode parser - Error - Failed to instantiate root object");
        return null;
    },

    parse : function (text) {

        function createNode (tagName, text) {
            var node = root.createElement(tagName);
            node.setAttribute("indentLevel", level);
            if (text) {
                var textNode = root.createTextNode(text);
                node.appendChild(textNode);
            }
            return node;
        }

        function getContainer () {
            if (lastNode.tagName == "section") { return lastNode; }
            var anc = lastNode.parentNode;
            while (anc) {
                if (modifier == "+" || modifier == "-") {
                    if (anc.getAttribute("indentLevel") == level && anc.tagName == "list") { return anc; }
                }
                if (anc.getAttribute("indentLevel") < level && anc.tagName != "paragraph") { return anc; }
                anc = anc.parentNode;
            }
            alert("Org-mode parser - Internal error at line: "+i);return null;
        }

        if (typeof text != "string") { alert("Org-mode - Type error - Input must be of type 'string'"); return null; }

        var body;
        var content;     // The text of the current line, without its indentation and modifier
        var lastNode;    // The node being processed
        var indent;      // The indentation of the current line
        var isAfterDoubleLineBreak;  // Indicates whether the current line follows a double line break
        var line;        // The current line being processed
        var level;       // The current indentation level, given by indent.length / this.INDENT_WIDTH. Not to be confused with the nesting level
        var lines;       // Array. Empty lines are included.
        var match;
        var modifier;    // This can be "*", "+", "-" or ""
        var root;

        isAfterDoubleLineBreak = false;
        level = -1;      // Indentation level is -1 initially; it will be 0 for the first "*" block
        lines = text.split(this.LINE_SEPARATOR);
        root = this.createRoot();
        if (!root) { return null; }  // createRoot() has already reported the error
        body = root.appendChild(createNode("org-mode-document"));
        lastNode = body;

        for (var i = 0; i < lines.length; i++) {
            line = lines[i];
            match = line.match(this.re);
            if (match === null) { alert("org-mode parse error at line: " + i); return null; }
            indent = match[1];
            level = indent.length / this.INDENT_WIDTH;
            // An unmatched optional group is "" in JScript but undefined in other engines; normalize to ""
            modifier = match[2] ? match[2].charAt(0) : "";
            content = match[3];

            // These conditions tell the parser what to do when encountering a line with a given modifier
            if (content === "") { doubleLineBreak(); continue; }
            else if (modifier == "*") { if (star() === null) { return null; } }
            else if (modifier == "+" || modifier == "-") { plus(); }
            else if (modifier == "") { noModifier(); }
            isAfterDoubleLineBreak = false;
        }
        return root;


        // Starts a new section; returns null if the '*' is misplaced
        function star() {
            // The '*' modifier is not allowed on an indented line
            if (indent) { alert("Org-mode parse error: unexpected '*' symbol at line " + i); return null; }
            lastNode = body.appendChild(createNode("section"));
            // The section remains the current node
            lastNode.appendChild(createNode("header", content));
        }

        // Handles both '+' and '-' list items
        function plus() {
            var container = getContainer();
            var tn = container.tagName;
            if (tn == "section" || tn == "list-item") {
                // No list is open at this level yet: open one, then add the item
                lastNode = container.appendChild(createNode("list"));
                lastNode = lastNode.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            } else if (tn == "list") {
                // A list is already open at this level: add a sibling item
                lastNode = container.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            } else {
                alert("Org-mode parser - Internal error - Bad container tag name: " + tn);
            }
            // The text of a list item sits one level deeper than its bullet
            lastNode.setAttribute("indentLevel", Number(lastNode.getAttribute("indentLevel")) + 1);
        }

        function noModifier() {
            // Continue the current paragraph unless a blank line or a lower indentation breaks it
            if (lastNode.tagName == "paragraph" && !isAfterDoubleLineBreak && (lastNode.getAttribute("indentLevel") == 1 || level >= lastNode.getAttribute("indentLevel"))) {
                lastNode.childNodes[0].appendData(" " + content);
            } else {
                var container = getContainer();
                lastNode = container.appendChild(createNode("paragraph", content));
            }
        }

        function doubleLineBreak() {
            // Skip any further blank lines so that a run of them counts as a single break
            while (lines[i+1] && /^\s*$/.test(lines[i+1])) { i++; }
            isAfterDoubleLineBreak = true;
        }
    }
};
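
To make the first step of the loop easier to follow, here is a small standalone sketch of how the line regexp splits each line into an indentation level, a modifier and the remaining content. It reuses the same pattern and the same two-space indent width as the parser above; the helper name classifyLine is hypothetical and exists only for this illustration.

```javascript
// Same pattern as the parser's "re": indentation, optional modifier, content.
var LINE_RE = /^( *)([\+\-\*] +)?(.*)/;
var INDENT_WIDTH = 2;  // Two spaces per level, as in the parser

// Hypothetical helper, for illustration only.
function classifyLine(line) {
    var match = LINE_RE.exec(line);
    return {
        level: match[1].length / INDENT_WIDTH,
        // The optional group may be undefined when no modifier is present
        modifier: match[2] ? match[2].charAt(0) : "",
        content: match[3]
    };
}

console.log(classifyLine("* This is a heading"));
console.log(classifyLine("    - LI2.1.1 Inner list with a simple item"));
console.log(classifyLine("      and also here"));
```

Everything after this step is bookkeeping: the parser compares the level and modifier of the current line with those of the nodes already open, and decides whether to continue a paragraph, add a list item, or climb back up to an enclosing container.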
