Matching tag pairs in a Treetop grammar

Posted on 2024-10-01 21:05:26

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):

rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
    whitespace))+ html_close_tag <HTMLTagPair>
end

I was trying to base this on the recursive parentheses example and the negative lookahead example on the Treetop GitHub page. The other rules I've referenced are as follows:

rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end

rule tab
  "\t" {
    def content
      :tab
    end
  }
end

rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end

rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end

rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end

rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end

rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end

rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end

rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end

rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end

rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
  }
end

I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?

Comments (2)

苏大泽ㄣ 2024-10-08 21:05:27

Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.

grammar SimpleXML
  rule document
    (text / tag)*
  end

  rule text
    [^<]+
  end

  rule tag
    "<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
  end
end

我恋#小黄人 2024-10-08 21:05:27

You can only do this using either a separate rule for each HTML tag pair, or using a semantic predicate. That is, by saving the opening tag (in a sempred), then accepting (in another sempred) a closing tag only if it is the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.

BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (probably he uses a nested Mime parser for that), but it is possible in Treetop.

In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.
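
To make the "save the opening tag, then accept the closing tag only if it is the same" idea concrete, here is a minimal plain-Ruby sketch. It is not Treetop code and not the technique used in the ActiveFacts parser; it only mirrors the two steps with a stack and StringScanner from the standard library. The helper name match_tag_pairs is hypothetical, and the tag syntax is simplified (for instance, a ">" inside an attribute value is not handled).

```ruby
require 'strscan'

# Hypothetical helper (not part of Treetop): remembers each opening tag name
# on a stack and accepts a closing tag only when its name equals the most
# recently saved one. Returns the matched tag names (innermost pairs first),
# or nil when the tags do not pair up.
def match_tag_pairs(input)
  scanner = StringScanner.new(input)
  open_tags = [] # saved context: tags still waiting for their close
  pairs = []     # names of completed pairs

  until scanner.eos?
    if scanner.scan(%r{</([A-Za-z0-9]+)>})
      # Second step: accept the close only if it matches the saved open.
      return nil unless open_tags.last == scanner[1]
      pairs << open_tags.pop
    elsif scanner.scan(%r{<[A-Za-z0-9]+[^>]*/>})
      # Empty tag such as <br/>: it has no partner, so nothing is saved.
    elsif scanner.scan(/<([A-Za-z0-9]+)[^>]*>/)
      # First step: save the opening tag name.
      open_tags << scanner[1]
    elsif scanner.scan(/[^<]+/)
      # Plain text between tags.
    else
      return nil # a stray "<" that is not a well-formed tag
    end
  end

  open_tags.empty? ? pairs : nil
end

p match_tag_pairs("<a href='x'><b>hi</b></a>")
p match_tag_pairs("<a><b></a></b>")
```

The stack plays the role of the saved context that Treetop does not give you a convenient place for; in a grammar, the comparison would live in the second semantic predicate instead.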
