Matching pairs of tags in a Treetop grammar

Posted on 2024-10-01 21:05:26

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):

rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
    whitespace))+ html_close_tag <HTMLTagPair>
end

I was trying to base this on the recursive parentheses example and the negative lookahead example on the Treetop GitHub page. The other rules I've referenced are as follows:

rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end

rule tab
  "\t" {
    def content
      :tab
    end
  }
end

rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end

rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end

rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end

rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end

rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end

rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end

rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end

rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end

rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
  }
end

I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?
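
One possible cause of the hang, offered as an assumption rather than something confirmed in this thread: the `''` alternative in `html_tag_pair` always succeeds without consuming input, so a `+` repetition around it can keep succeeding at the same offset forever. The plain-Ruby sketch below imitates that situation. The names `EMPTY`, `TEXT`, `choice` and `one_or_more` are hypothetical helpers, not Treetop's API, the `!html_close_tag` lookahead is left out, and the iteration cap only stands in for the real infinite loop.

```ruby
# A matcher takes (input, pos) and returns the new position, or nil on failure.

# Like '' in the grammar: always succeeds and consumes nothing.
EMPTY = ->(_input, pos) { pos }

# Like [^<]+ in the grammar: consumes at least one non-"<" character.
TEXT = lambda do |input, pos|
  m = /\G[^<]+/.match(input, pos)
  m && m.end(0)
end

# Ordered choice: the first alternative that succeeds wins.
def choice(*alts)
  lambda do |input, pos|
    result = nil
    alts.each do |alt|
      result = alt.call(input, pos)
      break if result
    end
    result
  end
end

# Repeat the matcher while it succeeds. Returns :stuck once the cap is hit,
# which is where an uncapped loop would spin forever.
def one_or_more(matcher, input, pos, max_iterations: 50)
  count = 0
  while (next_pos = matcher.call(input, pos))
    count += 1
    return :stuck if count >= max_iterations
    pos = next_pos
  end
  count.zero? ? nil : pos
end
```

With the empty alternative listed before `text`, the repetition never advances; dropping it lets the loop stop at the first `<`. If this is indeed the cause, removing `''` from the choice would be the first thing to try.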

苏大泽ㄣ 2024-10-08 21:05:27

Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.

grammar SimpleXML
  rule document
    (text / tag)*
  end

  rule text
    [^<]+
  end

  rule tag
    "<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
  end
end
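
To make the predicate's role easier to see outside Treetop, here is a rough plain-Ruby equivalent built on the standard `strscan` library. It is only a sketch: `parse_document`, `parse_content` and `parse_tag` are hypothetical names, and unlike the grammar above the open-tag pattern refuses a leading `/` so that a closing tag is never read as an opening one.

```ruby
require "strscan"

# Matches (text / tag)* and returns the list of parsed nodes.
def parse_content(scanner)
  nodes = []
  loop do
    if (text = scanner.scan(/[^<]+/))
      nodes << text
    elsif (tag = parse_tag(scanner))
      nodes << tag
    else
      break
    end
  end
  nodes
end

# Matches "<name>" content "</name>" and returns [name, children].
# On failure the scanner is rewound and nil is returned.
def parse_tag(scanner)
  start = scanner.pos
  return nil unless scanner.scan(/<([^>\/][^>]*)>/)

  open_name = scanner[1]
  children = parse_content(scanner)

  # This equality check plays the part of the semantic predicate:
  # the closing tag is accepted only if it names the same tag.
  if scanner.scan(/<\/([^>]+)>/) && scanner[1] == open_name
    [open_name, children]
  else
    scanner.pos = start
    nil
  end
end

# Returns the node list, or nil if the whole input could not be parsed.
def parse_document(source)
  scanner = StringScanner.new(source)
  nodes = parse_content(scanner)
  scanner.eos? ? nodes : nil
end
```

For example, `parse_document("<a>hi<b>x</b></a>")` yields a nested structure, while `parse_document("<a><b>x</a>")` returns `nil` because `</a>` does not match the innermost open tag.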

我恋#小黄人 2024-10-08 21:05:27

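You can only do this either with a separate rule for each HTML tag pair or with semantic predicates. That is, by saving the opening tag (in one sempred) and then accepting the closing tag (in another sempred) only if it is the same tag. This is much harder to do in Treetop than it should be, because there is no convenient place to save the context and you can't peek up the parser stack, but it is possible.

By the way, the same problem occurs when parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (he probably uses a nested MIME parser), but it is possible in Treetop.

In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save the context in a fake input stream - you can see which methods it has to support - because "input" is available on all SyntaxNodes. I use sempreds there for a different reason, but some of the techniques are applicable.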
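
The "save the opening tag, then accept only the same closing tag" idea can be illustrated without Treetop at all. The sketch below is not what the linked parser does; it is a plain-Ruby stand-in in which an array holds the saved context. `match_tag_pairs` is a hypothetical name, and the tag-name pattern is borrowed from the question's `html_tag_name` rule.

```ruby
# Returns [name, open_offset, close_offset] for every matched pair, in the
# order the pairs are closed, or nil if the tags do not nest properly.
def match_tag_pairs(source)
  open_tags = []  # saved context: one [name, offset] per unclosed opening tag
  pairs = []

  source.scan(%r{<(/?)([A-Za-z0-9]+)([^>]*)>}) do |slash, name, rest|
    next if rest.end_with?("/")  # skip empty tags such as <br/>

    offset = Regexp.last_match.begin(0)
    if slash.empty?
      # First "predicate": remember the opening tag.
      open_tags.push([name, offset])
    else
      # Second "predicate": accept the closing tag only if it is the same tag.
      saved_name, saved_offset = open_tags.last
      return nil unless saved_name == name

      open_tags.pop
      pairs << [name, saved_offset, offset]
    end
  end

  open_tags.empty? ? pairs : nil
end
```

Inside a Treetop grammar the hard part is where to keep `open_tags`, which is why the answer resorts to hanging the context off the input object.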
