Matching tag pairs in a Treetop grammar
I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):
rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
    whitespace))+ html_close_tag <HTMLTagPair>
end
I was trying to base this on the recursive parentheses example and the negative lookahead example on the Treetop GitHub page. The other rules I've referenced are as follows:
rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end

rule tab
  "\t" {
    def content
      :tab
    end
  }
end

rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end

rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end

rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end

rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end

rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end

rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end

rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end

rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end

rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
  }
end
I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?
Comments (2)
Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.
You can only do this using either a separate rule for each HTML tag pair, or using a semantic predicate. That is, by saving the opening tag (in a sempred), then accepting (in another sempred) a closing tag only if it is the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.
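To make the save-then-compare idea concrete, here is a minimal sketch in plain Ruby rather than Treetop. It is a hand-written recursive descent parser over the standard library's StringScanner, not the grammar from either answer, and the class name TagPairParser, its MismatchError, and the node shapes are all hypothetical. It handles only attribute-free tags. The comparison marked in the comments is the check that the second sempred would perform.

```ruby
require 'strscan'

# Hypothetical helper, not part of Treetop: parses a tiny HTML subset
# (tags without attributes) and pairs each closing tag with its opening
# tag by name.
class TagPairParser
  class MismatchError < StandardError; end

  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  # Returns an array of nodes: [:text, "..."] or [:pair, name, children].
  def parse
    parse_nodes(nil)
  end

  private

  # Parses sibling nodes until the closing tag for open_name is consumed,
  # or until end of input when open_name is nil (top level).
  def parse_nodes(open_name)
    nodes = []
    until @scanner.eos?
      if @scanner.scan(%r{</([A-Za-z0-9]+)>})
        close_name = @scanner[1]
        # Same job as the second semantic predicate: accept the closing
        # tag only if it names the opening tag that was saved earlier.
        unless close_name == open_name
          raise MismatchError, "expected </#{open_name}>, got </#{close_name}>"
        end
        return nodes
      elsif @scanner.scan(/<([A-Za-z0-9]+)>/)
        name = @scanner[1] # same job as the first predicate: save the name
        nodes << [:pair, name, parse_nodes(name)]
      elsif (text = @scanner.scan(/[^<]+/))
        nodes << [:text, text]
      else
        raise MismatchError, "stray '<' at offset #{@scanner.pos}"
      end
    end
    raise MismatchError, "missing </#{open_name}>" if open_name
    nodes
  end
end

tree = TagPairParser.new("<a>hi<b>x</b></a>").parse
p tree
```

Every branch of the loop either consumes input or raises, so the loop cannot spin in place. Here the saved name lives in a method argument on the call stack, which is exactly the place a Treetop rule cannot easily reach.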
BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (probably he uses a nested MIME parser for that), but it is possible in Treetop.
In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.
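The linked parser is not reproduced here, so the following is only a guess at the general shape of a "fake input stream": an object that forwards string methods to the real input while carrying extra state. The class name ContextualInput and the methods push_tag and close_tag? are hypothetical, and SimpleDelegator is used for brevity on the assumption that forwarding every String method is acceptable; the real proxy may expose a much smaller set.

```ruby
require 'delegate'

# Hypothetical wrapper: answers String methods by delegating to the real
# input, and additionally keeps a stack of currently open tag names.
class ContextualInput < SimpleDelegator
  attr_reader :open_tags

  def initialize(string)
    super(string)
    @open_tags = []
  end

  # Intended to be called from a predicate after an opening tag matches.
  # Returns true so that the predicate itself succeeds.
  def push_tag(name)
    @open_tags.push(name)
    true
  end

  # Intended to be called from a predicate after a closing tag matches.
  # Succeeds, and pops the stack, only if name is the innermost open tag.
  def close_tag?(name)
    return false unless @open_tags.last == name
    @open_tags.pop
    true
  end
end

input = ContextualInput.new("<a></a>")
input.push_tag("a")
p input.close_tag?("b")
p input.close_tag?("a")
```

One caveat with this sketch: it mutates state from inside predicates, so a parser that backtracks or reuses memoized results could leave the stack out of step with the text actually matched. A real implementation would need to account for that.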