Parsing a context-free language in a token stream
The problem
Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?
Example:
Grammar
S -> ASB | AB
A -> a
B -> b
(So essentially, a number of a's followed by an equal number of b's)
Stream:
aabaaabbc...
Expected result:
- Match starting at position 1: ab
- Match starting at position 4: aabb
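For this particular example grammar (aⁿbⁿ), the expected result can be reproduced with a naive scan. This is only a brute-force baseline to pin down the expected output, not the efficient approach being asked about (the function name is mine):

```python
def anbn_matches(stream):
    """Find maximal a^n b^n fragments by brute force: at each
    position count a run of a's, then match as many b's as
    possible (at most as many as there were a's)."""
    results = []
    i = 0
    while i < len(stream):
        n = 0
        while i + n < len(stream) and stream[i + n] == "a":
            n += 1
        m = 0
        while m < n and i + n + m < len(stream) and stream[i + n + m] == "b":
            m += 1
        if n and m:
            # the match uses the last m a's of the run, so it starts at i + n - m
            results.append((i + n - m, stream[i + n - m : i + n + m]))
            i += n + m
        else:
            i += max(n, 1)
    return results

print(anbn_matches("aabaaabbc"))  # [(1, 'ab'), (4, 'aabb')]
```

Positions are 0-based, matching the positions in the expected result above.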
Of course the key is "effectively": without testing too many hopeless candidates for too long. The only thing I know about my data is that although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals), while the stream itself will be quite long (>10000 terminals).
Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.
Where should I start? Which type of parser can be adapted to this type of work?
"Arbitrary grammar" makes me suggest you look at wberry's comment.
How complex are these grammars? Is there a manual intervention step?
I'll make an attempt. Suppose I modified your example grammar by adding a new start symbol S' and a garbage production G on top of the original rules, so that G matches garbage and S' derives many S fragments with garbage in between. (I may have been careless with my production rules, but you get the idea.) I think that solves your problem: you just need a parser that will match the other rules before G. You may have to adjust these productions for your particular parser, and I can almost guarantee the rule ordering will change depending on the parser. Also, since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme, and then G must be extended to cover all possible lexemes. Depending on your specifics, this might not be any better (efficiency-wise) than simply starting an attempt at each position in the stream.
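The exact augmented productions were not shown; one plausible shape for them (my sketch, assuming G may match any single terminal, including a and b) would be:

```
S' -> S' S | S' G | S | G    -- new start symbol: S fragments with garbage in between
G  -> a | b | c | ...        -- garbage: any single terminal
S  -> ASB | AB               -- original rules, unchanged
A  -> a
B  -> b
```

Note this grammar is highly ambiguous (every terminal also matches G), which is exactly why the parser must prefer the other rules over G.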
But, assuming my production rules are fixed up (both for correctness and for your particular flavor of parser), this should not only match fragments in the stream, it should also give you a parse tree for the whole stream. You are then only interested in the subtrees rooted at nodes of type S.
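As an alternative to grammar surgery, the short-fragment constraint (<20 terminals) can be exploited directly: run a bottom-up chart parse (CYK) over the stream but cap the span length, which makes the cost roughly linear in the stream length. A sketch using a hand-made CNF conversion of the example grammar (the conversion and all names are mine; note it also reports matches nested inside longer matches):

```python
from collections import defaultdict

def find_fragments(stream, max_len=20):
    """Length-capped CYK over a CNF form of the example grammar:
        S -> A X | A B,  X -> S B,  A -> a,  B -> b
    Returns (start, fragment) for every span of length <= max_len
    that derives S. Cost is O(len(stream) * max_len^2 * |rules|)."""
    unary = {"a": {"A"}, "b": {"B"}}
    binary = {("A", "X"): "S", ("A", "B"): "S", ("S", "B"): "X"}
    n = len(stream)
    cell = defaultdict(set)  # cell[(i, j)] = nonterminals deriving stream[i:j]
    for i, tok in enumerate(stream):
        cell[(i, i + 1)] = set(unary.get(tok, ()))
    matches = []
    for length in range(2, max_len + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for left in cell[(i, k)]:
                    for right in cell[(k, j)]:
                        nt = binary.get((left, right))
                        if nt:
                            cell[(i, j)].add(nt)
            if "S" in cell[(i, j)]:
                matches.append((i, stream[i:j]))
    return matches
```

On "aabaaabbc" this reports (1, 'ab') and (4, 'aabb'), plus the nested (5, 'ab') inside "aabb"; keeping only maximal spans is a simple post-processing step.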