实施“？”组合 GLR 解析器中的（惰性“”）正则表达式模式

发布于 2024-10-06 09:34:01 字数 1054 浏览 12 评论 0原文

我已经实现了组合 GLR 解析器。其中有：

char(·) 解析器，它消耗指定的字符或字符范围。
many(·) 组合器，它重复指定的解析器从零到无限次。

示例："char('a').many()" 将匹配具有任意数量的 "a"-s 的字符串。

但是 many(·) 组合器是贪婪的，因此，例如 char('{') >>字符（'{'）>> char('a'..'z').many() >>>字符（'}'）>> char('}') （其中 ">>" 是解析器的顺序链接）将成功消耗整个 "{{foo}}some{{bar} }" 字符串。

我想实现 many(·) 的惰性版本，在前面的示例中使用它，只会消耗 "{{foo}}" 。我怎样才能做到这一点？

编辑：

也许我把你们都搞糊涂了。在我的程序中，解析器是一个函数（或 C++ 中的“函子”），它接受“步骤”并返回“步骤”森林。 “步骤”可以是 OK 类型（这意味着解析器已成功消耗了部分输入）和 FAIL 类型（这意味着解析器遇到了错误）。步骤的种类较多，但都是辅助性的。

Parser = f(Step) -> Collection of TreeNodes of Steps.

因此，当我解析输入时，我：

组合简单的预定义解析器函数以获得表示所需语法的复杂解析器函数。
从输入形成初始步骤。
将初始 Step 赋予复杂解析器函数。
使用步骤过滤 TreeNodes，仅留下 OK 的节点（如果输入中有错误，则使用最小的 FAIL-s）。
从留下的步骤中收集信息。

原文

I have implemented combinatorial GLR parsers. Among them there are:

char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.

Example: "char('a').many()" will match a string with any number of "a"-s.

But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.

I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?

Edit:

May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.

Parser = f(Step) -> Collection of TreeNodes of Steps.

So when I parse input, I:

Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

北斗星光 2024-10-13 09:34:01

我已经实现并使用 GLR 解析器作为程序转换系统的语言前端已有 15 年了。

我不知道什么是“组合 GLR 解析器”，而且我也不熟悉你的符号，所以我不太确定如何解释它。我认为这是某种柯里化函数符号？我想象你的组合器规则相当于根据终端字符定义语法，其中“char('a').many”对应于语法规则：

 char = "a" ;
 char = char "a" ;

GLR 解析器确实会生成所有可能的解析。 GLR 解析的关键见解是它对所有可能解析的伪并行处理。如果你的“组合器”可以提出多个解析（也就是说，它们产生与上面等价的语法规则），并且你确实将它们连接到 GLR 解析器，那么它们都会被尝试，并且只有那些平铺的产生式序列文本将继续存在（意味着所有有效的解析，例如不明确的解析）将继续存在。

如果您确实实现了 GLR 解析器，那么所有可能的解析的集合对您来说应该非常清楚。事实上，它并没有暗示你所实现的不是 GLR 解析器。

与任何其他解析技术一样，使用 GLR 解析器可以进行错误恢复。我们所做的是将实时解析集保留在错误点之前；当发现错误时，我们尝试（在伪并行中，如果正确弯曲，GLR 解析机制会使这变得容易）以下所有操作：a）删除有问题的标记，b）插入本质上是 FOLLOW(x) 的所有标记其中 x 是实时解析。本质上，删除标记，或插入实时解析所需的标记。然后我们再次松开 GLR 解析器。只有有效的解析（例如，修复）才能生存。如果无法处理当前令牌，则处理已删除令牌的流的解析器将继续存在。在最坏的情况下，GLR 解析器错误恢复最终会丢弃所有标记到 EOF。这样做的一个严重缺点是 GLR 解析器的运行时间在解析错误时急剧增长；如果一处有很多错误，则错误恢复时间可能会非常长。

I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.

I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:

 char = "a" ;
 char = char "a" ;

GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.

If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.

Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.

回复收藏 0 原文