实施“*?”组合 GLR 解析器中的(惰性“*”)正则表达式模式

发布于 2024-10-06 09:34:01 字数 1054 浏览 6 评论 0原文

我已经实现了组合 GLR 解析器。其中有:

  • char(·) 解析器,它消耗指定的字符或字符范围。
  • many(·) 组合器,它重复指定的解析器从零到无限次。

示例:"char('a').many()" 将匹配具有任意数量的 "a"-s 的字符串。

但是 many(·) 组合器是贪婪的,因此,例如 char('{') >>字符('{')>> char('a'..'z').many() >>>字符('}')>> char('}') (其中 ">>" 是解析器的顺序链接)将成功消耗整个 "{{foo}}some{{bar} }" 字符串。

我想实现 many(·) 的惰性版本,在前面的示例中使用它,只会消耗 "{{foo}}" 。我怎样才能做到这一点?

编辑:

也许我把你们都搞糊涂了。在我的程序中,解析器是一个函数(或 C++ 中的“函子”),它接受“步骤”并返回“步骤”森林。 “步骤”可以是 OK 类型(这意味着解析器已成功消耗了部分输入)和 FAIL 类型(这意味着解析器遇到了错误)。步骤的种类较多,但都是辅助性的。

Parser = f(Step) -> Collection of TreeNodes of Steps.

因此,当我解析输入时,我:

  • 组合简单的预定义解析器函数以获得表示所需语法的复杂解析器函数。

  • 从输入形成初始步骤。

  • 将初始 Step 赋予复杂解析器函数。

  • 使用步骤过滤 TreeNodes,仅留下 OK 的节点(如果输入中有错误,则使用最小的 FAIL-s)。

  • 从留下的步骤中收集信息。

I have implemented combinatorial GLR parsers. Among them there are:

  • char(·) parser which consumes specified character or range of characters.
  • many(·) combinator which repeats specified parser from zero to infinite times.

Example: "char('a').many()" will match a string with any number of "a"-s.

But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.

I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?

Edit:

May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.

Parser = f(Step) -> Collection of TreeNodes of Steps.

So when I parse input, I:

  • Compose simple predefined Parser functions to get complex Parser function representing required grammar.

  • Form initial Step from the input.

  • Give the initial Step to the complex Parser function.

  • Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).

  • Gather information from Steps which were left.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

北斗星光 2024-10-13 09:34:01

我已经实现并使用 GLR 解析器作为程序转换系统的语言前端已有 15 年了。

我不知道什么是“组合 GLR 解析器”,而且我也不熟悉你的符号,所以我不太确定如何解释它。我认为这是某种柯里化函数符号?我想象你的组合器规则相当于根据终端字符定义语法,其中“char('a').many”对应于语法规则:

 char = "a" ;
 char = char "a" ;

GLR 解析器确实会生成所有可能的解析。 GLR 解析的关键见解是它对所有可能解析的伪并行处理。如果你的“组合器”可以提出多个解析(也就是说,它们产生与上面等价的语法规则),并且你确实将它们连接到 GLR 解析器,那么它们都会被尝试,并且只有那些平铺的产生式序列文本将继续存在(意味着所有有效的解析,例如不明确的解析)将继续存在。

如果您确实实现了 GLR 解析器,那么所有可能的解析的集合对您来说应该非常清楚。事实上,它并没有暗示你所实现的不是 GLR 解析器。

与任何其他解析技术一样,使用 GLR 解析器可以进行错误恢复。我们所做的是将实时解析集保留在错误点之前;当发现错误时,我们尝试(在伪并行中,如果正确弯曲,GLR 解析机制会使这变得容易)以下所有操作:a)删除有问题的标记,b)插入本质上是 FOLLOW(x) 的所有标记其中 x 是实时解析。本质上,删除标记,或插入实时解析所需的标记。然后我们再次松开 GLR 解析器。只有有效的解析(例如,修复)才能生存。如果无法处理当前令牌,则处理已删除令牌的流的解析器将继续存在。在最坏的情况下,GLR 解析器错误恢复最终会丢弃所有标记到 EOF。这样做的一个严重缺点是 GLR 解析器的运行时间在解析错误时急剧增长;如果一处有很多错误,则错误恢复时间可能会非常长。

I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.

I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:

 char = "a" ;
 char = char "a" ;

GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.

If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.

Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.

魂ガ小子 2024-10-13 09:34:01

GLR 解析器不会生成所有可能的输入解析吗?然后解决歧义就是选择您喜欢的解析。为此,我认为解析森林的元素需要根据生成它们的组合器类型(急切的或惰性的)进行标记。 (一般来说,在看到所有输入之前,您无法逐步解决歧义问题。)

(这个答案基于我模糊的记忆和对 GLR 解析的模糊可能的误解。希望有专家能够过来。)

Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)

(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)

我很坚强 2024-10-13 09:34:01

考虑正则表达式 <.*?> 和输入 bcef。这应该找到 ,并且没有其他匹配,对吧?

现在考虑具有相同输入的正则表达式<.*?>e。这应该找到 bce,对吧?

这就造成了一个两难的境地。为了用户的利益,我们希望根据其两个操作数来理解组合器 >> 的行为。然而,没有办法根据第一个解析器的发现来产生第二个解析器的行为。

一个答案是每个解析器生成所有解析的序列,按偏好排序,而不是所有解析器的无序集合。贪婪匹配将返回从最长到最短排序的匹配;非贪婪,从最短到最长。

Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?

Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?

This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.

One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.

挽你眉间 2024-10-13 09:34:01

非贪婪功能只不过是一种消歧机制。如果你确实有一个广义的解析器(不需要消除歧义来产生其结果),那么“非贪婪”是没有意义的;无论运算符是否“非贪婪”,都会返回相同的结果。

非贪婪消歧行为可以应用于广义解析器提供的完整结果集。从左到右工作,过滤与非贪婪运算符相对应的不明确子组,以使用最短匹配,这仍然导致剩余输入的成功解析。

Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".

Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文