具有错误恢复功能的 GLR 解析器:输入出现错误时有太多选择
序言
我编写了具有错误恢复功能的 GLR 解析器。当遇到错误时,它会分为以下几种选择:
- 将预期的元素插入输入中(可能是用户刚刚错过了它)并照常继续。
- 将错误的元素替换为预期的元素(可能是用户刚刚输入错误)并照常继续。
- 跳过错误的元素,如果下一个元素也错误,则转到#2。
但是,如果输入有很多错误(例如,用户错误地将 JPEG 文件提供给解析器),则替代方案的数量会呈指数级增长。
示例
这样的解析器对应于以下语法:
Program -> Identifier WS Identifier WS '=' WS Identifier
Identifier -> ('a'..'z' | 'A'..'Z' | '0'..'9')*
WS -> ' '*
应用于以下文本:
x = "abc\"def"; y = "ghi\"jkl";
在中等现代的台式计算机上失败并显示“内存不足”。
问题
如果输入错误,如何减少备选数量?
Preamble
I have written GLR-parser with error recovery. When it encounters an error it splits into following alternatives:
- Insert the expected element into input (may be the user just missed it) and proceed as usual.
- Replace the erroneous element with expected one (may be the user just made a mistype) and proceed as usual.
- Skip erroneous element and if next element is also erroneous then go to #2.
But if input has a lot of errors (for example, user has by mistake given JPEG file to the parser) a number of alternatives grows exponentially.
Example
Such a parser corresponding to the following grammar:
Program -> Identifier WS Identifier WS '=' WS Identifier
Identifier -> ('a'..'z' | 'A'..'Z' | '0'..'9')*
WS -> ' '*
applied to the following text:
x = "abc\"def"; y = "ghi\"jkl";
fails with "out of memory" on moderately modern desktop computer.
Question
How to reduce number of alternatives in case of errors in input?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
在字符级别进行 GLR(因此解析)错误纠正是可能的,但会加剧您的问题。
我们使用的 GLR 错误恢复过程对令牌进行操作,因此它并没有那么糟糕。
但当输入有大量错误时,就很难恢复。更复杂的错误恢复方案基本上使用解析器来识别输入中语言的有效子字符串,然后尝试将子字符串修补在一起以获得结果。这是相当雄心勃勃的。
我已经构建了具有错误恢复功能的 GLR 解析器。我并没有那么雄心勃勃。一般来说
当活动解析器的数量超过“大量”(例如,10,000)或遇到的语法错误数量超过阈值(例如,10或20)时,解析器通常会中止。如果解析器在最后一秒没有推进输入流,您可能会考虑中止解析器,这是一个间接标志,它有太多实时解析器。
Doing GLR (parsing therefore) error correction at the character level is possible but aggravates your problem.
The GLR error recovery procedure we use operates on tokens, so it isn't as bad.
But when the input has a huge number of errors, it is pretty hard to recover. More sophisticated error recovery schemes basically use the parser to identify valid substrings of the language in the input, and then attempts to patch the substrings together to get the result. That's pretty ambitious.
I've built GLR parsers with error recovery. I wasn't that ambitious. In general
the parser mostly just aborts when the number of live parsers gets above "a large number" (e.g., 10,000) or the number of syntax errors encountered exceed a threshold (e.g, 10 or 20). You might consider aborting the parser if it hasn't advanced the input stream in the last second, which is an indirect sign it has too many live parsers.