How can I further improve the error messages in a parser based on Scala parser combinators?

Posted on 2024-09-03 02:34:41

I've coded a parser based on Scala parser combinators:

import scala.util.parsing.combinator.{ImplicitConversions, PackratParsers, RegexParsers}
import scala.util.parsing.input.CharArrayReader

class SxmlParser extends RegexParsers with ImplicitConversions with PackratParsers {
    [...]
    // A document is any sequence of the top-level productions.
    lazy val document: PackratParser[AstNodeDocument] =
        ((procinst | element | comment | cdata | whitespace | text)*) ^^ {
            AstNodeDocument(_)
        }
    [...]
}
object SxmlParser {
    def parse(text: String): AstNodeDocument = {
        var ast = AstNodeDocument()
        val parser = new SxmlParser()
        val result = parser.parseAll(parser.document, new CharArrayReader(text.toArray))
        result match {
            case parser.Success(x, _) => ast = x
            // On failure, report the message and the position where parsing stopped.
            case parser.NoSuccess(err, next) => {
                tool.die("failed to parse SXML input " +
                    "(line " + next.pos.line + ", column " + next.pos.column + "):\n" +
                    err + "\n" +
                    next.pos.longString)
            }
        }
        ast
    }
}

Usually the resulting parse error messages are rather nice. But sometimes the message is just:

sxml: ERROR: failed to parse SXML input (line 32, column 1):
`"' expected but `' found
^

This happens if a quote character is not closed and the parser reaches the end of the text (EOT). What I would like to see here is (1) which production the parser was in when it expected the '"' (I have several), and (2) where in the input that production started parsing (which indicates where the opening quote is). Does anybody know how I can improve the error messages and include more information about the parser's internal state at the time of the error, perhaps something like a stack trace of production rules, or whatever else can reasonably be provided to better identify the error location? BTW, the "line 32, column 1" above is actually the EOT position and hence of no use here.
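
To make requirements (1) and (2) concrete, here is a minimal sketch of one possible approach, not a confirmed solution: wrap each production in a small combinator that remembers where it started and appends that context to any failure message. The names `QuoteDemo`, `production`, `quoted`, and `describe` are hypothetical and only for illustration; the grammar is a stand-in, not the real SXML grammar. The sketch assumes the scala-parser-combinators library is on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

object QuoteDemo extends RegexParsers {

  // Hypothetical helper (not part of the library): runs `p` and, if it fails,
  // appends the production name and the position where the production began.
  def production[T](name: String)(p: => Parser[T]): Parser[T] = {
    lazy val inner = p // evaluate the by-name parser only once
    Parser { in =>
      def annotate(msg: String): String =
        s"$msg (in production '$name', which started at " +
          s"line ${in.pos.line}, column ${in.pos.column})"
      inner(in) match {
        case Failure(msg, next) => Failure(annotate(msg), next)
        case Error(msg, next)   => Error(annotate(msg), next)
        case success            => success
      }
    }
  }

  // Illustrative stand-in for one of the grammar's quoted productions.
  lazy val quoted: Parser[String] =
    production("quoted string")("\"" ~> """[^"]*""".r <~ "\"")

  // Returns the parsed value, or the annotated error message on failure.
  def describe(input: String): String =
    parseAll(quoted, input) match {
      case Success(value, _) => s"ok: $value"
      case ns: NoSuccess     => ns.msg
    }
}
```

Calling `QuoteDemo.describe("\"abc\ndef")` should yield the library's usual "expected but ... found" text followed by the production name and the start position (line 1, column 1), while the failure itself still points at the EOT on line 2. Note that the recorded start position is taken before any leading whitespace is skipped. For ad-hoc debugging, the library's own `log(p)(name)` combinator, which prints each attempt and its result, may also be worth a look.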
