Scala, parser combinators for tree-structured data
How can parsers be used to parse records that span multiple lines? I need to parse tree data (and eventually transform it into a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but it's not clear whether this is even the best approach with Scala parsers. The question is really more about the problem-solving approach than about debugging the existing code.
The EBNF-ish grammar is:
SP = " "
CRLF = "\r\n"
level = "0" | "1" | "2" | "3"
varName = {alphanum}
varValue = {alphanum}
recordBegin = "0", varName
recordItem = level, varName, [varValue]
record = recordBegin, {recordItem}
file = {record}
An attempt to implement and test the grammar:
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val skipWhitespace = false
def CRLF = "\r\n" | "\n"
def BOF = "\\A".r
def EOF = "\\Z".r
def TXT = "[^\r\n]*".r
def TXTNOSP = "[^ \r\n]*".r
def SP = "\\s".r
def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
def varName: Parser[String] = SP ~> TXTNOSP
def varValue: Parser[String] = SP ~> TXT
def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
{case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
def parse(input: String) = parseAll(file, input)
}
val result = TreeParser.parse(input).get
result.foreach(println)
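The question also mentions eventually turning the parsed items into a tree. A minimal sketch of that step, assuming the parser yields the flat (level, name, value) triples shown above for one record. Node, TreeBuilder, siblings and build are hypothetical names introduced only for this illustration, and items that skip a level are silently dropped here.

```scala
// Hypothetical tree node; not part of the code in the question.
final case class Node(name: String, value: String, children: List[Node] = Nil)

object TreeBuilder {
  type Item = (Int, String, String)

  // Consumes the leading run of items that belong at `level` (together with
  // their deeper descendants) and returns the nodes plus the unconsumed rest.
  def siblings(items: List[Item], level: Int): (List[Node], List[Item]) =
    items match {
      case (l, name, value) :: tail if l == level =>
        val (children, afterChildren) = siblings(tail, level + 1)
        val (rest, remaining) = siblings(afterChildren, level)
        (Node(name, value, children) :: rest, remaining)
      case _ =>
        (Nil, items)
    }

  // The record header (level 0) becomes the root; level-1 items are its children.
  def build(root: String, items: List[Item]): Node =
    Node(root, "", siblings(items, 1)._1)
}
```

Calling TreeBuilder.build("fruit", items) on the items of one record nests each item under the nearest preceding item of the next lower level. The recursion is not tail-recursive, which seems acceptable for records of this size.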
Comments (2)
As Daniel said, it is better to let the parser handle whitespace skipping, which keeps your code minimal. However, you may want to tweak the whiteSpace value so that you can match line endings explicitly. I did that to prevent the parser from moving on to the next line when no value is defined for a record item. As far as possible, use the parsers already defined in JavaTokenParsers, such as ident, if you want to match alphabetic words. To ease error tracing, match the result of parseAll against NoSuccess so you can see at what point the parser failed.

Handling whitespace explicitly is not a particularly good idea. And, of course, calling get on the result means you lose the error message. In this particular example the message is actually pretty clear: the parser expected a space. The question is why it expected one. It was obviously processing the recordBegin rule, which is defined like this:
def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
So it parses the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:
def varName: Parser[String] = SP ~> TXTNOSP
Another space! So fruit would have to begin with a space.
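
The advice in both answers can be put together roughly as follows. This is a minimal sketch, not the answerers' own code: whiteSpace is narrowed to spaces and tabs so that line breaks stay significant, ident replaces the hand-written name parser, and the result of parseAll is matched against NoSuccess instead of calling get. RecordParser, Item, Record, lineEnd and parseRecords are hypothetical names. The sketch also assumes that item levels are 1 to 3 (0 being reserved for the record header), that names are valid Java identifiers, and that the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator._

// Illustrative sketch; all names below are assumptions, not from the answers.
object RecordParser extends JavaTokenParsers {
  type Item = (Int, String, String)
  type Record = (String, List[Item])

  // Skip spaces and tabs only, so line breaks are not swallowed as whitespace.
  override val whiteSpace = """[ \t]+""".r

  // A line ends with a line break or at the end of the input.
  def lineEnd: Parser[String] = "\r\n" | "\n" | """\z""".r

  // Assumption: items use levels 1-3; level 0 always starts a new record.
  def level: Parser[Int] = "[1-3]".r ^^ (_.toInt)

  // The optional value runs to the end of the current line.
  def varValue: Parser[String] = """[^\r\n]+""".r

  def recordBegin: Parser[String] = "0" ~> ident <~ lineEnd

  def recordItem: Parser[Item] =
    level ~ ident ~ opt(varValue) <~ lineEnd ^^ {
      case l ~ name ~ value => (l, name, value.getOrElse(""))
    }

  def record: Parser[Record] =
    recordBegin ~ rep(recordItem) ^^ { case name ~ items => (name, items) }

  def file: Parser[List[Record]] = rep(record)

  // Keep the failure message and position instead of losing them via get.
  def parseRecords(input: String): Either[String, List[Record]] =
    parseAll(file, input) match {
      case Success(records, _) => Right(records)
      case NoSuccess(msg, next) =>
        Left(s"line ${next.pos.line}, column ${next.pos.column}: $msg")
    }
}
```

RecordParser.parseRecords(input) then returns either the parsed records or a message with the position at which parsing stopped. Because line breaks are no longer skipped as whitespace, an item without a value cannot borrow text from the following line.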