Scala: parser combinators for tree-structured data

Posted 2024-11-10 16:34:46, 1578 characters, 3 views, 0 comments

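How can parsers be used to parse records that span multiple lines? I need to parse tree data (and eventually transform it into a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but it's not clear whether this is even the best approach with Scala parsers. The question is really more about the problem-solving approach than about debugging the existing code.

The EBNF-ish grammar is: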

SP          = " "
CRLF        = "\r\n"
level       = "0" | "1" | "2" | "3"
varName     = {alphanum}
varValue    = {alphanum}
recordBegin = "0", varName
recordItem  = level, varName, [varValue]
record      = recordBegin, {recordItem}
file        = {record}

An attempt to implement and test the grammar:

import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val skipWhitespace = false
  def CRLF = "\r\n" | "\n"
  def BOF = "\\A".r
  def EOF = "\\Z".r
  def TXT = "[^\r\n]*".r
  def TXTNOSP = "[^ \r\n]*".r
  def SP = "\\s".r
  def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
  def varName: Parser[String] = SP ~> TXTNOSP
  def varValue: Parser[String] = SP ~> TXT
  def recordBegin: Parser[Any] =  "0" ~ SP ~ varName ~ CRLF
  def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
    {case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
  def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
  def parse(input: String) = parseAll(file, input)
}

val result = TreeParser.parse(input).get
result.foreach(println)

仅一夜美梦 2024-11-17 16:34:46

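As Daniel said, you are better off letting the parser handle whitespace skipping to minimize your code. However, you may want to tweak the whiteSpace value so that you can match ends of lines explicitly. I did that below to prevent the parser from moving on to the next line when no value is defined for a record item.

As much as possible, try to use the parsers defined in JavaTokenParsers, such as ident, if you want to match identifier-like names.

To ease error tracing, perform a NoSuccess match on the result of parseAll so you can see at what point the parser failed.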

import util.parsing.combinator._

val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val whiteSpace = """[ \t]+""".r

  val level = """[1-3]{1}""".r

  val value = """[a-zA-Z0-9_, ]*""".r
  val eol = """(\r?\n)+""".r  // one or more line endings; the original character class also matched a literal '?'

  def recordBegin = "0" ~ ident <~ eol

  def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
    case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
  }

  def record = recordBegin ~> rep1(recordItem)

  def file = rep1(record)

  def parse(input: String) = parseAll(file, input) match {
    case Success(result, _) => result
    case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
  }
}

val result = TreeParser.parse(input)
result.foreach(println)
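
The question also asks about eventually turning the parsed data into a tree. As a rough sketch only, the flat (level, name, value) tuples that the parser above yields for one record could be nested by level as shown here. The names Node and TreeBuilder are hypothetical, not part of the answer or of any library, and the sketch assumes that levels start at 1 and never skip a level; items that violate this are silently dropped.

```scala
// Hypothetical tree node: a name, an optional-as-empty value, and nested children.
final case class Node(name: String, value: String, children: List[Node])

object TreeBuilder {
  type Item = (Int, String, String) // (level, name, value), as produced by recordItem

  // Nests the items of ONE record by level, preserving document order.
  // Assumption: levels start at 1 and increase by at most 1 from one item to the next.
  def build(items: List[Item]): List[Node] = {
    // Collects all consecutive siblings at `level`, returning them and the unconsumed rest.
    def siblings(level: Int, rest: List[Item]): (List[Node], List[Item]) =
      rest match {
        case (l, name, value) :: tail if l == level =>
          val (kids, afterKids) = siblings(level + 1, tail)   // deeper items become children
          val (more, remaining) = siblings(level, afterKids)  // then continue at this level
          (Node(name, value, kids) :: more, remaining)
        case _ =>
          (Nil, rest) // end of input, or an item belonging to a shallower level
      }
    siblings(1, items)._1
  }
}
```

With the parser above, something like result.map(TreeBuilder.build) would give one list of top-level nodes per record. The recursion is not tail-recursive, which should be fine for small files but is worth revisiting for very deep or very long records.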

水溶 2024-11-17 16:34:46

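Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example, the message is:

[1.3] failure: string matching regex `\s' expected but `f' found

0 fruit
  ^

That is actually pretty clear, though the question is why it expected a space. This was obviously processing the recordBegin rule, which is defined as:

"0" ~ SP ~ varName ~ CRLF

So it parses the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:

SP ~> TXTNOSP

Another space! So fruit would have had to begin with a space.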
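
To make the diagnosis concrete, here is a minimal sketch of the question's parser with the double-space problem removed. It is an assumption about what a fix could look like, not the original author's code: it keeps explicit whitespace handling (which this answer advises against), moves the space out of varName, restricts item levels to 1-3 so that a "0" line always starts a new record, and makes the line ending optional so the last line parses. The name FixedTreeParser is hypothetical.

```scala
import scala.util.parsing.combinator._

object FixedTreeParser extends RegexParsers {
  type Item = (Int, String, String) // (level, name, value)

  override val skipWhitespace = false // whitespace is matched explicitly, as in the question

  def eol: Parser[String] = "\r\n" | "\n"
  def sp: Parser[String] = " "
  def level: Parser[Int] = "[1-3]".r ^^ (_.toInt)   // "0" is reserved for recordBegin
  def varName: Parser[String] = "[^ \r\n]+".r       // no leading space consumed here
  def varValue: Parser[String] = "[^\r\n]*".r       // rest of the line

  // The single separating space is consumed exactly once, by the rule that uses varName.
  def recordBegin: Parser[String] = "0" ~> sp ~> varName <~ opt(eol)

  def recordItem: Parser[Item] =
    level ~ (sp ~> varName) ~ opt(sp ~> varValue) <~ opt(eol) ^^ {
      case l ~ n ~ v => (l, n, v.getOrElse(""))
    }

  def record: Parser[List[Item]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[Item]]] = rep1(record)

  // Returns the ParseResult itself so a caller can inspect or print a failure
  // (including its position) instead of losing it through get.
  def parse(input: String): ParseResult[List[List[Item]]] = parseAll(file, input)
}
```

Printing the value returned by FixedTreeParser.parse(input), rather than calling get on it, keeps the position and message available when parsing fails.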
