秒差距 - 错误“组合符‘许多’”应用于接受空字符串的解析器”

发布于 2024-12-09 15:33:55 字数 1478 浏览 2 评论 0原文

我正在尝试使用 Parsec 编写一个解析器来解析有文字的 Haskell 文件,如下所示:

The classic 'Hello, world' program.

\begin{code}

main = putStrLn "Hello, world"

\end{code}

More text.

我写了以下内容,有点受到 RWH 中示例的启发:

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)


parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input



literateFile
    = many codeOrProse

codeOrProse
    = code <|> prose

code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content

prose
    = do content <- many anyChar
         return $ Text content

eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

我希望这会产生类似的结果of:(

[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

允许空白等)。

这编译得很好,但是当运行时,我收到错误:

*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

任何人都可以阐明这一点,并可能帮助提供解决方案吗?

I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:

The classic 'Hello, world' program.

\begin{code}

main = putStrLn "Hello, world"

\end{code}

More text.

I've written the following, sort-of-inspired by the examples in RWH:

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)


parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input



literateFile
    = many codeOrProse

codeOrProse
    = code <|> prose

code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content

prose
    = do content <- many anyChar
         return $ Text content

eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

Which I hoped would result in something along the lines of:

[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

(allowing for whitespace etc).

This compiles fine, but when run, I get the error:

*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

Can anyone shed any light on this, and possibly help with a solution please?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

心如狂蝶 2024-12-16 15:33:55

正如 sth 所指出的,many anyChar 是问题所在。但不仅在散文中,而且在代码中也是如此。 code 的问题是,content <- 许多anyChar 将消耗所有内容:换行符和 \end{code} 标记。

因此,您需要有某种方法来区分散文和代码。一个简单(但可能太天真)的方法是查找反斜杠:

literateFile = many codeOrProse <* eof

code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content

prose = do content <- many1 $ noneOf "\\"
           return $ Text content

现在,您还没有完全获得所需的结果,因为 Haskell 部分也将包含换行符,但您可以很容易过滤掉这些(给定一个函数 filterNewlines 你可以说 `content <- filterNewlines <$> (many $ noneOf "\\"))。

编辑

好的,我想我找到了一个解决方案(需要最新的 Parsec 版本,因为 lookAhead):

import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)    

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "" input

literateFile
    = many codeOrProse

codeOrProse = code <|> prose

code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c

prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t

untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'

untilP p 解析一行,然后检查如果 p 可以成功解析下一行的开头。如果是,则返回空字符串,否则继续。 lookAhead 是必需的,因为否则 begin\end-tags 将被消耗并且 code 无法识别它们。

我想它仍然可以变得更简洁(即不必在 code 内重复 string "\\end{code}\n" )。

As sth pointed out many anyChar is the problem. But not just in prose but also in code. The problem with code is, that content <- many anyChar will consume everything: The newlines and the \end{code} tag.

So, you need to have some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so, is to look for backslashes:

literateFile = many codeOrProse <* eof

code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content

prose = do content <- many1 $ noneOf "\\"
           return $ Text content

Now, you don't completely have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines you could say `content <- filterNewlines <$> (many $ noneOf "\\")).

Edit

Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):

import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)    

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "" input

literateFile
    = many codeOrProse

codeOrProse = code <|> prose

code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c

prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t

untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'

untilP p parses a line, then checks if the beginning of the next line can be successfully parsed by p. If so, it returns the empty string, otherwise it goes on. The lookAhead is needed, because otherwise the begin\end-tags would be consumed and code couldn't recognize them.

I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).

烟花易冷人易散 2024-12-16 15:33:55

我还没有测试过,但是:

  • many anyChar 可以匹配空字符串
  • 因此 prose 可以匹配空字符串
  • 因此 codeOrProse 可以匹配空字符串string
  • 因此,literateFile 可以永远循环,匹配无限多个空字符串

更改 prose 以匹配 Many1 字符可能会解决此问题。

(我对 Parsec 不太熟悉,但是 prose 如何知道它应该匹配多少 个字符?它可能会消耗整个输入,而不会给出 code 解析器有第二次机会寻找新代码段的开头,或者它可能在每次调用中只匹配一个字符,从而使 many/many1 中。没啥用。)

I haven't tested it, but:

  • many anyChar can match an empty string
  • Therefore prose can match an empty string
  • Therefore codeOrProse can match an empty string
  • Therefore literateFile can loop forever, matching infinitely many empty strings

Changing prose to match many1 characters might fix this problem.

(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)

街道布景 2024-12-16 15:33:55

作为参考,这是我提出的另一个版本(稍微扩展以处理其他情况):

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "test.tex"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    | Section String
    deriving (Show)

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input

literateFile
    = do es <- many elements
         eof
         return es

elements
    = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose

code
    = do string "\\begin{code}"
         c <- anyChar `manyTill` try (string "\\end{code}")
         return $ Haskell c

quotedBackslash
    = do string "\\\\"
         return $ Text "\\\\"

prose
    = do t <- many1 (noneOf "\\")
         return $ Text t

section
    = do string "\\section{"
         content <- many1 (noneOf "}")
         char '}'
         return $ Section content

For reference, here's another version I came up with (slightly expanded to handle other cases):

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "test.tex"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    | Section String
    deriving (Show)

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input

literateFile
    = do es <- many elements
         eof
         return es

elements
    = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose

code
    = do string "\\begin{code}"
         c <- anyChar `manyTill` try (string "\\end{code}")
         return $ Haskell c

quotedBackslash
    = do string "\\\\"
         return $ Text "\\\\"

prose
    = do t <- many1 (noneOf "\\")
         return $ Text t

section
    = do string "\\section{"
         content <- many1 (noneOf "}")
         char '}'
         return $ Section content
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文