使用 Parsec 解析正则表达式

发布于 2024-12-29 12:27:23 字数 594 浏览 0 评论 0原文

我正在尝试通过实现一个小型正则表达式解析器来学习秒差距。在 BNF 中，我的语法看起来像：

EXP  : EXP *
     | LIT EXP
     | LIT

我尝试在 Haskell 中实现这一点：

expr = try star
       <|> try litE
       <|> lit

litE  = do c <- noneOf "*"
           rest <- expr
           return (c : rest)

lit   = do c <- noneOf "*"
           return [c]

star = do content <- expr
          char '*'
          return (content ++ "*")

虽然这里有一些无限循环（例如 expr -> star -> expr 而不消耗任何标记），这使得解析器永远循环。我不太确定如何修复它，因为 star 的本质是它在最后消耗其强制令牌。

有什么想法吗？

原文

I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:

EXP  : EXP *
     | LIT EXP
     | LIT

I've tried to implement this in Haskell as:

expr = try star
       <|> try litE
       <|> lit

litE  = do c <- noneOf "*"
           rest <- expr
           return (c : rest)

lit   = do c <- noneOf "*"
           return [c]

star = do content <- expr
          char '*'
          return (content ++ "*")

There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.

Any thoughts?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

橘虞初梦 2025-01-05 12:27:23

您应该使用Parsec.Expr.buildExprParser；它非常适合此目的。您只需描述您的运算符、它们的优先级和结合性，以及如何解析原子，组合器就会为您构建解析器！

您可能还想添加用括号对术语进行分组的功能，以便您可以将 * 应用于多个文字。

这是我的尝试（为了更好的衡量，我加入了 |、+ 和 ?）：

import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr

data Term = Literal Char
          | Sequence [Term]
          | Repeat (Int, Maybe Int) Term
          | Choice [Term]
  deriving ( Show )

term :: Parser Term
term = buildExpressionParser ops atom where

  ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
          , Postfix (Repeat (1, Nothing) <$ char '+')
          , Postfix (Repeat (0, Just 1)  <$ char '?')
          ]
        , [ Infix (return sequence) AssocRight
          ]
        , [ Infix (choice <$ char '|') AssocRight
          ]
        ]

  atom = msum [ Literal <gt; lit
              , parens term
              ]

  lit = noneOf "*+?|()"
  sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
  choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
  parens = between (char '(') (char ')')

  seqTerms (Sequence ts) = ts
  seqTerms t = [t]

  choiceTerms (Choice ts) = ts
  choiceTerms t = [t]

main = parseTest term "he(llo)*|wor+ld?"

You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!

You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.

Here's my attempt (I threw in |, +, and ? for good measure):

import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr

data Term = Literal Char
          | Sequence [Term]
          | Repeat (Int, Maybe Int) Term
          | Choice [Term]
  deriving ( Show )

term :: Parser Term
term = buildExpressionParser ops atom where

  ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
          , Postfix (Repeat (1, Nothing) <$ char '+')
          , Postfix (Repeat (0, Just 1)  <$ char '?')
          ]
        , [ Infix (return sequence) AssocRight
          ]
        , [ Infix (choice <$ char '|') AssocRight
          ]
        ]

  atom = msum [ Literal <gt; lit
              , parens term
              ]

  lit = noneOf "*+?|()"
  sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
  choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
  parens = between (char '(') (char ')')

  seqTerms (Sequence ts) = ts
  seqTerms t = [t]

  choiceTerms (Choice ts) = ts
  choiceTerms t = [t]

main = parseTest term "he(llo)*|wor+ld?"

回复收藏 0 原文

眉黛浅 2025-01-05 12:27:23

您的语法是左递归的，这与 try 配合不佳，因为 Parsec 会反复回溯。有几种方法可以解决这个问题。也许最简单的方法就是在另一条规则中使 * 可选：

lit :: Parser (Char, Maybe Char)
lit = do
  c <- noneOf "*"
  s <- optionMaybe $ char '*'
  return (c, s)

当然，无论如何，您最终可能会将内容包装在数据类型中，并且有很多方法可以实现这一点。这是我突然想到的一个：

import Control.Applicative ((<gt;))

data Term = Literal Char
          | Sequence [Term]
          | Star Term

expr :: Parser Term
expr = Sequence <gt; many term

term :: Parser Term
term = do
  c <- lit
  s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
  return $ if isNothing s
    then Literal c
    else Star $ Literal c

也许更有经验的 Haskeller 会提出更好的解决方案。

Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:

lit :: Parser (Char, Maybe Char)
lit = do
  c <- noneOf "*"
  s <- optionMaybe $ char '*'
  return (c, s)

Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:

import Control.Applicative ((<gt;))

data Term = Literal Char
          | Sequence [Term]
          | Star Term

expr :: Parser Term
expr = Sequence <gt; many term

term :: Parser Term
term = do
  c <- lit
  s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
  return $ if isNothing s
    then Literal c
    else Star $ Literal c

Maybe a more experienced Haskeller will come along with a better solution.

回复收藏 0 原文

~没有更多了~