将标记列表解析为表达式树

发布于 2024-10-26 23:06:21 字数 666 浏览 4 评论 0原文

我想解析典型 Haskell 源代码中的表达式。我得到一个输入流,它已经被标记化并带有固定性和优先级注释。运算符集在编译时未知并且可以是任意的。输出应该是表示表达式的树。以下是我尝试过的一些内容:

-- A single token of the input stream
data Token a
  = Prefix a
  | Infix a Int Fixity -- The Int parameter represents the precedence
  | LBrace
  | RBrace
  deriving (Show,Eq)

data Fixity
  = InfixL
  | InfixR
  | InfixC
  deriving (Show,Eq)

data Expression a
  = Atom a
  | Apply a a
  deriving Show

-- Wrapped into either, if expression is malformed.
exprToTree :: [Token a] -> Either String (Expression a)
exprToTree = undefined

为了简单起见,我并不将 lambda 视为特殊的,它们只是原子。

但现在,我完全迷失了。如何将原子流转换为树?有人可以向我指出一种算法或帮助我找到一种算法吗?

I want to parse expressions like those in typical Haskell source. I get an input stream, which is already tokenized and annotated with fixity and precedence. The set of operators is not known at compile time and may be arbitrary. The output should be a tree representing the expression. Here's a bit of what I tried:

-- A single token of the input stream
data Token a
  = Prefix a
  | Infix a Int Fixity -- The Int parameter represents the precedence
  | LBrace
  | RBrace
  deriving (Show,Eq)

data Fixity
  = InfixL
  | InfixR
  | InfixC
  deriving (Show,Eq)

data Expression a
  = Atom a
  | Apply a a
  deriving Show

-- Wrapped into either, if expression is malformed.
exprToTree :: [Token a] -> Either String (Expression a)
exprToTree = undefined

For the sake of simpleness, I don't treat lambdas special, they are just atoms.

But now, I am completely lost. How can I convert the stream of atoms into a tree? Can somebody please point me to an algorithm or help me with finding one.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

や莫失莫忘 2024-11-02 23:06:21

简而言之,即使您有一个标记列表,您仍然需要一个解析器。

Parsec 可以处理替代令牌流,但您可能必须参考手册 - Daan Leijen 的“legacy”主页上提供的 PDF - http://legacy.cs.uu.nl/daan/download/parsec/parsec.pdf。您可以在不使用组合器库的情况下推出自己的解析器,但您将重新实现 Parsec 的某些部分。据我记得 UU_parsing 期望与单独的扫描仪一起使用,因此它是另一种选择。

尽管它不处理解析,但您可能会发现 Lennart Augustsson 的“Lambda 演算四种方法”对其他事情很有帮助 - http://www.augustsson.net/Darcs/Lambda/top.pdf

编辑 - 这是一个部分制定的计划,说明如何使用秒差距来实现它,有关详细信息,您必须查阅手册的第 2.11 节。

假设您有具体“内部”标记的数据类型:

data InternalTok = Ident String
                 | BinOpPlus
                 | BinOpMinus
                 | UnaryOpNegate
                 | IntLiteral Int
  deriving (Show)

然后您最终获得 Parsec 标记的这些类型并解析:

type MyToken = Token InternalTok

type MyParser a = GenParser MyToken () a

根据 Parsec 手册定义一个辅助函数 - 该函数处理 show 和 pos,因此各个定义更短 cf.第 19 页的 mytoken 函数。

mytoken :: (MyToken -> Maybe a) -> MyParser a
mytoken test = token showToken posToken testToken
  where
    showToken tok = show tok
    posToken tok = no_pos
    testToken tok = test tok

目前您的令牌类型不跟踪源位置,因此:

no_pos :: SourcePos
no_pos = newPos "" 0 0 0 

对于每个终端,您必须定义一个令牌函数:

identifier :: MyParser MyToken
identifier =  mytoken (\tok -> case tok of
                         a@(Prefix (Ident _)) -> Just a
                         _                    -> Nothing)

intLiteral :: MyParser MyToken
intLiteral =  mytoken (\tok -> case tok of
                         a@(Prefix (IntLiteral _)) -> Just a
                         _                         -> Nothing)

binPlus :: MyParser MyToken
binPlus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpPlus _ _) -> Just a
                      _                       -> Nothing)


binMinus :: MyParser MyToken
binMinus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpMinus _ _) -> Just a
                      _                        -> Nothing)

unaryNegate :: MyParser MyToken
unaryNegate =  mytoken (\tok -> case tok of
                        a@(Prefix UnaryNegate _ _) -> Just a
                        _                          -> Nothing)

编辑 - 来处理自定义中缀运算符,您将需要这些标记解析器:

tokInfixL :: Int -> MyParser MyToken
tokInfixL n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixL) | i == n -> Just a
    _                             -> Nothing)


tokInfixR :: Int -> MyParser MyToken
tokInfixR n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixR) | i == n -> Just a
    _                             -> Nothing)

tokInfixC :: Int -> MyParser MyToken
tokInfixC n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixC) | i == n -> Just a
    _                             -> Nothing)


tokPrefix :: MyParser MyToken
tokPrefix = mytoken (\tok -> case tok of
                       a@(Prefix _) -> Just a
                       _            -> Nothing)

现在您可以定义解析器 - 您需要事先修复优先级的数量,没有办法绕过这个事实,因为您需要为每个级别编写一个解析器。

顶级表达式解析只是调用最高优先级解析器。

pExpression :: Parser Expersion
pExpression = expression10

对于每个优先级,您需要一个大致像这样的解析器,您必须自己计算出非关联。另外,您可能需要在 chainl / chainr 上做一些工作 - 我只用 UU_Parsing 编写了这种风格的解析器,它对于 Parsec 可能略有不同。注意“应用”通常处于优先级最高级别。

expression10 :: Parser Expression
expression10 = 
        Apply   <
gt; identifier <*> pExpression
    <|> Prefix  <
gt; tokPrefix  <*> pExpression
    <|> chainl (Infix <
gt; tokInfixL 10) expression9
    <|> chainr (Infix <
gt; tokInfixR 10) expression9

expression9 :: Parser Expression
expression9 = 
        Prefix  <
gt; tokPrefix  <*> pExpression
    <|> chainl (Infix <
gt; tokInfixL 9) expression8
    <|> chainr (Infix <
gt; tokInfixR 9) expression8

...

您必须扩展您的语法来处理优先级为 0 级的 IntLiterals 和标识符:

expression0 :: Parser Expression
expression0 = 
        IntLit  <
gt; intLiteral 
    <|> Ident   <
gt; identifier
    <|> ...

编辑 - 无限优先级 - 也许如果您只有应用程序和 Atom 也许这样的东西会起作用。请注意,您必须更改 tokInfixL 和 tokInfixR 解析器以不再匹配关联级别,并且您可能必须尝试替代方案的顺序。

expression :: Parser Expression
expression = 
        Apply   <
gt; identifier <*> expression
    <|> Prefix  <
gt; tokPrefix  <*> expression
    <|> chainl (Infix <
gt; tokInfixL) expression
    <|> chainr (Infix <
gt; tokInfixR) expression
    <|> intLiteral
    <|> identifier

intLiteral :: Parser Expression
intLiteral = Atom . convert <
gt; intLiteral
  where
    convert = ??

identifier :: Parser Expression
identifier = Atom . convert <
gt; intLiteral
  where
    convert = ??

In a nutshell then, even though you have a token list you still need a parser.

Parsec can handle alternative token streams, but you'll probably have to refer to the manual - a PDF available at Daan Leijen's "legacy" home page - http://legacy.cs.uu.nl/daan/download/parsec/parsec.pdf. You can roll your own parser without using a combinator library but you will be re-implementing some fraction of Parsec. As far as I remember UU_parsing expects to work with a separate scanner so its another option.

Although it doesn't handle parsing you might find Lennart Augustsson's "Lambda calculus cooked four ways" helpful for other things - http://www.augustsson.net/Darcs/Lambda/top.pdf

Edit - here is a partly worked out plan of how you can go about it with Parsec, for details you'll have to consult section 2.11 of the manual.

Suppose you have this data type for concrete "internal" tokens:

data InternalTok = Ident String
                 | BinOpPlus
                 | BinOpMinus
                 | UnaryOpNegate
                 | IntLiteral Int
  deriving (Show)

Then you end get these types for the Parsec token and parse:

type MyToken = Token InternalTok

type MyParser a = GenParser MyToken () a

Define a helper function as per the Parsec manual - this handles show and pos so individual definitions are shorter cf. the mytoken function on page 19.

mytoken :: (MyToken -> Maybe a) -> MyParser a
mytoken test = token showToken posToken testToken
  where
    showToken tok = show tok
    posToken tok = no_pos
    testToken tok = test tok

For the moment your token type does not track source position, so:

no_pos :: SourcePos
no_pos = newPos "" 0 0 0 

For each terminal you have to define a token function:

identifier :: MyParser MyToken
identifier =  mytoken (\tok -> case tok of
                         a@(Prefix (Ident _)) -> Just a
                         _                    -> Nothing)

intLiteral :: MyParser MyToken
intLiteral =  mytoken (\tok -> case tok of
                         a@(Prefix (IntLiteral _)) -> Just a
                         _                         -> Nothing)

binPlus :: MyParser MyToken
binPlus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpPlus _ _) -> Just a
                      _                       -> Nothing)


binMinus :: MyParser MyToken
binMinus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpMinus _ _) -> Just a
                      _                        -> Nothing)

unaryNegate :: MyParser MyToken
unaryNegate =  mytoken (\tok -> case tok of
                        a@(Prefix UnaryNegate _ _) -> Just a
                        _                          -> Nothing)

Edit - to handle custom infix operators you'll need these token parsers:

tokInfixL :: Int -> MyParser MyToken
tokInfixL n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixL) | i == n -> Just a
    _                             -> Nothing)


tokInfixR :: Int -> MyParser MyToken
tokInfixR n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixR) | i == n -> Just a
    _                             -> Nothing)

tokInfixC :: Int -> MyParser MyToken
tokInfixC n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixC) | i == n -> Just a
    _                             -> Nothing)


tokPrefix :: MyParser MyToken
tokPrefix = mytoken (\tok -> case tok of
                       a@(Prefix _) -> Just a
                       _            -> Nothing)

Now you can define the parser - you need to fix the number of levels of precedence beforehand, there is no way around that fact as you need to code a parser for each level.

The top-level expression parse is simply calls the highest precedence parser

pExpression :: Parser Expersion
pExpression = expression10

For each precendence level you need a parser roughly like this, you'll have to work out non-assoc for yourself. Also you might need to do some work on chainl / chainr - I've only written a parser in this style with UU_Parsing it might be slightly different for Parsec. Note Apply is usually at the precedence highest level.

expression10 :: Parser Expression
expression10 = 
        Apply   <
gt; identifier <*> pExpression
    <|> Prefix  <
gt; tokPrefix  <*> pExpression
    <|> chainl (Infix <
gt; tokInfixL 10) expression9
    <|> chainr (Infix <
gt; tokInfixR 10) expression9

expression9 :: Parser Expression
expression9 = 
        Prefix  <
gt; tokPrefix  <*> pExpression
    <|> chainl (Infix <
gt; tokInfixL 9) expression8
    <|> chainr (Infix <
gt; tokInfixR 9) expression8

...

You'll have to extend your syntax to handle IntLiterals and Identifiers which are at level 0 in precedence:

expression0 :: Parser Expression
expression0 = 
        IntLit  <
gt; intLiteral 
    <|> Ident   <
gt; identifier
    <|> ...

Edit - for unlimited precedence - maybe if you only have application and Atom maybe something like this would work. Note you'll have to change the tokInfixL and tokInfixR parsers to no longer match assoc-level and you may have to experiment with the order of alternatives.

expression :: Parser Expression
expression = 
        Apply   <
gt; identifier <*> expression
    <|> Prefix  <
gt; tokPrefix  <*> expression
    <|> chainl (Infix <
gt; tokInfixL) expression
    <|> chainr (Infix <
gt; tokInfixR) expression
    <|> intLiteral
    <|> identifier

intLiteral :: Parser Expression
intLiteral = Atom . convert <
gt; intLiteral
  where
    convert = ??

identifier :: Parser Expression
identifier = Atom . convert <
gt; intLiteral
  where
    convert = ??
梦屿孤独相伴 2024-11-02 23:06:21

在网上搜索另一个主题后,我发现这段很好的代码可以完全满足我的要求。看看:(

data Op     = Op String Prec Fixity          deriving Eq
data Fixity = Leftfix | Rightfix | Nonfix    deriving Eq
data Exp    = Var Var | OpApp Exp Op Exp     deriving Eq
type Prec   = Int
type Var    = String

data Tok = TVar Var | TOp Op

parse :: [Tok] -> Exp
parse (TVar x : rest) = fst (parse1 (Var x) (-1) Nonfix rest)

parse1 :: Exp -> Int -> Fixity -> [Tok] -> (Exp, [Tok])
parse1 e p f [] = (e, [])
parse1 e p f inp@(TOp op@(Op _ p' f') : TVar x : rest) 
  | p' == p && (f /= f' || f == Nonfix)
  = error "ambiguous infix expression"
  | p' < p  ||  p' == p && (f == Leftfix || f' == Nonfix)
  = (e, inp)
  | otherwise
  = let (r,rest') = parse1 (Var x) p' f' rest in
    parse1 (OpApp e op r) p f rest'

-- Printing

instance Show Exp where
  showsPrec _ (Var x) = showString x
  showsPrec p e@(OpApp l (Op op _ _) r) = 
        showParen (p > 0) $ showsPrec 9 l . showString op . showsPrec 9 r

-- Testing

plus   = TOp (Op "+" 6 Leftfix)
times  = TOp (Op "*" 7 Leftfix)
divide = TOp (Op "/" 7 Leftfix)
gt     = TOp (Op ">" 4 Nonfix)
ex     = TOp (Op "^" 8 Rightfix)

lookupop '+' = plus
lookupop '*' = times
lookupop '/' = divide
lookupop '>' = gt
lookupop '^' = ex

fromstr [x]     = [TVar [x]]
fromstr (x:y:z) = TVar [x] : lookupop y : fromstr z

test1 = fromstr "a+b+c"
test2 = fromstr "a+b+c*d"
test3 = fromstr "a/b/c"
test4 = fromstr "a/b+c"
test5 = fromstr "a/b*c"
test6 = fromstr "1^2^3+4"
test7 = fromstr "a/1^2^3"
test8 = fromstr "a*b/c"

我从这个页面获取它: http ://hackage.haskell.org/trac/haskell-prime/attachment/wiki/FixityResolution/resolve.hs

After searching the web for another topic, I found this nice piece of code to do exactly what I want. Have a look:

data Op     = Op String Prec Fixity          deriving Eq
data Fixity = Leftfix | Rightfix | Nonfix    deriving Eq
data Exp    = Var Var | OpApp Exp Op Exp     deriving Eq
type Prec   = Int
type Var    = String

data Tok = TVar Var | TOp Op

parse :: [Tok] -> Exp
parse (TVar x : rest) = fst (parse1 (Var x) (-1) Nonfix rest)

parse1 :: Exp -> Int -> Fixity -> [Tok] -> (Exp, [Tok])
parse1 e p f [] = (e, [])
parse1 e p f inp@(TOp op@(Op _ p' f') : TVar x : rest) 
  | p' == p && (f /= f' || f == Nonfix)
  = error "ambiguous infix expression"
  | p' < p  ||  p' == p && (f == Leftfix || f' == Nonfix)
  = (e, inp)
  | otherwise
  = let (r,rest') = parse1 (Var x) p' f' rest in
    parse1 (OpApp e op r) p f rest'

-- Printing

instance Show Exp where
  showsPrec _ (Var x) = showString x
  showsPrec p e@(OpApp l (Op op _ _) r) = 
        showParen (p > 0) $ showsPrec 9 l . showString op . showsPrec 9 r

-- Testing

plus   = TOp (Op "+" 6 Leftfix)
times  = TOp (Op "*" 7 Leftfix)
divide = TOp (Op "/" 7 Leftfix)
gt     = TOp (Op ">" 4 Nonfix)
ex     = TOp (Op "^" 8 Rightfix)

lookupop '+' = plus
lookupop '*' = times
lookupop '/' = divide
lookupop '>' = gt
lookupop '^' = ex

fromstr [x]     = [TVar [x]]
fromstr (x:y:z) = TVar [x] : lookupop y : fromstr z

test1 = fromstr "a+b+c"
test2 = fromstr "a+b+c*d"
test3 = fromstr "a/b/c"
test4 = fromstr "a/b+c"
test5 = fromstr "a/b*c"
test6 = fromstr "1^2^3+4"
test7 = fromstr "a/1^2^3"
test8 = fromstr "a*b/c"

(I took it from this page: http://hackage.haskell.org/trac/haskell-prime/attachment/wiki/FixityResolution/resolve.hs)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文