Lua long strings in fslex

Posted on 2024-10-06 10:22:27


I've been working on a Lua fslex lexer in my spare time, using the ocamllex manual as a reference.

I hit a few snags while trying to tokenize long strings correctly. "Long strings" are delimited by '[' ('=')* '[' and ']' ('=')* ']' tokens; the number of = signs must be the same.
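To make the "same number of = signs" rule concrete, here is a tiny language-neutral sketch in Python (the helper name is invented for illustration, not part of the fslex grammar):

```python
# The "level" of a Lua long bracket is the number of '=' signs between
# the two square brackets. An opener is only closed by a closer of the
# same level.
def long_bracket_level(lexeme: str) -> int:
    """Count the '=' signs in an opening or closing long-bracket token."""
    return lexeme.count("=")

print(long_bracket_level("[["))    # 0 -- closed by ']]'
print(long_bracket_level("[==["))  # 2 -- closed only by ']==]'
```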

In the first implementation, the lexer seemed not to recognize [[ patterns, producing two LBRACKET tokens despite the longest-match rule, whereas [=[ and its variations were recognized correctly. In addition, the regular expression failed to ensure that the matching closing token is used, stopping at the first ']' ('=')* ']' capture regardless of the actual long string "level". Also, fslex does not seem to support "as" constructs in regular expressions.


let lualongstring =    '[' ('=')* '[' ( escapeseq | [^ '\\' '[' ] )* ']' ('=')* ']'

(* ... *)
    | lualongstring    { (* ... *) }
    | '['              { LBRACKET }
    | ']'              { RBRACKET }
(* ... *)


I've been trying to solve the issue with another rule in the lexer:


rule tokenize = parse
    (* ... *)
    | '[' ('=')* '['   { longstring (getLongStringLevel(lexeme lexbuf)) lexbuf }
    (* ... *)

and longstring level = parse 
    | ']' ('=')* ']'   { (* check level, do something *) }
    | _                { (* aggregate other chars *) }

    (* or *)

    | _    {
               let c = lexbuf.LexerChar(0);
               (* ... *)           
           }

But I'm stuck, for two reasons: first, I don't think I can "push", so to speak, a token to the next rule once I'm done reading the long string; second, I don't like the idea of reading char by char until the right closing token is found, making the current design useless.
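For what it's worth, the hand-off between the two rules can work by return value rather than by "pushing" a token: the sub-scanner consumes input until it sees a closer of the same level, then returns the finished string token itself. A rough Python sketch of that idea (function and token names are invented here):

```python
def scan_long_string(text: str, pos: int, level: int):
    """Scan text[pos:] up to ']' + '='*level + ']'; return (token, new_pos).

    Searching for the exact closer means a closer of a different level
    (e.g. ']=]' inside a level-2 string) is treated as plain content.
    """
    closing = "]" + "=" * level + "]"
    end = text.find(closing, pos)
    if end == -1:
        raise ValueError("Unexpected end of file in string.")
    return ("LUASTRING", text[pos:end]), end + len(closing)

# The driver calls this after matching an opener such as '[==[' (level 2):
token, pos = scan_long_string("[==[hello]==] rest", 4, 2)
print(token)  # ('LUASTRING', 'hello')
```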

How can I tokenize Lua long strings in fslex? Thanks for reading.


Comments (1)

回忆那么伤 2024-10-13 10:22:27


Apologies if I answer my own question, but I'd like to contribute my own solution to the problem for future reference.

I am keeping state across lexer function calls with the LexBuffer<_>.BufferLocalStore property, which is simply a writeable IDictionary instance.

Note: long brackets are used by both long strings and multiline comments. This is an often-overlooked part of the Lua grammar.



let beginlongbracket =    '[' ('=')* '['
let endlongbracket =      ']' ('=')* ']'

rule tokenize = parse
    | beginlongbracket 
    { longstring (longBracketLevel(lexeme lexbuf)) lexbuf }

(* ... *)

and longstring level = parse
    | endlongbracket 
    { (* only a closer of the same level ends the string; a closer of a
         different level is ordinary string content and must be kept *)
      if longBracketLevel(lexeme lexbuf) = level then 
          LUASTRING(endLongString(lexbuf)) 
      else 
          ((lexeme lexbuf) |> String.iter (fun ch -> toLongString lexbuf (string ch));
           longstring level lexbuf)
    }

    | _ 
    { toLongString lexbuf (lexeme lexbuf); longstring level lexbuf }

    | eof 
    { failwith "Unexpected end of file in string." }


Here are the functions I use to simplify storing data into the BufferLocalStore:

open System.Linq                      (* the Count extension method used below *)
open System.Text                      (* StringBuilder *)
open Microsoft.FSharp.Text.Lexing     (* LexBuffer, from the F# PowerPack fslex *)

(* the level of a long bracket is the number of '=' signs in the token *)
let longBracketLevel (str : string) =
    str.Count(fun c -> c = '=')

let createLongStringStorage (lexbuf : LexBuffer<_>) =
    let sb = new StringBuilder(1000)
    lexbuf.BufferLocalStore.["longstring"] <- box sb
    sb

(* append a lexeme to the accumulator, creating it on first use *)
let toLongString (lexbuf : LexBuffer<_>) (c : string) =
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let storage = if hasString then (sb :?> StringBuilder) else (createLongStringStorage lexbuf)
    storage.Append(c) |> ignore   (* append the whole lexeme, not just its first char *)

(* drain the accumulator and remove it from the BufferLocalStore *)
let endLongString (lexbuf : LexBuffer<_>) : string = 
    let hasString, sb = lexbuf.BufferLocalStore.TryGetValue("longstring")
    let ret = if not hasString then "" else (sb :?> StringBuilder).ToString()
    lexbuf.BufferLocalStore.Remove("longstring") |> ignore
    ret
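Since BufferLocalStore is essentially a per-buffer dictionary, the same accumulator pattern can be shown with a plain dict. This Python sketch mirrors toLongString/endLongString under that assumption (names are this sketch's, not the F# ones):

```python
from io import StringIO

def to_long_string(store: dict, lexeme: str) -> None:
    """Append one lexeme to the accumulator, creating it on first use."""
    store.setdefault("longstring", StringIO()).write(lexeme)

def end_long_string(store: dict) -> str:
    """Drain the accumulator and clear it from the store."""
    sb = store.pop("longstring", None)
    return sb.getvalue() if sb is not None else ""

store = {}
for ch in "hello":
    to_long_string(store, ch)
print(end_long_string(store))        # hello
print(repr(end_long_string(store)))  # '' -- the entry was cleared
```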

Perhaps it's not very functional, but it seems to be getting the job done.

  • use the tokenize rule until the beginning of a long bracket is found
  • switch to the longstring rule and loop until a closing long bracket of the same level is found
  • store every lexeme that does not match a closing long bracket of the same level into a StringBuilder, which is in turn stored into the LexBuffer BufferLocalStore.
  • once the longstring is over, clear the BufferLocalStore.
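The four steps above can be strung together in a small end-to-end sketch (plain Python with invented names; a regex stands in for the fslex pattern):

```python
import re

OPENER = re.compile(r"\[(=*)\[")  # mirrors beginlongbracket: '[' ('=')* '['

def read_long_string(text: str):
    """Return ('LUASTRING', contents) if text starts with a long bracket."""
    m = OPENER.match(text)
    if m is None:
        return None                      # step 1: no opener, stay in tokenize
    level = len(m.group(1))              # step 2: note the level
    closer = "]" + "=" * level + "]"
    end = text.find(closer, m.end())     # step 3: closers of other levels are content
    if end == -1:
        raise ValueError("Unexpected end of file in string.")
    return ("LUASTRING", text[m.end():end])  # step 4: emit and reset

print(read_long_string("[==[a ]=] b]==]"))  # ('LUASTRING', 'a ]=] b')
```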

Edit: You can find the project at http://ironlua.codeplex.com. Lexing and parsing should be okay. I am planning on using the DLR. Comments and constructive criticism welcome.
