Lua pattern matching for extracting hard-coded strings from a code base
I'm working with a C++ code base. Right now I have C++ code calling a Lua script that looks through the entire code base and, hopefully, returns a list of all the strings used in the program.
The strings in question are always wrapped in a JUCE macro called TRANS. Here are some examples from which a string should be extracted:
TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")
I'm sure you can imagine other string variants that could occur in a large code base. I'm building an automatic tool that generates JUCE translation files, to automate the process as much as possible.
Here is how far I've gotten with pattern matching to find these strings. I've read the source code into a Lua string
path = ...
--Open file and read source into string
file = io.open(path, "r")
str = file:read("*all")
and called
for word in string.gmatch(str, 'TRANS%s*%b()') do print(word) end
which finds text that starts with TRANS followed by balanced parentheses. This gets me the full macro call, including the parentheses, and from there I figured it would be easy to trim what I don't need and keep just the string value.
However, this doesn't work for strings that unbalance the parentheses.
e.g. TRANS(")")
will return TRANS(")
instead of TRANS(")")
I revised my pattern to
for word in string.gmatch(str, 'TRANS%s*(%s*%b""%s*') do print(word) end
where the pattern should start with TRANS, then zero or more spaces, then a ( character followed by zero or more spaces. Once inside the parentheses, there should be a balanced pair of " marks, followed by zero or more spaces, and finally a closing ). Unfortunately, this does not return any matches. And even if it worked as I expected, there can be a \"
inside the string, which would throw the balance off again.
Any advice on extracting these strings? Should I keep trying to find a pattern that works, or should I write a direct algorithm? Do you know why my second pattern returned no strings? Any other advice is welcome. I'm not looking to cover 100% of all possibilities, but being close to 100% would be awesome. Thanks! :D
Comments (2)
I love Lua patterns as much as anyone, but you're bringing a knife to a gun fight. This is one of those problems where you really don't want to code the solution as regular expressions. To deal correctly with double-quote marks and backslash escapes, you want a real parser, and LPeg will manage your needs nicely.
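To illustrate what "a real parser" has to do here, the following is a minimal hand-written scanner in plain Lua. It is a sketch, not the LPeg grammar the answer has in mind. The function names read_literal and extract_trans are made up for this example. It assumes that adjacent literals should be concatenated, that a backslash followed by a newline is a line continuation to be dropped, and that every other escape sequence is kept verbatim rather than decoded. It does not skip C++ comments or handle raw string literals.

```lua
-- Read one double-quoted literal starting at index i (which must hold '"').
-- Returns the literal's contents and the index just past the closing quote,
-- or nil if the literal is never closed.
local function read_literal(src, i)
  local buf = {}
  i = i + 1
  while i <= #src do
    local c = src:sub(i, i)
    if c == "\\" then
      local nxt = src:sub(i + 1, i + 1)
      -- Backslash-newline is a line continuation: drop both characters.
      -- Any other escape (\" \\ \n ...) is kept as written, not decoded.
      if nxt ~= "\n" then buf[#buf + 1] = c .. nxt end
      i = i + 2
    elseif c == '"' then
      return table.concat(buf), i + 1
    else
      buf[#buf + 1] = c
      i = i + 1
    end
  end
  return nil
end

-- Collect the string argument of every TRANS(...) call in src.
local function extract_trans(src)
  local results = {}
  local pos = 1
  while true do
    -- "TRANS", optional whitespace, "(", optional whitespace.
    local s, e = src:find("TRANS%s*%(%s*", pos)
    if not s then break end
    pos = e + 1
    -- Ignore identifiers that merely end in TRANS (e.g. NOTRANS).
    local is_suffix = s > 1 and src:sub(s - 1, s - 1):find("[%w_]") ~= nil
    if not is_suffix then
      local parts, i = {}, e + 1
      -- One or more adjacent literals: "a" "b" means "ab" in C++.
      while src:sub(i, i) == '"' do
        local text, nxt = read_literal(src, i)
        if not text then parts = {}; break end
        parts[#parts + 1] = text
        -- Skip whitespace before the next literal or the closing ")".
        i = src:find("%S", nxt) or (#src + 1)
      end
      if #parts > 0 and src:sub(i, i) == ")" then
        results[#results + 1] = table.concat(parts)
        pos = i + 1
      end
    end
  end
  return results
end

local sample = [[
TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")
TRANS("say \"hi\"")
]]

for _, s in ipairs(extract_trans(sample)) do print("[" .. s .. "]") end
```

Because the scanner walks each literal character by character, a ) or an escaped \" inside the string cannot confuse it, which is where the %b-based patterns break down.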
In the second case, you forgot to escape the parentheses. Try escaping them as %( and %).
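A minimal sketch of the second pattern with the parentheses escaped, assuming the intent described in the question. The extract helper and the capture around %b"" are additions for this example, so that only the quoted literal is returned. This version still gives up on escaped quotes (\") and on adjacent literals such as "multiple"" ""quotations".

```lua
-- %( and %) match literal parentheses; the capture returns only the
-- quoted literal that %b"" matched.
local pattern = 'TRANS%s*%(%s*(%b"")%s*%)'

local function extract(source)
  local found = {}
  for literal in string.gmatch(source, pattern) do
    -- The capture still includes the surrounding quotes; strip them.
    found[#found + 1] = literal:sub(2, -2)
  end
  return found
end

local sample = [[
TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS(")")
]]

for _, s in ipairs(extract(sample)) do print("[" .. s .. "]") end
```

With both characters of %b"" being the same, the match simply runs from one " to the next, so a ) inside the string is no longer a problem.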