轻量正则表达式优化
我有一个正则表达式,它是计算机程序的输出。它具有
(((2)|(9)))*
人类无疑会写成的
[29]*
东西,所以我想要一个可以进行简单转换的程序,使正则表达式更具可读性。到目前为止,我一直在使用一个快速脚本
$r =~ s/\(([0-9])\)/$1/g;
$r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g;
$r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g;
$r =~ s/\((\[[0-9]+\]\*)\)/$1/g;
$r =~ s/\|\(([^()]+)\)\|/|$1|/g;
来缩短长度,但结果仍然包含类似的片段
(ba*b)|ba*c|ca*b|ca*c
,应该简化为
[bc]a*[bc]
我搜索 CPAN 并找到 Regexp::List、Regexp::Assemble 和 Regexp::Optimizer。前两个不适用,第三个有问题。首先,它不会通过测试,因此除非我在 cpan 中强制安装 Regexp::Optimizer
,否则我无法使用它。其次,即使我这样做了,它的表达也会让人窒息。
注意:除了 [regex] 之外,我还标记了这个 [regular-language],因为正则表达式仅使用连接、交替和 Kleene 星号,所以它实际上是正则表达式。
I have a regular expression that was the output of a computer program. It has things like
(((2)|(9)))*
which a human would undoubtedly write as
[29]*
So I'd like a program that can make simple transformations that make the regular expression more readable. So far I've been using a quick script
$r =~ s/\(([0-9])\)/$1/g;
$r =~ s/\(([0-9])\|([0-9])\)/[$1$2]/g;
$r =~ s/\(([0-9]|\[[0-9]+\])\)\*/$1*/g;
$r =~ s/\((\[[0-9]+\]\*)\)/$1/g;
$r =~ s/\|\(([^()]+)\)\|/|$1|/g;
that bring down the length, but the result still contains pieces like
(ba*b)|ba*c|ca*b|ca*c
that should be simplified to
[bc]a*[bc]
I searched CPAN and found Regexp::List, Regexp::Assemble, and Regexp::Optimizer. The first two don't apply and the third has issues. First, it won't pass its tests so I can't use it unless I force install Regexp::Optimizer
in cpan. Second, even once I do that it chokes on the expression.
Note: I tagged this [regular-language] in addition to [regex] because the regexp uses only concatenation, alternation, and the Kleene star, so it is in fact regular.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我觉得可能有一种方法可以做到这一点,将正则表达式转换为语法,将语法放入乔姆斯基范式,合并常见的非终结符,并使用一些比较启发式寻找模式。如果你不把它放在“真正的”CNF 中,你甚至可能会得到更简洁的答案......我会把 lambdas/epsilons 留在里面。
此时,也许您会发现一个识别
回代的启发式方法,
在子表达式上重复此操作,例如,
现在认识到 S -> (B|C)(A),我们可以得到
对于 Then 的最终解决方案
,你可以只寻找多余的括号来删除(注意串联是关联的,这本质上会将事物优化为串联范式,只需删除所有括号即可不只包含 | 分隔的选项列表...所以上面的
技巧是启发式的,这可能无法完成所有可能的优化,我不知道它会如何。不过,这可能是值得考虑的事情。
I feel like there might be a way to do this by converting the regular expression into a grammar, putting the grammar into Chomsky Normal Form, merging common nonterminals, and looking for patterns using some comparison heurstic. You might even get more concise answers if you don't put it in "real" CNF... I'd leave the lambdas/epsilons inside.
At this point, maybe you find a heuristic that recognizes
Backsubstituting,
Repeat this on subexpressions, e.g.,
Now recognizing that S -> (B|C)(A), we can get
For a final solution of
Then, you can just look for excess parentheses to remove (noting that concatenation is associative, and this will essentially optimize things into concatenative normal form, just drop all parentheses that don't enclose only a |-delimited list of options... so the above becomes
The trick is coming up with the heuristic, and this may not do all the optimizations which are possible. I don't know how it will perform. Still, it might be something to consider.