How to auto-generate a regex from a given list of strings?
You are given two lists of strings, A and B. Find the shortest regex that matches all strings in A and none in B. Note that this regex may or may not match other strings that are in neither A nor B. For simplicity, we can assume that our alphabet has just two characters, 0 and 1. Also, only these operators are allowed:
* - 0 or more
? - 0 or 1
+ - 1 or more
() - brackets
For simplicity, the regex not operator is not allowed. I don't know whether allowing the or operator (|) would simplify the problem. A and B, of course, have no common elements. Here are some examples:
A=[00,01,10]
B=[11]
answer = 1*0+1*
A=[00,01,11]
B=[10]
answer = 0*1*
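The two example answers can be checked mechanically. The sketch below is in Python and assumes that "matches" means the regex matches the whole string (hence `re.fullmatch`); the helper name `separates` is made up for illustration.

```python
import re

def separates(pattern, accept, reject):
    """True if pattern fully matches every string in accept and none in reject."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(s) for s in accept)
            and not any(rx.fullmatch(s) for s in reject))

# The two examples from the question.
print(separates("1*0+1*", ["00", "01", "10"], ["11"]))
print(separates("0*1*", ["00", "01", "11"], ["10"]))
```

Both calls should report that the given answers are valid, although this says nothing about whether they are the shortest possible.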
One way to solve this is with a genetic algorithm. I happen to have a genetic solver lying around, so I applied it to your problem with the following algorithm:
how many undesired things it matches
Here's my implementation in C#
And the result of its application to your samples:
output:
second sample:
output:
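As a rough illustration of the scoring idea in this answer (this is not the author's C# code), a fitness function for such a solver might look like the Python sketch below. Only the "how many undesired things it matches" criterion comes from the answer; counting missed wanted strings and using pattern length as a tie-breaker are assumptions, as is the name `fitness`.

```python
import re

def fitness(pattern, wanted, unwanted):
    """Score a candidate regex; lower tuples are better.

    Hypothetical ordering: first the number of wanted strings missed,
    then the number of undesired strings matched, then the pattern length.
    """
    try:
        rx = re.compile(pattern)
    except re.error:
        # Candidates that are not valid regexes get the worst possible score.
        return (float("inf"), float("inf"), len(pattern))
    missed = sum(1 for s in wanted if not rx.fullmatch(s))
    undesired = sum(1 for s in unwanted if rx.fullmatch(s))
    return (missed, undesired, len(pattern))
```

A genetic solver would then mutate and recombine candidate patterns and keep those with the lowest score.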
If this were a homework problem, it would be the "one homework, get an A in the class" type.
I think there is an "or" operator missing somewhere in that question.
There is an obvious solution, A0|A1|A2|..., but finding the shortest regex seems much harder.
I would suggest using recursion to try to shorten the regex, but that is not an ideal solution.
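The obvious alternation baseline is easy to build, for example in Python as sketched below. It relies on the `|` operator, which the question does not list as allowed, and the function name `alternation` is made up for illustration.

```python
import re

def alternation(accept):
    """Join the accepted strings with '|', escaping any regex metacharacters."""
    return "|".join(re.escape(s) for s in accept)

pattern = alternation(["00", "01", "10"])
print(pattern)
print(bool(re.fullmatch(pattern, "10")), bool(re.fullmatch(pattern, "11")))
```

Because A and B are disjoint, this pattern always accepts exactly A and rejects all of B, so it serves as an upper bound on the length of the shortest answer when `|` is permitted.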
This project generates a regexp from a given list of words:
https://github.com/bwagner/wordhierarchy
However, it only uses "|", the non-capturing group "(?:)" and the option "?". Sample usage:
"When in doubt, use brute force."
This produces a different but equally good answer for the first one: 0*1?0*. It looks at 1241 trial regexes to solve the two test cases (total). The search has a size limit -- since the general-regex version of this problem is NP-hard, any program for it is going to run into trouble on complex-enough inputs. I'll cop to not having really thought about this simplified problem. I'd love to see some neat less-obvious answers.
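A brute-force search along these lines can be sketched in Python as follows. This is not the answerer's program: it naively enumerates every string over the seven allowed symbols in order of length, so it examines far more candidates than the 1241 mentioned above. The names `shortest_regex` and `well_formed` and the default size limit `max_len=6` are assumptions. Stacked quantifiers such as `0*?` are rejected up front, since Python's `re` module would otherwise read them as lazy or possessive quantifiers rather than the operators in the question.

```python
import re
from itertools import product

SYMBOLS = "01*?+()"
QUANTIFIERS = "*?+"

def well_formed(candidate):
    """Reject quantifiers that do not directly follow a literal or ')'."""
    prev = "("  # sentinel: a quantifier may not start the pattern
    for ch in candidate:
        if ch in QUANTIFIERS and prev in QUANTIFIERS + "(":
            return False
        prev = ch
    return True

def shortest_regex(accept, reject, max_len=6):
    """Return a shortest pattern that fully matches all of accept and none of reject."""
    for length in range(1, max_len + 1):
        for chars in product(SYMBOLS, repeat=length):
            candidate = "".join(chars)
            if "()" in candidate or not well_formed(candidate):
                continue
            try:
                rx = re.compile(candidate)
            except re.error:
                continue  # e.g. unbalanced parentheses
            if (all(rx.fullmatch(s) for s in accept)
                    and not any(rx.fullmatch(s) for s in reject)):
                return candidate
    return None  # nothing found within the size limit

print(shortest_regex(["00", "01", "10"], ["11"]))
print(shortest_regex(["00", "01", "11"], ["10"]))
```

The search space grows as 7^n with the pattern length n, which is why a size limit is needed and why this approach only works for small inputs.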