生成与某些输入集匹配的正则表达式是一个可解决的问题吗?
我提供了一些输入集,其中包含已知的分隔数量的文本块。
我想制作一个程序,自动生成 1 个或多个正则表达式,每个正则表达式与输入集中的每个文本块匹配。
我看到了一些相对简单的方法来实现强力搜索。但我不是编译器理论方面的专家。这就是为什么我很好奇:
1)这个问题可以解决吗?或者有一些原则上不可能做出这样的算法?
2)该算法是否可以实现多项式复杂度并避免暴力破解?
I provide some input set which contains known separated number of text blocks.
I want to make a program that automatically generate 1 or more regular expressions each of which matches every text block in the input set.
I see some relatively easy ways to implement a brute-force search. But I'm not an expert in compilers theory. That's why I'm curious:
1) is this problem solvable? or there are some principle impossibility to make such algorithm?
2) is it possible to achieve polynomial complexity for this algorithm and avoid brute forcing?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
http://txt2re.com/ 可能就是您想要的。
http://txt2re.com/ might be what you want.
“.*”是一个或多个正则表达式,它将匹配输入集中的每个文本块。 ;-)
".*" is one-or-more regexp that will match every text block in the input set. ;-)
问题是,有大量的正则表达式(实际上是无限多个)将匹配给定的一组输入。它们的范围从非常“贪婪”的表达式(将匹配所有内容)
到非常非“贪婪”的表达式(将与输入集完全匹配)
在这两者之间,您可以通过多种方式改变表达式以使其变得更加贪婪或不那么贪婪(例如,用匹配任何数字的表达式替换特定数字等)。
您必须告诉我们更多有关该问题的信息,以便我们知道在这一系列可能的答案中哪些是正确的答案;)
The problem is, there are a huge number of regular expressions (actually, an infinite number) that will match a given set of inputs. They range from very "greedy" expressions that will match everything
To very non "greedy" expression that will match exactly the input set
In between those two you can vary the expression in a variety of ways to make it more and less greedy (eg, replace specific digits with an expression which matches any digit, etc).
You'll have to tell us a little more about the problem for us to know where in that range of possible answers is the correct one ;)
好的,在评论中,您澄清了您想要匹配所有等于输入集中的字符串之一的字符串,或者仅在给定“变化”级别内与其中一个字符串不同的字符串。由于您没有准确定义“变化”,我将使用 Levenshtein 距离。
给定一个字符串
s
和一个整数n
,您可以构建正则表达式,它完全匹配编辑距离为n
或更小的所有字符串到该字符串,如下所示:首先,我们编写一个函数,给定
s
和n
返回一个简单正则表达式列表,这些正则表达式一起匹配距离为的所有字符串n
或更少至s
。这里的“简单正则表达式”是指仅包含文字字符和通配符的正则表达式。对于
n=0
,该函数仅返回[s]
。否则,它会计算n-1
的列表,然后遍历其中的每个项目。对于每个项目r
和每个位置i
,其中0 <= i
length(r)
,它将以下正则表达式添加到列表中:.
已添加到r
位置i
的正则表达式>。r
的第i
字符已被删除。r
的第i
字符已替换为.
。现在,为了计算一组给定字符串和给定 n 值的正则表达式,我们计算每个字符串的列表,然后将所有正则表达式或所有正则表达式一起合并为一个大正则表达式。
请注意,这很快就会导致非常大的正则表达式。
Ok, in a comment you clarified that you want to match all strings which are either equal to one of the strings in the input set, or only differ from one of them within a given level of 'variation'. Since you didn't define 'variation' exactly, I'm going to use the Levenshtein distance.
Given a string
s
and an integern
, you can build the regex, which matches exactly all strings that have a Levenshtein distance ofn
or less to that string, like this:First we write a function that given
s
andn
returns a list of simple regexen which when together match all strings with a distance ofn
or less tos
. Here "simple regex" means a regex which only contains literal characters and wildcards.For
n=0
that function just returns[s]
. Otherwise it computes the list forn-1
and then goes through each item in it. For each itemr
and each positioni
where0 <= i < length(r)
, it adds the following regexen to the list:.
has been added tor
at positioni
.i
th character ofr
has been deleted.i
th character ofr
has been replaced by a.
.Now to calculate the regex for a given set of strings and a given value for
n
, we calculate the list for each string and thenor
all the regexen together into one big regex.Note that this will lead to very big regexen very soon.