Is there a way to sort a list of regular expressions by specificity?
I'm looking for something that allows me to sort a list of regular expressions according to their specificity/strictness, or for some documentation and research on the topic. For example:
/[a-z]+/ // most strict
/[a-z0-9]+/
/[a-z0-9èòà]+/ // less strict
/.*/
But how about this pair?
/[a-z]+ABC/
/[a-z0-9]+/
Which one is less specific than the other?
Thank you in advance.
Comments (3)
One can equate a regular expression to the set of strings it matches (called a "regular language"). If our regular expression is named E, let's call its set of matching strings L(E).
Strictness in the sense you are alluding to above then becomes the subset relation: define RE A to be stricter than RE B if L(A) is a proper subset of L(B). This puts to rest ambiguities like synonyms for the "same" RE: they are the same precisely because they have the same regular language.
As @yi_H points out, the subset relation over RE languages (over some common alphabet) forms a partial ordering. You sound like you want a total ordering. If so, you can stipulate that an acceptable total ordering should embed the partial ordering represented by the subset relation.
I don't have a clear answer for how to build that total ordering, but two approaches come to mind.
The first is to exploit the pumping lemma. It turns out that for any RE, if it matches a sufficiently long string, then it must also match a longer string constructible from the first by repeating some subsection. You could ask what is the length of the longest matching string that does not have any such repeating segments, and make that your metric. Maybe that respects (embeds) the partial ordering, maybe it doesn't.
The other is to consider graph transformations on the RE's state machine. I suspect (but I don't have any reference) that if RE A is properly stricter than RE B, then B's automaton will be calculable from A's by collapsing states or some similar simplifying action. You could define your metric to be the number of states in the RE's smallest automaton.
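Independently of either metric, any topological sort of the subset relation gives a total order that embeds it. The sketch below is only an illustration under stated assumptions: the helper names (sample_language, strictest_first) are made up for this example, and the subset test is approximated by enumerating every string up to a small length over a tiny hand-picked alphabet, so it is evidence rather than a proof of inclusion. It needs Python 3.9+ for graphlib.

```python
import itertools
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sample_language(pattern, alphabet, max_len):
    """Strings over `alphabet` of length <= max_len that the pattern fully matches."""
    rx = re.compile(pattern)
    return frozenset(
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
        if rx.fullmatch("".join(chars))
    )


def strictest_first(patterns, alphabet, max_len):
    """One total order (out of possibly many) that respects sampled proper inclusion."""
    langs = {p: sample_language(p, alphabet, max_len) for p in patterns}
    sorter = TopologicalSorter()
    for p in patterns:
        sorter.add(p)  # make sure incomparable patterns still appear
    for a, b in itertools.permutations(patterns, 2):
        if langs[a] < langs[b]:  # proper subset on the sampled strings
            sorter.add(b, a)     # a is stricter, so it must come before b
    return list(sorter.static_order())


patterns = [r".*", r"[a-z0-9èòà]+", r"[a-z]+ABC", r"[a-z0-9]+", r"[a-z]+"]
order = strictest_first(patterns, alphabet="a0èABC", max_len=4)
print(order)
```

Incomparable patterns, such as /[a-z]+ABC/ and /[a-z0-9]+/ from the question, may land in either relative order. That is exactly the freedom a partial order leaves open.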
As your second example shows, you cannot have a total ordering of regular expressions by specificity; only a partial order is possible.
To make things even worse, there are dozens of ways you can write the same regular expression: [ab]b vs (ab|bb), aa* vs a+. So even deciding whether two regexes are equivalent is not a simple task.
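A cheap way to probe whether two spellings might be equivalent is to compare them on every short string. The sketch below assumes a small alphabet and a length bound; agree_up_to is a hypothetical helper, not a library function. Finding a disagreement disproves equivalence, but finding none only suggests it.

```python
import itertools
import re


def agree_up_to(pattern_a, pattern_b, alphabet, max_len):
    """Return the first string the two patterns disagree on, or None if there is none."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(a.fullmatch(s)) != bool(b.fullmatch(s)):
                return s  # counterexample: exactly one pattern matches s
    return None


print(agree_up_to(r"[ab]b", r"(ab|bb)", "ab", 5))
print(agree_up_to(r"aa*", r"a+", "ab", 5))
print(repr(agree_up_to(r"a*", r"a+", "ab", 5)))
```

The first two calls find no counterexample. The third returns the empty string, which a* matches and a+ does not.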
Assuming you're talking about pure regular expressions, rather than the crazy Perl stuff, you can define a partial order on regular expressions that matches your question, based on the set of strings they accept (i.e., view the regular expression as a regular language).
Given that the difference, intersection, and emptiness of regular languages are decidable problems, there are algorithms that will tell you whether one of your expressions accepts all the strings another one does. Specifically, every string accepted by A is also accepted by B exactly when the difference L(A) minus L(B) is empty.
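One way to make such an inclusion check concrete is with Brzozowski derivatives, swapped in here as an illustration rather than the specific algorithm the answer has in mind. The sketch assumes pure regular expressions built by hand from a tiny tuple-based AST (all names below are hypothetical), and a reduced alphabet standing in for a-z, 0-9 and A-Z. It explores pairs of derivatives and reports failure as soon as the first expression can accept a string that the second cannot.

```python
from collections import deque

EMPTY = ("empty",)  # matches no string
EPS = ("eps",)      # matches only the empty string

def chars(s):
    return ("chars", frozenset(s))

def cat(r, s):
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    return r if s == EPS else ("cat", r, s)

def alt(*rs):
    items = set()  # a set keeps alternation associative, commutative, idempotent
    for r in rs:
        if r[0] == "alt":
            items |= r[1]
        elif r != EMPTY:
            items.add(r)
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items)) if items else EMPTY

def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def plus(r):
    return cat(r, star(r))

def nullable(r):
    """True if r accepts the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    return False

def deriv(r, c):
    """Expression for what may follow after r has consumed the character c."""
    tag = r[0]
    if tag == "chars":
        return EPS if c in r[1] else EMPTY
    if tag == "cat":
        first = cat(deriv(r[1], c), r[2])
        return alt(first, deriv(r[2], c)) if nullable(r[1]) else first
    if tag == "alt":
        return alt(*(deriv(x, c) for x in r[1]))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    return EMPTY

def is_subset(a, b, alphabet):
    """True if every string over `alphabet` accepted by a is also accepted by b."""
    seen = {(a, b)}
    queue = deque(seen)
    while queue:
        ra, rb = queue.popleft()
        if nullable(ra) and not nullable(rb):
            return False  # some string is accepted by a but rejected by b
        for c in alphabet:
            nxt = (deriv(ra, c), deriv(rb, c))
            if nxt[0] != EMPTY and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

ALPHABET = "ab01ABC"  # reduced stand-in for a-z, 0-9, A-Z
lower_plus = plus(chars("ab"))                                       # like /[a-z]+/
alnum_plus = plus(chars("ab01"))                                     # like /[a-z0-9]+/
lower_abc = cat(plus(chars("ab")), cat(chars("A"), cat(chars("B"), chars("C"))))  # like /[a-z]+ABC/

print(is_subset(lower_plus, alnum_plus, ALPHABET), is_subset(alnum_plus, lower_plus, ALPHABET))
print(is_subset(lower_abc, alnum_plus, ALPHABET), is_subset(alnum_plus, lower_abc, ALPHABET))
```

On this reduced alphabet the first pair is comparable in one direction only, while the /[a-z]+ABC/ and /[a-z0-9]+/ analogues fail in both directions. That matches the question's second example: neither is less specific than the other.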