如何确定一个正则表达式是否与另一个正则表达式正交?
我想我的问题最好用一个(简化的)例子来解释。
正则表达式 1:
^\d+_[a-z]+$
正则表达式 2:
^\d*$
正则表达式 1 将从不匹配正则表达式 2 匹配的字符串。 因此,假设正则表达式 1 与正则表达式 2 正交。
正如许多人问我所说的正交是什么意思,我将尝试澄清它:
让 S1 em> 是正则表达式 1 匹配的(无限)字符串集。 S2 是正则表达式 2 匹配的字符串集。 正则表达式 2 与正则表达式 1 正交当 S1 和 S2 的交集为空。 正则表达式 ^\d_a$ 将不正交,因为字符串“2_a”位于集合 S1 和 S2 中。
如果两个正则表达式彼此正交,如何以编程方式确定它?
最好的情况是一些实现如下方法的库:
/**
* @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(13)
我所说的“正交”是指“交集是空集”吗?
我会构造交集的正则表达式,然后转换为正常形式的正则语法,看看它是否是空语言......
话又说回来,我是一个理论家......
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
这就像用大炮射麻雀一样。 为什么不直接构建产品自动机并检查是否可以从初始状态到达接受状态? 这也将立即在交集中为您提供一个字符串,而无需先构建正则表达式。
我只知道一种方法,涉及从正则表达式创建 DFA,这是指数时间(在退化情况下)。 它可以简化为停机问题,因为一切都是如此,但停机问题不能可以简化为它。
这是不正确的。 您可以拥有两个在边缘标记多图意义上非同构的 DFA (FSM),但接受相同的语言。 另外,如果不是这种情况,您的测试将检查两个正则表达式是否接受不相同的语言,而 OP 需要非重叠的语言(空交集)。
另外,请注意 \1、\2、...、\9 结构不是规则的:它不能用串联、并集和 *(Kleene 星号)来表示。 如果你想包括回替换,我不知道答案是什么。 同样有趣的是,上下文无关语言的相应问题是不可判定的:不存在采用两个上下文无关文法 G1 和 G2 并返回 true iff L(G1) ∩ L(g2) ≠ Ø 的算法。
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
这个问题发布已经两年了,但我很高兴地说,现在只需在此处调用“genex”程序即可确定:https://github.com/audreyt/regex-genex
空输出意味着没有与两个正则表达式匹配的字符串。 如果它们有任何重叠,它将输出整个重叠列表:
希望这有帮助!
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
Hope this helps!
fsmtools 可以在有限状态机上执行各种操作,这是您唯一的问题是将正则表达式的字符串表示形式转换为 fsmtools 可以使用的格式。 这对于简单的情况来说绝对是可能的,但在存在高级功能(例如look{ahead,behind})的情况下会很棘手。
您也可以看看 OpenFst,尽管我从未使用过它。 不过它支持交叉。
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
[我正在回复评论]
IIRC,
a^nb^ma^nb^m
不是上下文无关,因此(a\*)(b\*)\1\2
也不是,因为它是相同的。 ISTR <代码>{ ww | w ∈ L } 即使 L 是“nice”,也不是“nice”,因为nice 是常规
、上下文无关
之一。我修改我的声明:RE 中的所有内容都可以简化为停止问题;-)
[I'm replying to a comment]
IIRC,
a^n b^m a^n b^m
is not context free, and so(a\*)(b\*)\1\2
isn't either since it's the same. ISTR{ ww | w ∈ L }
not being "nice" even if L is "nice", for nice being one ofregular
,context-free
.I modify my statement: everything in RE is reducible to the halting problem ;-)
我终于找到了我正在寻找的库:
dk.brics.automaton
用法:
应该是注意到该实现不支持也不能支持复杂的正则表达式功能,例如反向引用。 请参阅博客文章“更快的 Java 正则表达式包”其中引入了dk.brics.automaton。
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
您也许可以使用类似 Regexp::Genex< /a> 生成测试字符串来匹配指定的正则表达式,然后使用第二个正则表达式上的测试字符串来确定这两个正则表达式是否正交。
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
在某些情况下,证明一个正则表达式与另一个正则表达式正交可能很简单,例如同一位置中互斥的字符组。 对于除最简单的正则表达式之外的任何正则表达式来说,这都是一个不平凡的问题。 对于带有分组和反向引用的严肃表达式,我什至会说这可能是不可能的。
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
我相信 kdgregory 是正确,您使用正交来表示补充。
这是正确的吗?
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
首先我要说的是,我不知道如何构建这样的算法,也不知道有任何库可以实现它。 然而,如果得知任意复杂度的一般正则表达式不存在这样的情况,我一点也不感到惊讶。
每个正则表达式都定义了可以由该表达式生成的所有字符串的正则语言,或者如果您愿意,也可以定义与该正则表达式“匹配”的所有字符串的正则语言。 将语言视为一组字符串。 在大多数情况下,该集合将无限大。 您的问题询问正则表达式给出的两个集合的交集是否为空。
至少对于第一个近似,我无法想象在不计算集合的情况下回答这个问题的方法,对于无限集合来说,这将花费比你更长的时间。 我认为可能有一种方法可以计算有限集并确定何时对模式进行详细说明超出其他正则表达式的要求,但这并不简单。
例如,只需考虑简单的表达式
(ab)*
和(aba)*b
。 什么算法将决定从第一个表达式生成abab
然后停止,而不检查ababab
、abababab
等,因为它们会从来不工作? 您不能只生成字符串并检查直到找到匹配项,因为当语言不相交时,这永远不会完成。 我无法想象在一般情况下有什么可行的方法,但是在这种事情上有人比我更好。总而言之,这是一个难题。 如果我知道存在多项式时间解决方案,我会感到有点惊讶,而如果我知道它相当于停止问题,我一点也不感到惊讶。 尽管考虑到正则表达式不是图灵完备的,但似乎至少有可能存在解决方案。
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions
(ab)*
and(aba)*b
. What is the algorithm that will decide to generateabab
from the first expression and then stop, without checkingababab
,abababab
, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
我将执行以下操作:
使用类似以下结构的内容将每个正则表达式转换为 FSA:
请注意,这并不简单,但对于简单的正则表达式来说应该不会那么困难。
创建一个新的组合节点,例如:
根据左侧和右侧相同的字符构建链接,您将获得两个 FSANode,这两个 FSANode 构成一个新的组合节点。
然后从CombinedNode(leftStart, rightStart)开始,找到生成集,如果有任何无效的CombinedNode,则该集不是“正交的”。
I would do the following:
convert each regex to a FSA, using something like the following structure:
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
将每个正则表达式转换为 DFA。 从一个 DFA 的接受状态创建到第二个 DFA 的开始状态的 epsilon 转换。 您实际上已经通过添加 epsilon 转换创建了 NFA。 然后将NFA转换为DFA。 如果起始状态不是接受状态,并且接受状态是可达的,则两个正则表达式不是“正交的”。 (因为它们的交集非空。)
存在将正则表达式转换为 DFA 以及将 NFA 转换为 DFA 的已知过程。 你可以看看Sipser的《Introduction to the Theory of Computation》之类的书来了解程序,或者直接在网上搜索。 毫无疑问,许多本科生和研究生必须在一门或另一门“理论”课程中这样做。
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
我说得太早了。 我在原来的帖子中所说的不会成功,但是如果您可以将正则表达式转换为 DFA 形式,那么您可以尝试执行以下操作。
您可以在我在第一篇文章中提到的书中找到该过程:Sipser 的“计算理论导论”第二版。 它位于第 46 页,脚注中有详细信息。
该过程将为您提供一个新的 DFA,它是两个 DFA 的交集。 如果新的 DFA 具有可达接受状态,则交集非空。
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.