I was reading through some of the responses in this question and saw that a few people said that recursive regular expressions were not strictly speaking regular expressions.
不是常规的,因为您不能强制 a 在 b 之前和之后出现相同的次数。至少不是用普通语言。对于上下文无关甚至上下文相关的语言,这是完全不同的事情。
然而,仅使用基本事物(例如各种量词、字符类等)的正则表达式通常仍然描述正则语言。
“Strictly” regular expressions describe regular languages. But many features, such as the usage of backreferences in the expression itself or recursion for example, can be used to write regular expressions that accept non-regular languages.
For example, the language described by
(a+)b+\1
isn't regular, as you can't force that a appears the same number of times before and after the bs. At least not in a regular language. With context-free or even context-sensitive languages, that's a completely different matter.
However, regular expressions that only use elementary things such as the various quantifiers, character classes, etc. usually still describe regular languages.
All regular languages can be recognized by a finite automaton. A finite automaton has a finite number of states, and consequently, finite memory (hence the name). A recursive "regular" expression requires a potentially infinite stack space to do the recursion, thus it is not possible to recognize it with a finite automaton, therefore it is not regular.
The strict definition of regular language from theoretical computer science may seem abstract with little practical benefit, but if you’re ever faced with the need to implement a state machine to recognize certain inputs, you can save yourself a lot of useless effort and hairpulling if you can prove up front that it can’t be done.
An informal way to express it is recognition of a regular language cannot require an arbitrary amount of memory. The pumping lemma for regular languages is useful for proving that a particular language (i.e., a set of strings) cannot be recognized by a deterministic finite automaton.
Let L be an infinite regular language. Then there exists some positive integer m such that any w ∈ L with |w| ≥ m can be decomposed as
w = xyz,
with
|xy| ≤ m,
and
|y| ≥ 1,
such that
wi = xyiz — Eq. (4.2)
is also in L for all i = 0, 1, 2, …
To recognize an infinite language, a finite automaton must “pump” or repeat some portion of its states, and that’s the function of yi (notation for some string y repeated i times).
Very nearly all proofs involving the pumping lemma involve proof by contradiction. On page 117, the author proves that the language L = { anbn : n ≥ 0 }—i.e., strings of the form aaa…bbb… where the all-a and all-b substrings are equal in length—is not regular:
Assume that L is regular, so that the pumping lemma must hold. We do not know the value of m, but whatever it is, we can always choose n = m. Therefore, the substring y must consist entirely of a's. Suppose |y| = k. Then the string obtained by using i = 0 in Equation (4.2) is
w0 = am-kbm
and is clearly not in L. This contradicts the pumping lemma and thereby indicates that the assumption that L is regular must be false.
Other examples of languages that are not regular:
L = { wwR : w ∈ Σ* } — i.e., palindromes
L = { w ∈ Σ* : na(w) < nb(w) } — i.e., number of as fewer than number of bs
L = { an! : n ≥ 0 }
L = { anbl : n ≠ l }
L = { anbl : n + l is a prime number }
It turns out that what we loosely call regular expressions are considerably more powerful: matching regular expressions with backreferences is NP-hard!
The basis of the other answers requires an understanding of the involved computation theory. If your only exposure to regular expressions is in a programming environment, you may not realize that regular expressions are mathematical constructs, as well. The wikipedia article on regular expressions may provide some background into the theoretical aspects of regular expressions.
发布评论
评论(4)
“严格”正则表达式描述正则语言。但是许多功能(例如在表达式本身中使用反向引用或递归)可用于编写接受非正则语言的正则表达式。
例如, 描述的语言
不是常规的,因为您不能强制
a
在b
之前和之后出现相同的次数。至少不是用普通语言。对于上下文无关甚至上下文相关的语言,这是完全不同的事情。然而,仅使用基本事物(例如各种量词、字符类等)的正则表达式通常仍然描述正则语言。
“Strictly” regular expressions describe regular languages. But many features, such as the usage of backreferences in the expression itself or recursion for example, can be used to write regular expressions that accept non-regular languages.
For example, the language described by
isn't regular, as you can't force that
a
appears the same number of times before and after theb
s. At least not in a regular language. With context-free or even context-sensitive languages, that's a completely different matter.However, regular expressions that only use elementary things such as the various quantifiers, character classes, etc. usually still describe regular languages.
所有常规语言都可以被有限自动机识别。有限自动机具有有限数量的状态,因此具有有限的内存(因此得名)。递归“正则”表达式需要潜在无限的堆栈空间来进行递归,因此不可能用有限自动机来识别它,因此它不是正则的。
All regular languages can be recognized by a finite automaton. A finite automaton has a finite number of states, and consequently, finite memory (hence the name). A recursive "regular" expression requires a potentially infinite stack space to do the recursion, thus it is not possible to recognize it with a finite automaton, therefore it is not regular.
理论计算机科学对常规语言的严格定义可能看起来很抽象,几乎没有什么实际好处,但如果您曾经面临过需要实现一个状态机来识别某些输入,那么您可以节省大量无用的精力和麻烦,如果你可以预先证明这是不可能的。
一种非正式的表达方式是,识别常规语言不能需要任意数量的内存。 正则语言的泵引理对于证明特定语言(即 /em>,一组字符串)无法被确定性有限自动机识别。
来自形式语言和自动机简介 Peter Linz(第 115 页,第 3 版):
为了识别无限语言,有限自动机必须“泵送”或重复它的状态,这就是 yi 的函数(表示某些字符串 y 重复 i 次)。
几乎所有涉及泵引理的证明都涉及反证法。在第 117 页,作者证明了语言 L = { anbn : n ≥ 0 }—ie,格式为 aaa...bbb... 的字符串,其中 all-a 和 all- b 子串长度相等——不规则:
其他非正则语言示例:
事实证明,我们宽松地称为正则表达式的功能要强大得多:将正则表达式与反向引用相匹配是 NP 难!
The strict definition of regular language from theoretical computer science may seem abstract with little practical benefit, but if you’re ever faced with the need to implement a state machine to recognize certain inputs, you can save yourself a lot of useless effort and hairpulling if you can prove up front that it can’t be done.
An informal way to express it is recognition of a regular language cannot require an arbitrary amount of memory. The pumping lemma for regular languages is useful for proving that a particular language (i.e., a set of strings) cannot be recognized by a deterministic finite automaton.
From An Introduction to Formal Languages and Automata by Peter Linz (pg. 115, 3rd ed.):
To recognize an infinite language, a finite automaton must “pump” or repeat some portion of its states, and that’s the function of yi (notation for some string y repeated i times).
Very nearly all proofs involving the pumping lemma involve proof by contradiction. On page 117, the author proves that the language L = { anbn : n ≥ 0 }—i.e., strings of the form aaa…bbb… where the all-a and all-b substrings are equal in length—is not regular:
Other examples of languages that are not regular:
It turns out that what we loosely call regular expressions are considerably more powerful: matching regular expressions with backreferences is NP-hard!
其他答案的基础需要理解所涉及的计算理论。如果您唯一接触正则表达式是在编程环境中,您可能没有意识到正则表达式也是数学构造。关于正则表达式的维基百科文章可能会为正则表达式的理论方面提供一些背景知识。
The basis of the other answers requires an understanding of the involved computation theory. If your only exposure to regular expressions is in a programming environment, you may not realize that regular expressions are mathematical constructs, as well. The wikipedia article on regular expressions may provide some background into the theoretical aspects of regular expressions.