为什么递归正则表达式不是正则表达式?

发布于 2024-08-21 12:54:08 字数 194 浏览 5 评论 0 原文

我正在阅读 this 中的一些回复 提问,看到有几个人说递归正则表达式严格来说不是正则表达式。

这是为什么呢?

I was reading through some of the responses in this question and saw that a few people said that recursive regular expressions were not strictly speaking regular expressions.

Why is this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

七色彩虹 2024-08-28 12:54:08

“严格”正则表达式描述正则语言。但是许多功能(例如在表达式本身中使用反向引用或递归)可用于编写接受非正则语言的正则表达式。

例如, 描述的语言

(a+)b+\1

不是常规的,因为您不能强制 ab 之前和之后出现相同的次数。至少不是用普通语言。对于上下文无关甚至上下文相关的语言,这是完全不同的事情。

然而,仅使用基本事物(例如各种量词、字符类等)的正则表达式通常仍然描述正则语言。

“Strictly” regular expressions describe regular languages. But many features, such as the usage of backreferences in the expression itself or recursion for example, can be used to write regular expressions that accept non-regular languages.

For example, the language described by

(a+)b+\1

isn't regular, as you can't force that a appears the same number of times before and after the bs. At least not in a regular language. With context-free or even context-sensitive languages, that's a completely different matter.

However, regular expressions that only use elementary things such as the various quantifiers, character classes, etc. usually still describe regular languages.

说不完的你爱 2024-08-28 12:54:08

所有常规语言都可以被有限自动机识别。有限自动机具有有限数量的状态,因此具有有限的内存(因此得名)。递归“正则”表达式需要潜在无限的堆栈空间来进行递归,因此不可能用有限自动机来识别它,因此它不是正则的。

All regular languages can be recognized by a finite automaton. A finite automaton has a finite number of states, and consequently, finite memory (hence the name). A recursive "regular" expression requires a potentially infinite stack space to do the recursion, thus it is not possible to recognize it with a finite automaton, therefore it is not regular.

别低头,皇冠会掉 2024-08-28 12:54:08

理论计算机科学对常规语言的严格定义可能看起来很抽象,几乎没有什么实际好处,但如果您曾经面临过需要实现一个状态机来识别某些输入,那么您可以节省大量无用的精力和麻烦,如果你可以预先证明这是不可能的。

一种非正式的表达方式是,识别常规语言不能需要任意数量的内存。 正则语言的泵引理对于证明特定语言(即 /em>,一组字符串)无法被确定性有限自动机识别。

来自形式语言和自动机简介 Peter Linz(第 115 页,第 3 版):

定理4.8

L为无限正则语言。那么存在一些正整数m使得任何w ε L与|w|  ;≥ m可以分解为

<块引用>

w = xyz

<块引用>

|xy| ≤

<块引用>

|y| ≥1,

这样

<块引用>

wi = xyiz — 等式(4.2)

对于所有 i = 0、1、2、...,

也在 L

为了识别无限语言,有限自动机必须“泵送”或重复它的状态,这就是 yi 的函数(表示某些字符串 y 重复 i 次)。

几乎所有涉及泵引理的证明都涉及反证法。在第 117 页,作者证明了语言 L = { anbn : n ≥ 0 }—ie,格式为 aaa...bbb... 的字符串,其中 all-a 和 all- b 子串长度相等——不规则:

假设L是正则的,因此泵送引理必须成立。我们不知道m的值,但无论它是什么,我们总是可以选择n = m。因此,子字符串y必须完全由a组成。假设|y| =k。那么式(4.2)中使用i = 0得到的字符串为

<块引用>

w0 = amkbm

并且显然不在L中。这与泵引理相矛盾,从而表明 L 是正则的假设必定是错误的。

其他非正则语言示例:

  • L = { wwR : w ∈ Σ* } — ie,回文
  • L = { w ∈ Σ* : n< sub>a(w) < nb(w) } — iea 的数量少于b 的数量
  • L = { an! : n ≥ 0 }
  • L = { anbl : nl< /em> }
  • L = { anbl : n + < em>l 是素数 }

事实证明,我们宽松地称为正则表达式的功能要强大得多:将正则表达式与反向引用相匹配是 NP 难

The strict definition of regular language from theoretical computer science may seem abstract with little practical benefit, but if you’re ever faced with the need to implement a state machine to recognize certain inputs, you can save yourself a lot of useless effort and hairpulling if you can prove up front that it can’t be done.

An informal way to express it is recognition of a regular language cannot require an arbitrary amount of memory. The pumping lemma for regular languages is useful for proving that a particular language (i.e., a set of strings) cannot be recognized by a deterministic finite automaton.

From An Introduction to Formal Languages and Automata by Peter Linz (pg. 115, 3rd ed.):

Theorem 4.8

Let L be an infinite regular language. Then there exists some positive integer m such that any w ∈ L with |w| ≥ m can be decomposed as

w = xyz,

with

|xy| ≤ m,

and

|y| ≥ 1,

such that

wi = xyiz — Eq. (4.2)

is also in L for all i = 0, 1, 2, …

To recognize an infinite language, a finite automaton must “pump” or repeat some portion of its states, and that’s the function of yi (notation for some string y repeated i times).

Very nearly all proofs involving the pumping lemma involve proof by contradiction. On page 117, the author proves that the language L = { anbn : n ≥ 0 }—i.e., strings of the form aaa…bbb… where the all-a and all-b substrings are equal in length—is not regular:

Assume that L is regular, so that the pumping lemma must hold. We do not know the value of m, but whatever it is, we can always choose n = m. Therefore, the substring y must consist entirely of a's. Suppose |y| = k. Then the string obtained by using i = 0 in Equation (4.2) is

w0 = am-kbm

and is clearly not in L. This contradicts the pumping lemma and thereby indicates that the assumption that L is regular must be false.

Other examples of languages that are not regular:

  • L = { wwR : w ∈ Σ* } — i.e., palindromes
  • L = { w ∈ Σ* : na(w) < nb(w) } — i.e., number of as fewer than number of bs
  • L = { an! : n ≥ 0 }
  • L = { anbl : nl }
  • L = { anbl : n + l is a prime number }

It turns out that what we loosely call regular expressions are considerably more powerful: matching regular expressions with backreferences is NP-hard!

南七夏 2024-08-28 12:54:08

其他答案的基础需要理解所涉及的计算理论。如果您唯一接触正则表达式是在编程环境中,您可能没有意识到正则表达式也是数学构造。关于正则表达式的维基百科文章可能会为正则表达式的理论方面提供一些背景知识。

The basis of the other answers requires an understanding of the involved computation theory. If your only exposure to regular expressions is in a programming environment, you may not realize that regular expressions are mathematical constructs, as well. The wikipedia article on regular expressions may provide some background into the theoretical aspects of regular expressions.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文