当前位置：文江博客话题详情

通俗地说，泵引理是什么？

发布于 2024-07-12 21:50:59 字数 382 浏览 7 评论 0原文

我看到这个问题，并且很好奇泵引理是什么（维基百科没有没有多大帮助）。

我知道这基本上是一个理论证明，为了使一种语言属于某个类别，它必须是正确的，但除此之外我真的不明白。

有人愿意尝试以非数学家/计算机科学博士可以理解的方式在相当细粒度的水平上解释它吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

庆幸我还是我 2024-07-19 21:50:59

泵引理是一个简单的证明，表明一种语言是不规则的，这意味着无法为其构建有限状态机。典型的例子是语言(a^n)(b^n)。这是一种简单语言，由任意数量的 a 组成，后跟相同数量的 b 。所以字符串

ab
aabb
aaabbb
aaaabbbb

等是语言中的，但

aab
bab
aaabbbbbb

等不是。

为这些示例构建 FSM 非常简单：

这将一直工作到 n=4。问题是我们的语言没有对 n 施加任何约束，并且有限状态机必须是有限的。无论我向这台机器添加多少个状态，有人可以给我一个输入，其中 n 等于状态数加一，我的机器就会失败。因此，如果可以构建一台机器来读取这种语言，那么其中必须有一个循环来保持状态数量有限。添加这些循环后：

我们语言中的所有字符串都将被接受，但有一个问题。在前四个 a 之后，机器会丢失输入了多少 a 的计数，因为它保持在相同的状态。这意味着在四个之后，我可以向字符串中添加任意数量的 a，而不添加任何 b，并且仍然获得相同的返回值。这意味着字符串：

aaaa(a*)bbbb

with (a*) 代表任意数量的 a，将被机器接受，即使它们显然不都是该语言的。在这种情况下，我们会说字符串 (a*) 的部分可以被泵送。有限状态机是有限的并且 n 是无界的，这一事实保证了任何接受该语言中所有字符串的机器都必须具有此属性。机器必须在某个点循环，并且在循环的点可以泵送语言。因此，无法为该语言构建有限状态机，并且该语言不是正则的。

请记住正则表达式和有限状态机是等效的，然后替换a和 b 带有开始和结束 Html 标记，可以相互嵌入，您可以看到为什么无法使用正则表达式解析Html

The pumping lemma is a simple proof to show that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n). This is the simple language which is just any number of as, followed by the same number of bs. So the strings

ab
aabb
aaabbb
aaaabbbb

etc. are in the language, but

aab
bab
aaabbbbbb

etc. are not.

It's simple enough to build a FSM for these examples:

This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:

all of the strings in our language will be accepted, but there is a problem. After the first four as, the machine loses count of how many as have been input because it stays in the same state. That means that after four, I can add as many as as I want to the string, without adding any bs, and still get the same return value. This means that the strings:

aaaa(a*)bbbb

with (a*) representing any number of as, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*) can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.

Remember that Regular Expressions and Finite State Machines are equivalent, then replace a and b with opening and closing Html tags which can be embedded within each other, and you can see why it is not possible to use regular expressions to parse Html

回复收藏 0 原文

菩提树下叶撕阳。 2024-07-19 21:50:59

它是一种旨在证明给定语言不能属于某一类的装置。

让我们考虑一下平衡括号的语言（表示符号“(”和“)”，并包括所有在通常含义中平衡的字符串，并且没有不平衡的字符串）。我们可以使用泵引理来证明这是不规则的。

（语言是一组可能的字符串。解析器是我们可以用来查看字符串是否在该语言中的某种机制，因此它必须能够区分该语言中的字符串与该语言之外的字符串之间的区别如果存在可以识别该语言的常规（或其他）解析器，区分该语言中的字符串和非该语言中的字符串，则该语言是“常规”（或“上下文无关”或“上下文敏感”或其他）。语言。）

LFSR Consulting 提供了很好的描述。我们可以将常规语言的解析器绘制为框和箭头的有限集合，箭头代表字符，框连接它们（充当“状态”）。（如果比这更复杂，那么它就不是一种常规语言。）如果我们能得到一个比盒子数量还要长的字符串，这意味着我们不止一次地通过一个盒子。这意味着我们有一个循环，并且我们可以根据需要多次执行该循环。

因此，对于常规语言，如果我们可以创建一个任意长的字符串，我们可以将其分为 xyz，其中 x 是我们需要到达循环开始的字符，y 是实际的循环，z 是我们想要的任何字符。需要使字符串在循环后有效。重要的是x和y的总长度是有限的。毕竟，如果长度大于盒子的数量，那么我们在执行此操作时显然已经穿过了另一个盒子，因此存在循环。

因此，在我们的平衡语言中，我们可以从编写任意数量的左括号开始。特别是，对于任何给定的解析器，我们可以编写比框更多的左括号，因此解析器无法判断有多少个左括号。因此，x 是一定数量的左括号，并且这是固定的。 y 也是一定数量的左括号，并且可以无限增加。我们可以说 z 是一些右括号。

这意味着我们的解析器可能会识别出一个由 43 个左括号和 43 个右括号组成的字符串，但是解析器无法从一个由 44 个左括号和 43 个右括号组成的字符串中分辨出来，而这不是我们的语言，所以解析器无法解析我们的语言。

由于任何可能的常规解析器都有固定数量的框，因此我们总是可以编写比这更多的左括号，并且通过泵引理，我们可以以解析器无法分辨的方式添加更多左括号。因此，平衡括号语言不能被正则解析器解析，因此不是正则表达式。

It's a device intended to prove that a given language cannot be of a certain class.

Let's consider the language of balanced parentheses (meaning symbols '(' and ')', and including all strings that are balanced in the usual meaning, and none that aren't). We can use the pumping lemma to show this isn't regular.

(A language is a set of possible strings. A parser is some sort of mechanism we can use to see if a string is in the language, so it has to be able to tell the difference between a string in the language or a string outside the language. A language is "regular" (or "context-free" or "context-sensitive" or whatever) if there is a regular (or whatever) parser that can recognize it, distinguishing between strings in the language and strings not in the language.)

LFSR Consulting has provided a good description. We can draw a parser for a regular language as a finite collection of boxes and arrows, with the arrows representing characters and the boxes connecting them (acting as "states"). (If it's more complicated than that, it isn't a regular language.) If we can get a string longer than the number of boxes, it means we went through one box more than once. That means we had a loop, and we can go through the loop as many times as we want.

Therefore, for a regular language, if we can create an arbitrarily long string, we can divide it into xyz, where x is the characters we need to get to the start of the loop, y is the actual loop, and z is whatever we need to make the string valid after the loop. The important thing is that the total lengths of x and y are limited. After all, if the length is greater than the number of boxes, we've obviously gone through another box while doing this, and so there's a loop.

So, in our balanced language, we can start by writing any number of left parentheses. In particular, for any given parser, we can write more left parens than there are boxes, and so the parser can't tell how many left parens there are. Therefore, x is some amount of left parens, and this is fixed. y is also some number of left parens, and this can increase indefinitely. We can say that z is some number of right parens.

This means that we might have a string of 43 left parens and 43 right parens recognized by our parser, but the parser can't tell that from a string of 44 left parens and 43 right parens, which isn't in our language, so the parser can't parse our language.

Since any possible regular parser has a fixed number of boxes, we can always write more left parens than that, and by the pumping lemma we can then add more left parens in a way that the parser can't tell. Therefore, the balanced parenthesis language can't be parsed by a regular parser, and therefore isn't a regular expression.

回复收藏 0 原文

清泪尽 2024-07-19 21:50:59

用外行人的话来说，这是一件很难理解的事情，但基本上正则表达式中应该有一个非空子字符串，可以根据需要重复多次，同时整个新单词对该语言仍然有效。

在实践中，泵送引理不足以证明语言正确，而是作为一种通过矛盾进行证明并显示语言不适合语言类别（常规或上下文无关）的方法）通过显示泵引理对其不起作用。

回复收藏 0 原文

被你宠の有点坏 2024-07-19 21:50:59

基本上，您有一种语言的定义（例如 XML），这是一种判断给定字符串（“单词”）是否是该语言的成员的方法。

泵引理建立了一种方法，您可以通过该方法从语言中选择一个“单词”，然后对其进行一些更改。该定理指出，如果语言是规则的，这些变化应该产生仍然来自同一语言的“单词”。如果你想出的单词不在该语言中，那么该语言首先就不可能是常规的。

回复收藏 0 原文

一张白纸 2024-07-19 21:50:59

简单的泵引理适用于常规语言，即由有限自动机等描述的字符串集。有限自动化的主要特征是它只有有限的内存，由其状态描述。

现在假设你有一个字符串，它被有限自动机识别，并且它的长度足以“超出”自动机的内存，即必须重复的状态。然后有一个子串，其中子串开头的自动机状态与子串末尾的状态相同。由于读取子字符串不会改变状态，因此它可能会被删除或重复任意次数，而自动机并不明智。所以这些修改后的字符串也必须被接受。

对于上下文无关语言，还有一个更复杂的泵引理，您可以在字符串中的两个位置删除/插入直观上被视为匹配括号的内容。

回复收藏 0 原文

榕城若虚 2024-07-19 21:50:59

用外行的话来说，我认为你几乎是对的。这是一种证明技术（实际上有两种），用于证明某种语言不属于某个类别。

例如，考虑一种包含无限数量字符串的常规语言（正则表达式、自动机等）。在某个时刻，正如星蓝所说，你会耗尽内存，因为字符串对于自动机来说太长了。这意味着必须有一个字符串块，自动机无法告诉您有多少个字符串副本（您处于循环中）。因此，在字符串中间有任意数量的该子字符串的副本，您仍然可以使用该语言。

这意味着，如果您的语言不具有此属性，即有一个足够长的字符串，其中包含 NO 子字符串，您可以重复任意次并且仍然处于该语言中，那么语言不规则。

回复收藏 0 原文

何以畏孤独 2024-07-19 21:50:59

例如，采用此语言 L = aⁿbⁿ。

现在尝试将上述语言的有限自动机可视化为一些n。

如果n = 1，则字符串w = ab。这里我们可以制作一个没有外循环的有限自动机
如果n = 2，则字符串w = a²b²。这里我们可以制作一个没有循环的有限自动机，

如果 n = p，字符串 w = a^{pb^p}。本质上，可以假设有限自动机具有 3 个阶段。
第一阶段，需要一系列输入并进入第二阶段。同样，从阶段 2 到阶段 3。我们将这些阶段称为 x、y 和 z。

有一些观察结果

肯定x将包含“a”，z将包含“b”。
现在我们必须清楚y：
- case a：y 只能包含“a”
- case b：y 只能包含“b”
- case c：y 可能包含“a”和“b”的组合

因此阶段 y 的有限自动机状态应该能够接受输入“a”和“b”，并且不应该接受更多不可数的a和b。

如果阶段 y 仅采用一个“a”和一个“b”，则需要两个状态
如果阶段y仅采用两个“a”和一个“b”，则需要三个状态（不带外循环）
等等......

所以舞台y的设计纯粹是无限的。我们只能通过放置一些循环来使其有限，如果我们放置循环，有限自动机可以接受超出 L = aⁿb^{n< 的语言/sup>}。因此对于这种语言我们无法构造有限自动机。因此它不是有规律的。

回复收藏 0 原文

墨离汐 2024-07-19 21:50:59

根据定义，常规语言是由有限状态自动机识别的语言。将其视为一个迷宫：状态是房间，过渡是房间之间的单向走廊，有一个初始房间和一个出口（最终）房间。正如“有限状态自动机”这个名字所说，房间的数量是有限的。每次你沿着走廊行走时，你都会记下写在走廊墙上的信。如果您能找到一条从最初的房间到最后的房间的路径，并以正确的顺序穿过标有字母的走廊，则可以识别单词。

泵送引理指出，存在一个最大长度（泵送长度），在该长度内，您可以在迷宫中漫步而无需返回到您之前走过的房间。这个想法是，由于你只能走进这么多不同的房间，经过某个点，你必须退出迷宫或穿越你的轨道。如果您设法在迷宫中行走的路径比这个泵送长度更长，那么您就走了一条弯路：您在路径中插入了一个（至少一个）可以删除的循环（如果您希望穿越迷宫识别较小的单词）或无限重复（泵送）（允许识别超长单词）。

上下文无关语言也有类似的引理。这些语言可以表示为下推自动机接受的单词，下推自动机是有限状态自动机，可以利用堆栈来决定执行哪些转换。尽管如此，由于状态数量仍然有限，因此即使通过属性的正式表达可能会稍微有点更复杂。

回复收藏 0 原文