通俗地说,泵引理是什么?
我看到这个问题 ,并且很好奇泵引理是什么(维基百科没有没有多大帮助)。
我知道这基本上是一个理论证明,为了使一种语言属于某个类别,它必须是正确的,但除此之外我真的不明白。
有人愿意尝试以非数学家/计算机科学博士可以理解的方式在相当细粒度的水平上解释它吗?
I saw this question, and was curious as to what the pumping lemma was (Wikipedia didn't help much).
I understand that it's basically a theoretical proof that must be true in order for a language to be in a certain class, but beyond that I don't really get it.
Anyone care to try to explain it at a fairly granular level in a way understandable by non mathematicians/comp sci doctorates?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
泵引理是一个简单的证明,表明一种语言是不规则的,这意味着无法为其构建有限状态机。 典型的例子是语言
(a^n)(b^n)
。 这是一种简单语言,由任意数量的a
组成,后跟相同数量的b
。 所以字符串等是语言中的,但
等不是。
为这些示例构建 FSM 非常简单:
这将一直工作到 n=4。 问题是我们的语言没有对 n 施加任何约束,并且有限状态机必须是有限的。 无论我向这台机器添加多少个状态,有人可以给我一个输入,其中 n 等于状态数加一,我的机器就会失败。 因此,如果可以构建一台机器来读取这种语言,那么其中必须有一个循环来保持状态数量有限。 添加这些循环后:
我们语言中的所有字符串都将被接受,但有一个问题。 在前四个
a
之后,机器会丢失输入了多少a
的计数,因为它保持在相同的状态。 这意味着在四个之后,我可以向字符串中添加任意数量的a
,而不添加任何b
,并且仍然获得相同的返回值。 这意味着字符串:with
(a*)
代表任意数量的a
,将被机器接受,即使它们显然不都是该语言的。 在这种情况下,我们会说字符串(a*)
的部分可以被泵送。 有限状态机是有限的并且 n 是无界的,这一事实保证了任何接受该语言中所有字符串的机器都必须具有此属性。 机器必须在某个点循环,并且在循环的点可以泵送语言。 因此,无法为该语言构建有限状态机,并且该语言不是正则的。请记住正则表达式和有限状态机是等效的,然后替换
a
和b
带有开始和结束 Html 标记,可以相互嵌入,您可以看到为什么 无法使用正则表达式解析HtmlThe pumping lemma is a simple proof to show that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language
(a^n)(b^n)
. This is the simple language which is just any number ofa
s, followed by the same number ofb
s. So the stringsetc. are in the language, but
etc. are not.
It's simple enough to build a FSM for these examples:
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
all of the strings in our language will be accepted, but there is a problem. After the first four
a
s, the machine loses count of how manya
s have been input because it stays in the same state. That means that after four, I can add as manya
s as I want to the string, without adding anyb
s, and still get the same return value. This means that the strings:with
(a*)
representing any number ofa
s, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string(a*)
can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.Remember that Regular Expressions and Finite State Machines are equivalent, then replace
a
andb
with opening and closing Html tags which can be embedded within each other, and you can see why it is not possible to use regular expressions to parse Html它是一种旨在证明给定语言不能属于某一类的装置。
让我们考虑一下平衡括号的语言(表示符号“(”和“)”,并包括所有在通常含义中平衡的字符串,并且没有不平衡的字符串)。 我们可以使用泵引理来证明这是不规则的。
(语言是一组可能的字符串。解析器是我们可以用来查看字符串是否在该语言中的某种机制,因此它必须能够区分该语言中的字符串与该语言之外的字符串之间的区别如果存在可以识别该语言的常规(或其他)解析器,区分该语言中的字符串和非该语言中的字符串,则该语言是“常规”(或“上下文无关”或“上下文敏感”或其他)。语言。)
LFSR Consulting 提供了很好的描述。 我们可以将常规语言的解析器绘制为框和箭头的有限集合,箭头代表字符,框连接它们(充当“状态”)。 (如果比这更复杂,那么它就不是一种常规语言。)如果我们能得到一个比盒子数量还要长的字符串,这意味着我们不止一次地通过一个盒子。 这意味着我们有一个循环,并且我们可以根据需要多次执行该循环。
因此,对于常规语言,如果我们可以创建一个任意长的字符串,我们可以将其分为 xyz,其中 x 是我们需要到达循环开始的字符,y 是实际的循环,z 是我们想要的任何字符。需要使字符串在循环后有效。 重要的是x和y的总长度是有限的。 毕竟,如果长度大于盒子的数量,那么我们在执行此操作时显然已经穿过了另一个盒子,因此存在循环。
因此,在我们的平衡语言中,我们可以从编写任意数量的左括号开始。 特别是,对于任何给定的解析器,我们可以编写比框更多的左括号,因此解析器无法判断有多少个左括号。 因此,x 是一定数量的左括号,并且这是固定的。 y 也是一定数量的左括号,并且可以无限增加。 我们可以说 z 是一些右括号。
这意味着我们的解析器可能会识别出一个由 43 个左括号和 43 个右括号组成的字符串,但是解析器无法从一个由 44 个左括号和 43 个右括号组成的字符串中分辨出来,而这不是我们的语言,所以解析器无法解析我们的语言。
由于任何可能的常规解析器都有固定数量的框,因此我们总是可以编写比这更多的左括号,并且通过泵引理,我们可以以解析器无法分辨的方式添加更多左括号。 因此,平衡括号语言不能被正则解析器解析,因此不是正则表达式。
It's a device intended to prove that a given language cannot be of a certain class.
Let's consider the language of balanced parentheses (meaning symbols '(' and ')', and including all strings that are balanced in the usual meaning, and none that aren't). We can use the pumping lemma to show this isn't regular.
(A language is a set of possible strings. A parser is some sort of mechanism we can use to see if a string is in the language, so it has to be able to tell the difference between a string in the language or a string outside the language. A language is "regular" (or "context-free" or "context-sensitive" or whatever) if there is a regular (or whatever) parser that can recognize it, distinguishing between strings in the language and strings not in the language.)
LFSR Consulting has provided a good description. We can draw a parser for a regular language as a finite collection of boxes and arrows, with the arrows representing characters and the boxes connecting them (acting as "states"). (If it's more complicated than that, it isn't a regular language.) If we can get a string longer than the number of boxes, it means we went through one box more than once. That means we had a loop, and we can go through the loop as many times as we want.
Therefore, for a regular language, if we can create an arbitrarily long string, we can divide it into xyz, where x is the characters we need to get to the start of the loop, y is the actual loop, and z is whatever we need to make the string valid after the loop. The important thing is that the total lengths of x and y are limited. After all, if the length is greater than the number of boxes, we've obviously gone through another box while doing this, and so there's a loop.
So, in our balanced language, we can start by writing any number of left parentheses. In particular, for any given parser, we can write more left parens than there are boxes, and so the parser can't tell how many left parens there are. Therefore, x is some amount of left parens, and this is fixed. y is also some number of left parens, and this can increase indefinitely. We can say that z is some number of right parens.
This means that we might have a string of 43 left parens and 43 right parens recognized by our parser, but the parser can't tell that from a string of 44 left parens and 43 right parens, which isn't in our language, so the parser can't parse our language.
Since any possible regular parser has a fixed number of boxes, we can always write more left parens than that, and by the pumping lemma we can then add more left parens in a way that the parser can't tell. Therefore, the balanced parenthesis language can't be parsed by a regular parser, and therefore isn't a regular expression.
用外行人的话来说,这是一件很难理解的事情,但基本上正则表达式中应该有一个非空子字符串,可以根据需要重复多次,同时整个新单词对该语言仍然有效。
在实践中,泵送引理不足以证明语言正确,而是作为一种通过矛盾进行证明并显示语言不适合语言类别(常规或上下文无关)的方法)通过显示泵引理对其不起作用。
Its a difficult thing to get in layman's terms, but basically regular expressions should have a non-empty substring within it that can be repeated as many times as you wish while the entire new word remains valid for the language.
In practice, pumping lemmas are not sufficient to PROVE a language correct, but rather as a way to do a proof by contradiction and show a language does not fit in the class of languages (Regular or Context-Free) by showing the pumping lemma does not work for it.
基本上,您有一种语言的定义(例如 XML),这是一种判断给定字符串(“单词”)是否是该语言的成员的方法。
泵引理建立了一种方法,您可以通过该方法从语言中选择一个“单词”,然后对其进行一些更改。 该定理指出,如果语言是规则的,这些变化应该产生仍然来自同一语言的“单词”。 如果你想出的单词不在该语言中,那么该语言首先就不可能是常规的。
Basically, you have a definition of a language (like XML), which is a way to tell whether a given string of characters (a "word") is a member of that language or not.
The pumping lemma establishes a method by which you can pick a "word" from the language, and then apply some changes to it. The theorem states that if the language is regular, these changes should yield a "word" that is still from the same language. If the word you come up with isn't in the language, then the language could not have been regular in the first place.
简单的泵引理适用于常规语言,即由有限自动机等描述的字符串集。 有限自动化的主要特征是它只有有限的内存,由其状态描述。
现在假设你有一个字符串,它被有限自动机识别,并且它的长度足以“超出”自动机的内存,即必须重复的状态。 然后有一个子串,其中子串开头的自动机状态与子串末尾的状态相同。 由于读取子字符串不会改变状态,因此它可能会被删除或重复任意次数,而自动机并不明智。 所以这些修改后的字符串也必须被接受。
对于上下文无关语言,还有一个更复杂的泵引理,您可以在字符串中的两个位置删除/插入直观上被视为匹配括号的内容。
The simple pumping lemma is the one for regular languages, which are the sets of strings described by finite automata, among other things. The main characteristic of a finite automation is that it only has a finite amount of memory, described by its states.
Now suppose you have a string, which is recognized by a finite automaton, and which is long enough to "exceed" the memory of the automation, i.e. in which states must repeat. Then there is a substring where the state of the automaton at the beginning of the substring is the same as the state at the end of the substring. Since reading the substring doesn't change the state it may be removed or duplicated an arbitrary number of times, without the automaton being the wiser. So these modified strings must also be accepted.
There is also a somewhat more complicated pumping lemma for context-free languages, where you can remove/insert what may intuitively be viewed as matching parentheses at two places in the string.
用外行的话来说,我认为你几乎是对的。 这是一种证明技术(实际上有两种),用于证明某种语言不属于某个类别。
例如,考虑一种包含无限数量字符串的常规语言(正则表达式、自动机等)。 在某个时刻,正如星蓝所说,你会耗尽内存,因为字符串对于自动机来说太长了。 这意味着必须有一个字符串块,自动机无法告诉您有多少个字符串副本(您处于循环中)。 因此,在字符串中间有任意数量的该子字符串的副本,您仍然可以使用该语言。
这意味着,如果您的语言不具有此属性,即有一个足够长的字符串,其中包含 NO 子字符串,您可以重复任意次并且仍然处于该语言中,那么语言不规则。
In laymans terms, I think you have it almost right. It's a proof technique (two actually) for proving that a language is NOT in a certain class.
Fer example, consider a regular language (regexp, automata, etc) with an infinite number of strings in it. At a certain point, as starblue said, you run out of memory because the string is too long for the automaton. This means that there has to be a chunk of the string that the automaton can't tell how many copies of it you have (you're in a loop). So, any number of copies of that substring in the middle of the string, and you still are in the language.
This means that if you have a language that does NOT have this property, ie, there is a sufficiently long string with NO substring that you can repeat any number of times and still be in the language, then the language isn't regular.
例如,采用此语言 L = anbn。
现在尝试将上述语言的有限自动机可视化为一些n。
如果n = 1,则字符串w = ab。 这里我们可以制作一个没有外循环的有限自动机
如果n = 2,则字符串w = a2b2。 这里我们可以制作一个没有循环的有限自动机,
如果 n = p,字符串 w = apbp。 本质上,可以假设有限自动机具有 3 个阶段。
第一阶段,需要一系列输入并进入第二阶段。 同样,从阶段 2 到阶段 3。我们将这些阶段称为 x、y 和 z。
有一些观察结果
因此阶段 y 的有限自动机状态应该能够接受输入“a”和“b”,并且不应该接受更多不可数的a和b。
等等......
所以舞台y的设计纯粹是无限的。 我们只能通过放置一些循环来使其有限,如果我们放置循环,有限自动机可以接受超出 L = anbn< 的语言/sup>。 因此对于这种语言我们无法构造有限自动机。 因此它不是有规律的。
For example, take this language L = anbn.
Now try to visualize finite automaton for the above language for some n's.
if n = 1, the string w = ab. Here we can make a finite automaton with out looping
if n = 2, the string w = a2b2. Here we can make a finite automaton with out looping
if n = p, the string w = apbp. Essentially a finite automaton can be assumed with 3 stages.
First stage, it takes a series of inputs and enter second stage. Similarly from stage 2 to stage 3. Let us call these stages as x, y and z.
There are some observations
So the finite automaton states for stage y should be able to take inputs 'a' and 'b' and also it should not take more a's and b's which cannot be countable.
and so on....
So the design of stage y is purely infinite. We can only make it finite by putting some loops and if we put loops, the finite automaton can accept languages beyond L = anbn. So for this language we can't construct a finite automaton. Hence it is not regular.
根据定义,常规语言是由有限状态自动机识别的语言。 将其视为一个迷宫:状态是房间,过渡是房间之间的单向走廊,有一个初始房间和一个出口(最终)房间。 正如“有限状态自动机”这个名字所说,房间的数量是有限的。 每次你沿着走廊行走时,你都会记下写在走廊墙上的信。 如果您能找到一条从最初的房间到最后的房间的路径,并以正确的顺序穿过标有字母的走廊,则可以识别单词。
泵送引理指出,存在一个最大长度(泵送长度),在该长度内,您可以在迷宫中漫步而无需返回到您之前走过的房间。 这个想法是,由于你只能走进这么多不同的房间,经过某个点,你必须退出迷宫或穿越你的轨道。 如果您设法在迷宫中行走的路径比这个泵送长度更长,那么您就走了一条弯路:您在路径中插入了一个(至少一个)可以删除的循环(如果您希望穿越迷宫识别较小的单词)或无限重复(泵送)(允许识别超长单词)。
上下文无关语言也有类似的引理。 这些语言可以表示为下推自动机接受的单词,下推自动机是有限状态自动机,可以利用堆栈来决定执行哪些转换。 尽管如此,由于状态数量仍然有限,因此即使通过属性的正式表达可能会稍微有点更复杂。
By definition regular languages are those recognized by a finite state automaton. Think of it as a labyrinth : states are rooms, transitions are one-way corridors between rooms, there's an initial room, and an exit (final) room. As the name 'finite state automaton' says, there is a finite number of rooms. Each time you travel along a corridor, you jot down the letter written on its wall. A word can be recognized if you can find a path from the initial to the final room, going through corridors labelled with its letters, in the correct order.
The pumping lemma says that there is a maximum length (the pumping length) for which you can wander through the labyrinth without ever going back to a room through which you have gone before. The idea is that since there are only so many distinct rooms you can walk in, past a certain point, you have to either exit the labyrinth or cross over your tracks. If you manage to walk a longer path than this pumping length in the labyrinth, then you are taking a detour : you are inserting a(t least one) cycle in your path that could be removed (if you want your crossing of the labyrinth to recognize a smaller word) or repeated (pumped) indefinitely (allowing to recognize a super-long word).
There is a similar lemma for context-free languages. Those languages can be represented as word accepted by pushdown automata, which are finite state automata that can make use of a stack to decide which transitions to perform. Nonetheless, since there is stilla finite number of states, the intuition explained above carries over, even through the formal expression of the property may be slightly more complex.
这不是一个解释,但很简单。
对于 a^nb^n,我们的 FSM 应该以这样的方式构建:b 必须知道已经解析的 a 的数量,并且接受相同的 n 个 b。 FSM 不能简单地做这样的事情。
This is not an explanation as such but it is simple.
For a^n b^n our FSM should be built in such a way that b must know the number of a's already parsed and will accept the same n number of b's. A FSM can not simply do stuff like that.