DFA 与 NFA 引擎:它们的功能和限制有何区别?
我正在根据 DFA 与 NFA 引擎的功能和限制,寻找关于 DFA 与 NFA 引擎之间差异的非技术解释。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
我正在根据 DFA 与 NFA 引擎的功能和限制,寻找关于 DFA 与 NFA 引擎之间差异的非技术解释。
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
接受
或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
发布评论
评论(5)
确定性有限自动机 (DFA) 和非确定性有限自动机 (NFA) 具有完全相同的功能和限制。唯一的区别是符号方便。
有限自动机是具有状态并读取输入的处理器,每个输入字符都可能将其设置为另一个状态。例如,状态可能是“刚刚连续读取两个 C”或“正在开始一个单词”。这些通常用于快速扫描文本以查找模式,例如对源代码进行词法扫描以将其转换为标记。
确定性有限自动机一次处于一种状态,这是可以实现的。不确定性有限自动机一次可以处于多个状态:例如,在标识符可以以数字开头的语言中,可能存在一种状态“读取数字”和另一种状态“读取标识符”,并且当读取以“123”开头的内容时,NFA 可能同时处于两者中。实际应用哪种状态取决于它是否在单词结尾之前遇到了非数字的内容。
现在,我们可以将“读取数字或标识符”表示为状态本身,突然我们就不需要 NFA 了。如果我们将 NFA 中的状态组合表示为状态本身,我们就会得到一个比 NFA 具有更多状态的 DFA,但它做同样的事情。
这是一个更容易读、更容易写、更容易处理的问题。 DFA 本身更容易理解,但 NFA 通常较小。
Deterministic Finite Automatons (DFAs) and Nondeterministic Finite Automatons (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.
A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.
A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.
Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.
It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.
这是来自 Microsoft 的非技术答案:
DFA 引擎以线性时间运行,因为它们不需要回溯(因此它们永远不会两次测试相同的字符)。它们还可以保证匹配尽可能长的字符串。但是,由于 DFA 引擎仅包含有限状态,因此它无法将模式与反向引用相匹配,并且由于它不构造显式扩展,因此它无法捕获子表达式。
传统的 NFA 引擎运行所谓的“贪婪”匹配回溯算法,以特定顺序测试正则表达式的所有可能扩展并接受第一个匹配。由于传统 NFA 为成功匹配构建了正则表达式的特定扩展,因此它可以捕获子表达式匹配和匹配反向引用。然而,由于传统的 NFA 会回溯,如果状态是通过不同的路径到达的,则它可以多次访问完全相同的状态。因此,在最坏的情况下,它的运行速度可能会呈指数级缓慢。因为传统的 NFA 接受它找到的第一个匹配,所以它也可能会留下其他(可能更长)的匹配未被发现。
POSIX NFA 引擎与传统的 NFA 引擎类似,不同之处在于它们会继续回溯,直到可以保证找到最长的匹配。因此,POSIX NFA 引擎比传统 NFA 引擎慢,并且在使用 POSIX NFA 时,您不能通过更改回溯搜索的顺序来选择较短的匹配而不是较长的匹配。
传统的 NFA 引擎受到程序员的青睐,因为它们比 DFA 或 POSIX NFA 引擎更具表现力。尽管在最坏的情况下它们可能运行缓慢,但您可以使用减少歧义和限制回溯的模式引导它们在线性或多项式时间内找到匹配项。
[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]
Here's a non-technical answer from Microsoft:
DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.
Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.
[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]
一个简单的、非技术性的解释,摘自 Jeffrey Friedl 的书掌握正则表达式。
警告:
虽然这本书通常被认为是“正则表达式圣经”,但对于这里对 DFA 和 NFA 之间的区别是否正确存在一些争议。我不是计算机科学家,我不理解真正的“常规”表达式背后的大部分理论,无论是否确定。争议开始后,我因此删除了这个答案,但此后它已在其他答案的评论中引用。我很有兴趣进一步讨论这个问题——弗里德尔真的错了吗?还是我弄错了弗里德尔(但我昨天晚上重读了那一章,就像我记得的一样……)?
编辑:看来弗里德尔和我确实错了。请查看下面 Eamon 的精彩评论。
原始答案:
DFA 引擎逐个字符地遍历输入字符串,并尝试(并记住)正则表达式此时可以匹配的所有可能方式观点。如果到达字符串末尾,则声明成功。
想象一下字符串
AAB
和正则表达式A*AB
。现在我们逐个字母地浏览字符串。<代码>A:
A*
匹配。A*
(允许零次重复)并使用正则表达式中的第二个A
来进行匹配。<代码>A:
A*
来匹配。B
匹配。第二个分支失败。但是:A*
并使用第二个A
来进行匹配。<代码>B:
A*
或在正则表达式中移至下一个标记A
来匹配。第一个分支失败。DFA 引擎绝不会在字符串中回溯。
NFA 引擎逐个标记地遍历正则表达式标记,并尝试字符串上所有可能的排列,必要时回溯。如果到达正则表达式的末尾,则表明成功。
想象一下与以前相同的字符串和相同的正则表达式。现在,我们逐个标记地遍历正则表达式标记:
A*
:匹配AA
。记住回溯位置 0(字符串的开头)和 1。A
:不匹配。但我们有一个回溯位置,我们可以返回并重试。正则表达式引擎后退一个字符。现在A
匹配。B
:匹配。已到达正则表达式末尾(还有一个回溯位置可供使用)。万岁!A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.
CAVEAT:
While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?
Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.
Original answer:
A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.
Imagine the string
AAB
and the regexA*AB
. We now step through our string letter by letter.A
:A*
.A*
(zero repetitions are allowed) and using the secondA
in the regex.A
:A*
.B
. Second branch fails. But:A*
and using the secondA
instead.B
:A*
or by moving on in the regex to the next tokenA
. First branch fails.A DFA engine never backtracks in the string.
An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.
Imagine the same string and the same regex as before. We now step through our regex token by token:
A*
: MatchAA
. Remember the backtracking positions 0 (start of string) and 1.A
: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. NowA
matches.B
: Matches. End of regex reached (with one backtracking position to spare). Hooray!正如其名称所示,NFA 和 DFA 都是有限自动机。
两者都可以表示为起始状态、成功(或“接受”)状态(或成功状态集)以及列出转换的状态表。
在DFA的状态表中,每个
键将转换到一个且仅有一个state₁
。在 NFA 的状态表中,每个
将转换为一组状态。当您使用 DFA 时,将其重置为开始状态,给它一系列输入符号,您将确切地知道它处于什么结束状态以及是否是成功状态。
然而,当您采用 NFA 时,它会针对每个输入符号查找一组可能的结果状态,并(理论上)随机地、“不确定地”选择其中一个。如果存在导致该输入字符串成功状态之一的随机选择序列,则称 NFA 对于该字符串成功。换句话说,您应该假装它总是神奇地选择正确的。
计算领域的一个早期问题是,由于这种魔力,NFA 是否比 DFA 更强大,而答案是否定的,因为任何 NFA 都可以转化为等效的 DFA。 它们的功能和限制完全相同。
对于那些想知道 NFA 引擎如何真实、非神奇地为给定符号选择正确的后继状态的人,此页面描述了两种常见的方法。
Both NFAs and DFAs are finite automata, as their names say.
Both can be represented as a starting state, a success (or "accept") state (or set of success states), and a state table listing transitions.
In the state table of a DFA, each
<state₀, input>
key will transit to one and only onestate₁
.In the state table of an NFA, each
<state₀, input>
will transit to a set of states.When you take a DFA, reset it to it's start state, give it a sequence of input symbols, and you will know exactly what end state it's in and whether it's a success state or not.
When you take an NFA, however, it will, for each input symbol, look up the set of possible result states, and (in theory) randomly, "nondeterministically," select one of them. If there exists a sequence of random selections which leads to one of the success states for that input string, then the NFA is said to succeed for that string. In other words, you are expected to pretend that it magically always selects the right one.
One early question in computing was whether NFAs were more powerful than DFAs, due to that magic, and the answer turned out to be no since any NFA could be translated into an equivalent DFA. Their capabilities and limitations are exactly precisely the same as one another.
For those wondering how real, non-magical, NFA engine can "magically" select the right successor state for a given symbol, this page describes the two common approaches.
我发现 Jan Goyvaerts 在正则表达式,完整教程中给出的解释是最有用的。请参阅此 PDF 的第 7 页:
https://www.princeton.edu/~ mlovet/reference/Regular-Expressions.pdf
在第 7 页上提出的其他观点中,有两种正则表达式引擎:文本导向引擎和正则表达式导向引擎。 Jeffrey Friedl 分别将它们称为 DFA 和 NFA 引擎。 ...某些非常有用的功能,例如惰性量词和反向引用,只能在正则表达式导向的引擎中实现。
I find the explanation given in Regular Expressions, The Complete Tutorial by Jan Goyvaerts to be the most usable. See page 7 of this PDF:
https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
Among other points made on page 7, There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. ...certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines.