DFA-based regular expression matching - how do I get all matches?

Posted 2024-07-29 06:38:54


I have a given DFA that represents a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leftmost longest match.

For example:

regex: a*ba|baa

input: aaaaabaaababbabbbaa

result:

  1. aaaaaba
  2. aaba
  3. ba
  4. baa


Comments (1)

半枫 2024-08-05 06:38:54


Assumptions

Based on your question and later comments, you want a general method for splitting a sentence into non-overlapping, matching substrings, with the non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. I also assume you already have an algorithm to transform a regular expression into DFA form, and I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to a DFA, since I'm not aware of any other way of accomplishing this.

Before you go chasing after shadows, make sure you're trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be encoded, at least in part, into the regular expression itself, then you can't (fully) benefit from the advantages of regular expression algorithms.

An obvious example is that no amount of cleverness will allow an FSM, or any other algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... (where the ellipsis is the portion of the input not yet scanned, because the user has not typed it yet) cannot give you the correct subgroups, because until more input arrives there is no way to know whether the a's belong to (a*b) or to (a).

For a more detailed discussion of regular expression implementations, and specifically Thompson NFAs, please check this link, which describes a simple C implementation with some clever optimizations.

Limitations of Regular Languages

The O(n) time and O(1) space guarantees of regular expression algorithms are a fairly narrow claim. Specifically, the regular languages are exactly the set of languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than the regular ones. On top of that, if you can show that some enhancement requires more than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are careful to keep our algorithm within these narrow constraints.

Obviously that eliminates anything we might want to do with recursive backtracking: a stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be; a long enough sentence would overflow any integer pointer. Nor can we get around this by creating new states for the automaton as we go. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary with the specific language we want to match, but with no other variable.

This still allows some room for adding additional behavior. The usual way of getting more mileage is to add extra annotations recording where certain events occurred during processing, such as when a subexpression started or stopped matching. Since we are only allowed constant-space processing, this limits the number of subexpression matches we can track, which usually means only the latest instance of each subexpression. This is why, when you ask for the subgroup matched by (a|)*, you always get an empty string: any sequence of a's is implicitly followed by infinitely many empty strings.

The other common enhancement is to do some clever thing between states. For example, in Perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or vice versa. Many simple assertions fit this pattern, including the common line anchor operators ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.

When discussing the differences between various regular language recognizers, it's worth clarifying whether we're talking about match recognition or search recognition: the former accepts only if the entire sentence is in the language, while the latter accepts if any substring of the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted by the match method.

This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does: either token will match the right side of the disjunction. Under match semantics, though, it does not, because you could never produce that sentence by stepping through the expression and generating tokens from the transitions, at least not in a single pass. For this reason, I'm only going to talk about match semantics. Obviously, if you want search semantics, you can modify the expression with .*'s.

Note: a language defined by the expression E|.* is not really a very manageable language, regardless of the sublanguage of E, because it matches all possible sentences. This represents a real challenge for regular expression recognizers, because they are really only suited to recognizing a language, or confirming that a sentence is not in that same language, rather than doing any more specific work.

Implementation of Regular Language Recognizers

There are generally three ways to process a regular expression. All three start the same way, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extremely simple. Here's some crude ASCII art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expressions. Epsilon (ε) is a state with inputs and outputs, but it ignores the stream of characters and doesn't consume any input.

a     ::=  > a ->

E1 E2 ::=   >- E1 ->- E2 ->

               /---->
E1*   ::= > --ε <-\
               \  /
                E1

             /-E1 ->
E1|E2 ::= > ε
             \-E2 ->

And that's it! Common uses such as E+, E?, and [abc] are equivalent to EE*, (E|), and (a|b|c) respectively. Also note that each production rule adds a very small number of new states; in fact, each rule adds zero or one state (in this presentation). Characters, quantifiers, and disjunction each add just one state, and concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to the start pointers of other states or fragments.
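
To make that concrete, here is a minimal sketch of the construction in Python, under a couple of assumptions: the expression has already been converted to postfix form with an explicit '.' operator marking concatenation (so a literal dot isn't representable in this toy), and State and post2nfa are names invented for this illustration, not any library's API.

    class State:
        def __init__(self, char=None):
            self.char = char   # literal to match; None marks an epsilon/split state
            self.out = []      # successor states, patched in as fragments combine

    def post2nfa(postfix):
        """Build an NFA from a postfix regular expression; returns (start, accept)."""
        stack = []             # each entry is (start state, list of dangling states)
        for c in postfix:
            if c == '.':       # concatenation: patch e1's loose ends into e2's start
                s2, d2 = stack.pop()
                s1, d1 = stack.pop()
                for s in d1:
                    s.out.append(s2)
                stack.append((s1, d2))
            elif c == '|':     # alternation: one new epsilon (split) state
                s2, d2 = stack.pop()
                s1, d1 = stack.pop()
                split = State()
                split.out = [s1, s2]
                stack.append((split, d1 + d2))
            elif c == '*':     # Kleene star: a split that loops back through e1
                s1, d1 = stack.pop()
                split = State()
                split.out = [s1]
                for s in d1:
                    s.out.append(split)
                stack.append((split, [split]))
            else:              # literal character: exactly one new state
                s = State(c)
                stack.append((s, [s]))
        start, dangling = stack.pop()
        accept = State()       # distinguished accepting state
        for s in dangling:
            s.out.append(accept)
        return start, accept

Counting the allocations in each branch confirms the claim above: literals, '|', and '*' each allocate one State, and concatenation allocates none.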

The epsilon transition states are important, because they are ambiguous. When one is encountered, is the machine supposed to change state to one following state or the other? Should it change state at all, or stay put? That's the reason these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever one allows it to match best. Thus the tricky part is figuring out how to do that.

There are fundamentally two ways of doing this. The first way is to try each one: follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, which appears in a few (notable) implementations. For well-crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, however, recursive backtracking is very, very bad: O(2^n).

The other way is to try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, the same state can come up more than once, but you only need to track it once: either you are in that state or you are not. If you reach a point where a particular state has no transition to follow, just ignore it; that path didn't match. If there are no more states, then the entire expression didn't match. As soon as any state reaches the final state, you are done.
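
Here is a rough sketch of that lock-step simulation, reusing the hypothetical State/post2nfa representation from the sketch above. Epsilon states are exactly those with char set to None.

    def epsilon_closure(states):
        """Expand a set of states through all epsilon (char is None) transitions."""
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            if s.char is None:         # epsilon: follow without consuming input
                for nxt in s.out:
                    if nxt not in closure:
                        closure.add(nxt)
                        stack.append(nxt)
        return closure

    def nfa_match(start, accept, text):
        """True if all of `text` is in the language (match semantics)."""
        current = epsilon_closure({start})
        for c in text:
            # every state waiting on character c advances; the rest are dropped
            current = epsilon_closure({nxt for s in current if s.char == c
                                       for nxt in s.out})
            if not current:            # no live states left: the match has failed
                return False
        return accept in current

    start, accept = post2nfa('a*b.a.ba.a.|')   # postfix form of a*ba|baa
    print(nfa_match(start, accept, 'aaba'))    # True
    print(nfa_match(start, accept, 'baa'))     # True
    print(nfa_match(start, accept, 'ab'))      # False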

Just from that explanation, the amount of work we have to do has gone up a little. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. The amount of storage we need has also gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.

However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m, to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that, too, is m.

This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.

This NFA, simultaneously in some subset of its m states, may be considered some other state machine whose state represents the set of states the NFA it models could be in. Each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.

Using this alternate representation has the advantage that instead of updating m state pointers, you only have to update one. It also has a downside: since it models the power set of m states, it actually has up to 2^m states. That is an upper limit, because you don't model states that cannot happen. For instance, the expression a|b has two possible states after reading the first character: either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFAs actually end up smaller than the NFAs they represent, but there is simply no way to guarantee that.

To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
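
Here is a sketch of that lazy subset construction, built on the epsilon_closure helper above. The cache maps (DFA state, character) pairs to successor DFA states; the 1000-entry limit is an arbitrary illustrative constant, and least-recently-used eviction is just one plausible policy.

    from collections import OrderedDict

    def dfa_match(start, accept, text, cache_limit=1000):
        """Match by generating DFA states (frozensets of NFA states) on demand."""
        cache = OrderedDict()              # (dfa_state, char) -> next dfa_state

        def step(dfa_state, c):
            key = (dfa_state, c)
            if key in cache:
                cache.move_to_end(key)     # mark as recently used
                return cache[key]
            nxt = frozenset(epsilon_closure({n for s in dfa_state if s.char == c
                                             for n in s.out}))
            if len(cache) >= cache_limit:  # too many: discard the stalest entry
                cache.popitem(last=False)
            cache[key] = nxt
            return nxt

        current = frozenset(epsilon_closure({start}))
        for c in text:
            current = step(current, c)
            if not current:
                return False
        return accept in current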

From Theory to Practice

Many practical uses of regular expressions involve tracking the position in the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64-bit pointer, the input could possibly be 2^64 + 1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, so your algorithm now requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably wouldn't notice the failure; you would terminate it long before then.

Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two state pointers with disagreeing annotations collapse to the same state, the annotation for the later input position again wins.
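
Here is a sketch of that annotation scheme, assuming each capturing brace compiles to an epsilon State carrying a hypothetical tag attribute (an extension of the toy State class above). Each live thread is a (state, captures) pair; crossing a tagged epsilon records the current offset, and a later arrival at an already-occupied state overwrites the older annotation, as described.

    def add_thread(threads, state, caps, pos, visiting=None):
        """Park `state` in `threads`, expanding epsilons and recording tag offsets."""
        visiting = visiting if visiting is not None else set()
        if state in visiting:              # guard against epsilon cycles like (a|)*
            return
        visiting.add(state)
        if getattr(state, 'tag', None) is not None:
            caps = dict(caps)              # copy, then note where we crossed the brace
            caps[state.tag] = pos
        if state.char is None and state.out:       # pure epsilon: keep expanding
            for nxt in state.out:
                add_thread(threads, nxt, caps, pos, visiting)
        else:
            threads[state] = caps          # literal or accept state: park the thread;
                                           # later arrivals overwrite (later wins)

    def tagged_match(start, accept, text):
        """Capture annotations if all of `text` matches, else None."""
        threads = {}
        add_thread(threads, start, {}, 0)
        for pos, c in enumerate(text):
            nxt = {}
            for state, caps in threads.items():
                if state.char == c:
                    for succ in state.out:
                        add_thread(nxt, succ, caps, pos + 1)
            threads = nxt
            if not threads:
                return None
        return threads.get(accept)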

If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively try less, or try as little as it can and recursively try more, when its first attempt fails. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/non-greedy hinting. That information would be used to determine whether newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.

Another kind of practical enhancement is assertions: productions which do not consume input, but match or reject based on some aspect of the input position. For instance, in Perl regex, \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character but the next character must not be, or vice versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, the state pointer continues; otherwise it is discarded.
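
For illustration, a word-boundary predicate in the spirit of \b might look like the following; pred is a hypothetical attribute on an epsilon State, and is_word is a simplified stand-in for the real word-character class.

    def is_word(ch):
        """Rough equivalent of a word character (a simplification)."""
        return ch.isalnum() or ch == '_'

    def word_boundary(text, pos):
        """True when exactly one neighbor of position `pos` is a word character."""
        before = pos > 0 and is_word(text[pos - 1])
        after = pos < len(text) and is_word(text[pos])
        return before != after

    # During epsilon expansion, an asserting state passes its pointer on only
    # when the predicate holds at the current offset, e.g.:
    #
    #   if getattr(state, 'pred', None) is not None and not state.pred(text, pos):
    #       return   # assertion failed: this state pointer is discarded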

Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes, and the state pointer flows into r2; otherwise it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
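
One way to sketch the lookbehind machinery: run the .*r1 fragment over the same input and record, at each offset, whether it is in an accepting state. Building the .*r1 fragment itself needs a wildcard state the toy construction above doesn't include, so take assert_start and assert_accept as given. For clarity this precomputes the flags in one pass; a real simulator would advance both fragments together and query the flag on demand.

    def lookbehind_flags(assert_start, assert_accept, text):
        """flags[pos]: does the .*r1 fragment accept after the first pos characters?
        The main simulation consults flags[pos] when a state pointer reaches the
        lookbehind epsilon at that offset."""
        current = epsilon_closure({assert_start})
        flags = [assert_accept in current]     # offset 0, before any input
        for c in text:
            current = epsilon_closure({n for s in current if s.char == c
                                       for n in s.out})
            flags.append(assert_accept in current)
        return flags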

Lookahead also works by using an extra regex fragment for the assertion, but it is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered yet (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check that the state pointer it depends upon is still matching; if it isn't, then the dependent pointer fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpression's state pointers than you do for the main part: if two state pointers in the assertion land on the same state, the state pointers that depend on each of them will share the same fate, and can be reannotated to point to the single pointer you keep.

While we're adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction that makes the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it wraps up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input; but if the user types something invalid, the simulator's state set is empty, and it can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every submatch in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and thus no submatch for the corresponding b.
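
Python generators make that pause instruction nearly free to prototype. The coroutine below (names again from the toy sketches above) reports after every character whether the match is complete, still viable, or dead, which is just what an interactive caller needs.

    def interactive_matcher(start, accept):
        """Coroutine: send() characters one at a time and read back a status."""
        current = epsilon_closure({start})
        while True:
            status = ('match' if accept in current else
                      'dead' if not current else
                      'partial')
            c = yield status               # pause here; caller inspects and resumes
            current = epsilon_closure({n for s in current if s.char == c
                                       for n in s.out})

    m = interactive_matcher(start, accept)  # the a*ba|baa machine from earlier
    next(m)                                 # prime the coroutine
    print(m.send('b'))                      # 'partial': could still become ba or baa
    print(m.send('a'))                      # 'match': ba is already accepted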

Finally, there might be a way to modify this to work with backreferences. Even if it were elegant, it would surely be inefficient: matching regular expressions extended with backreferences is NP-complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.
