正则表达式转换为状态机的简短示例?
在 Stack Overflow 播客 #36 (https://blog.stackoverflow.com/2009/01/podcast-36 /),表达了这样的观点: 一旦您了解设置状态机是多么容易,您将永远不会再尝试不恰当地使用正则表达式。
我已经做了很多搜索。 我找到了一些学术论文和其他复杂的例子,但我想找到一个简单的例子来帮助我理解这个过程。 我使用了很多正则表达式,并且我想确保我再也不会“不恰当地”使用正则表达式。
In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed:
Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.
I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
一个相当方便的方法是在任何模式上使用 python 鲜为人知的 re.DEBUG 标志来帮助查看此问题:
“literal”和“range”后面的数字指的是它们应该匹配的 ascii 字符的整数值。
A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:
The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.
当然,尽管您需要更复杂的示例才能真正理解 RE 的工作原理。 考虑下面的 RE:,
这是一个典型的标识符(必须以 alpha 开头,并且后面可以有任意数量的字母数字和取消划线字符,包括没有)。 以下伪代码显示了如何使用有限状态机来完成此操作:
现在,正如我所说,这是一个非常简单的示例。 它没有展示如何进行贪婪/非贪婪匹配、回溯、一行内(而不是整行)匹配以及状态机的其他更深奥的功能,这些功能可以通过 RE 语法轻松处理。
这就是 RE 如此强大的原因。 执行单行 RE 功能所需的实际有限状态机代码通常非常长且复杂。
您能做的最好的事情就是获取特定简单语言的一些 lex/yacc (或等效)代码的副本,并查看它生成的代码。 它并不漂亮(它不一定是因为它不应该由人类阅读,他们应该查看 lex/yacc 代码),但它可能会让您更好地了解它们是如何工作的。
Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:
which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:
Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.
That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.
The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.
即时制作您自己的!
http://osteele.com/tools/reanimator/???
这是一个非常好的组合工具,它将正则表达式可视化为 FSM。 它不支持您在现实世界的正则表达式引擎中找到的某些语法,但肯定足以准确理解正在发生的事情。
Make your own on the fly!
http://osteele.com/tools/reanimator/???
This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.
问题是“如何选择状态和转换条件?”,还是“如何在 Foo 中实现我的抽象状态机?”
如何选择状态和转换条件?
我通常使用 FSM 来解决相当简单的问题,并凭直觉选择它们。 在 我对另一个有关正则表达式的问题的回答,我只是将解析问题视为
Inside
或之一在标签对之外,并从那里写出转换(具有开始和结束状态以保持实现干净)。
如何在 Foo 中实现我的抽象状态机?
如果您的实现语言支持类似于
c
的switch
语句的结构,那么您可以打开当前状态并处理输入以查看接下来执行哪个操作和/或转换。如果没有类似于 switch 的结构,或者它们在某些方面存在缺陷,则可以使用
if
样式分支。 呃。全部写在
c
中的一个地方,我链接的示例看起来像这样:这非常混乱,所以我通常会撕掉
状态
将案件分给不同的职能部门。Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"
How do I choose the states and the transition conditions?
I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either
Inside
oroutside
a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).How do I implement my abstract state machine in Foo?
If your implementation language supports a structure like
c
'sswitch
statement, then you switch on the current state and process the input to see which action and/or transition too perform next.Without
switch
-like structures, or if they are deficient in some way, youif
style branching. Ugh.Written all in one place in
c
the example I linked would look something like this:which is pretty messy, so I usually rip the
state
cases out to separate functions.我确信有人有更好的例子,但你可以检查这个Phil Haack 的帖子,其中有一个正则表达式和状态机执行相同操作的示例(我认为之前的帖子中还包含更多正则表达式示例..)
检查“HenriFormatter”在那一页上。
I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well I think..)
Check the "HenriFormatter" on that page.
我不知道你已经读过哪些学术论文,但理解如何实现有限状态机确实并不难。 有一些有趣的数学,但实际上很难理解。 理解 FSM 最简单的方法是通过输入和输出(实际上,这包含了大部分正式定义,我不会在这里描述)。 “状态”本质上只是描述一组已经发生并且可能从某个点发生的输入和输出。
通过图表最容易理解有限状态机。 例如:
替代文本 http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3 .gif
所有这一切都表明,如果您从某个状态 q0(旁边带有“开始”符号的状态)开始,您可以转到其他状态。 每个状态都是一个圆圈。 每个箭头代表一个输入或输出(取决于您如何看待它)。 思考有限状态机的另一种方式是根据“有效”或“可接受”的输入。 某些输出字符串在某些有限状态机中是不可能的; 这将允许您“匹配”表达式。
现在假设您从 q0 开始。 现在,如果输入 0,您将进入状态 q1。 但是,如果输入 1,您将进入状态 q2。 您可以通过输入/输出箭头上方的符号看到这一点。
假设您从 q0 开始并获得输入
0, 1, 0, 1, 1, 1
这意味着您已经经历了状态(q0 没有输入,您只是从那里开始):
q0 -> q1-> q0-> q1-> q0-> q2-> q3-> q3
如果没有意义,请用手指描绘图片。 请注意,对于输入 0 和 1,q3 都会返回到自身。
这一切的另一种说法是“如果您处于状态 q0,并且您看到 0,则转到 q1,但如果您看到 1,则转到 q2”。 如果您为每个状态设定这些条件,您就几乎完成了状态机的定义。 您所要做的就是拥有一个状态变量,然后找到一种泵入输入的方法,这基本上就是其中的内容。
好吧,那么为什么这对于乔尔的声明很重要呢? 嗯,构建“一个真正的正则表达式来统治它们”可能非常困难,也很难维护修改,甚至很难让其他人回来理解。 此外,在某些情况下它更有效。
当然,状态机还有许多其他用途。 希望这能在一些小方面有所帮助。 请注意,我没有费心深入理论,但有一些关于 FSM 的有趣证明。
I don't know what academic papers you've already read but it really isn't that difficult to understand how to implement a finite state machine. There are some interesting mathematics but to idea is actually very trivial to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, that I won't describe here). A "state" is essentially just describing a set of input and outputs that have occurred and can occur from a certain point.
Finite state machines are easiest to understand via diagrams. For example:
alt text http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif
All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of an finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are NOT possible certain finite state machines; this would allow you to "match" expressions.
Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.
Let's say you start at q0 and get this input
0, 1, 0, 1, 1, 1
This means you have gone through states (no input for q0, you just start there):
q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3
Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.
Another way to say all this is "If you are in state q0 and you see a 0 go to q1 but if you see a 1 go to q2." If you make these conditions for each state you are nearly done defining your state machine. All you have to do is have a state variable and then a way to pump input in and that is basically what is there.
Ok, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult and also difficult to maintain modify or even for others to come back and understand. Also, in some cases it is more efficient.
Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.