正则表达式转换为状态机的简短示例？

发布于 2024-07-13 06:56:52 字数 305 浏览 11 评论 0原文

在 Stack Overflow 播客 #36 (https://blog.stackoverflow.com/2009/01/podcast-36 /），表达了这样的观点：一旦您了解设置状态机是多么容易，您将永远不会再尝试不恰当地使用正则表达式。

我已经做了很多搜索。我找到了一些学术论文和其他复杂的例子，但我想找到一个简单的例子来帮助我理解这个过程。我使用了很多正则表达式，并且我想确保我再也不会“不恰当地”使用正则表达式。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

笔芯 2024-07-20 06:56:52

一个相当方便的方法是在任何模式上使用 python 鲜为人知的 re.DEBUG 标志来帮助查看此问题：

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62

“literal”和“range”后面的数字指的是它们应该匹配的 ascii 字符的整数值。

A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

回复收藏 0 原文

︶￣淡然 2024-07-20 06:56:52

当然，尽管您需要更复杂的示例才能真正理解 RE 的工作原理。考虑下面的 RE:，

^[A-Za-z][A-Za-z0-9_]*$

这是一个典型的标识符（必须以 alpha 开头，并且后面可以有任意数量的字母数字和取消划线字符，包括没有）。以下伪代码显示了如何使用有限状态机来完成此操作：

state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
            if char is not in the set "A-Z" or "a-z":
                error "Invalid first character"
            state = SUBSEQUENTCHARS
            next char
    if state == SUBSEQUENTCHARS:
            if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                error "Invalid subsequent character"
            state = SUBSEQUENTCHARS
            next char

现在，正如我所说，这是一个非常简单的示例。它没有展示如何进行贪婪/非贪婪匹配、回溯、一行内（而不是整行）匹配以及状态机的其他更深奥的功能，这些功能可以通过 RE 语法轻松处理。

这就是 RE 如此强大的原因。执行单行 RE 功能所需的实际有限状态机代码通常非常长且复杂。

您能做的最好的事情就是获取特定简单语言的一些 lex/yacc （或等效）代码的副本，并查看它生成的代码。它并不漂亮（它不一定是因为它不应该由人类阅读，他们应该查看 lex/yacc 代码），但它可能会让您更好地了解它们是如何工作的。

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
            if char is not in the set "A-Z" or "a-z":
                error "Invalid first character"
            state = SUBSEQUENTCHARS
            next char
    if state == SUBSEQUENTCHARS:
            if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                error "Invalid subsequent character"
            state = SUBSEQUENTCHARS
            next char

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

回复收藏 0 原文

四叶草在未来唯美盛开 2024-07-20 06:56:52

即时制作您自己的！

http://osteele.com/tools/reanimator/???

有限状态机

这是一个非常好的组合工具，它将正则表达式可视化为 FSM。它不支持您在现实世界的正则表达式引擎中找到的某些语法，但肯定足以准确理解正在发生的事情。

回复收藏 0 原文

離人涙 2024-07-20 06:56:52

问题是“如何选择状态和转换条件？”，还是“如何在 Foo 中实现我的抽象状态机？”

如何选择状态和转换条件？

我通常使用 FSM 来解决相当简单的问题，并凭直觉选择它们。在我对另一个有关正则表达式的问题的回答，我只是将解析问题视为 Inside 或 之一在标签对之外，并从那里写出转换（具有开始和结束状态以保持实现干净）。

如何在 Foo 中实现我的抽象状态机？

如果您的实现语言支持类似于 c 的 switch 语句的结构，那么您可以打开当前状态并处理输入以查看接下来执行哪个操作和/或转换。

如果没有类似于 switch 的结构，或者它们在某些方面存在缺陷，则可以使用 if 样式分支。呃。

全部写在c中的一个地方，我链接的示例看起来像这样：

token_t token;
state_t state=BEGIN_STATE;
do {
   switch ( state.getValue() ) {
   case BEGIN_STATE;
      state=OUT_STATE;
      break;
   case OUT_STATE:
      switch ( token.getValue() ) {
         case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
         case NEWLINE_TOKEN;
            output("<break>");
            output(token.string());
            break;
         ...
      }
      break;
   ...
   }
} while (state != END_STATE);

这非常混乱，所以我通常会撕掉状态 将案件分给不同的职能部门。

Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"

How do I choose the states and the transition conditions?

I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either Inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).

How do I implement my abstract state machine in Foo?

If your implementation language supports a structure like c's switch statement, then you switch on the current state and process the input to see which action and/or transition too perform next.

Without switch-like structures, or if they are deficient in some way, you if style branching. Ugh.

Written all in one place in c the example I linked would look something like this:

token_t token;
state_t state=BEGIN_STATE;
do {
   switch ( state.getValue() ) {
   case BEGIN_STATE;
      state=OUT_STATE;
      break;
   case OUT_STATE:
      switch ( token.getValue() ) {
         case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
         case NEWLINE_TOKEN;
            output("<break>");
            output(token.string());
            break;
         ...
      }
      break;
   ...
   }
} while (state != END_STATE);

which is pretty messy, so I usually rip the state cases out to separate functions.

回复收藏 0 原文

赠我空喜 2024-07-20 06:56:52

我确信有人有更好的例子，但你可以检查这个Phil Haack 的帖子，其中有一个正则表达式和状态机执行相同操作的示例（我认为之前的帖子中还包含更多正则表达式示例..）

检查“HenriFormatter”在那一页上。

回复收藏 0 原文

满天都是小星星 2024-07-20 06:56:52

我不知道你已经读过哪些学术论文，但理解如何实现有限状态机确实并不难。有一些有趣的数学，但实际上很难理解。理解 FSM 最简单的方法是通过输入和输出（实际上，这包含了大部分正式定义，我不会在这里描述）。 “状态”本质上只是描述一组已经发生并且可能从某个点发生的输入和输出。

通过图表最容易理解有限状态机。例如：

替代文本 http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3 .gif

所有这一切都表明，如果您从某个状态 q0（旁边带有“开始”符号的状态）开始，您可以转到其他状态。每个状态都是一个圆圈。每个箭头代表一个输入或输出（取决于您如何看待它）。思考有限状态机的另一种方式是根据“有效”或“可接受”的输入。某些输出字符串在某些有限状态机中是不可能的；这将允许您“匹配”表达式。

现在假设您从 q0 开始。现在，如果输入 0，您将进入状态 q1。但是，如果输入 1，您将进入状态 q2。您可以通过输入/输出箭头上方的符号看到这一点。

假设您从 q0 开始并获得输入

0, 1, 0, 1, 1, 1

这意味着您已经经历了状态（q0 没有输入，您只是从那里开始）：

q0 -> q1-> q0-> q1-> q0-> q2-> q3-> q3

如果没有意义，请用手指描绘图片。请注意，对于输入 0 和 1，q3 都会返回到自身。

这一切的另一种说法是“如果您处于状态 q0，并且您看到 0，则转到 q1，但如果您看到 1，则转到 q2”。如果您为每个状态设定这些条件，您就几乎完成了状态机的定义。您所要做的就是拥有一个状态变量，然后找到一种泵入输入的方法，这基本上就是其中的内容。

好吧，那么为什么这对于乔尔的声明很重要呢？嗯，构建“一个真正的正则表达式来统治它们”可能非常困难，也很难维护修改，甚至很难让其他人回来理解。此外，在某些情况下它更有效。

当然，状态机还有许多其他用途。希望这能在一些小方面有所帮助。请注意，我没有费心深入理论，但有一些关于 FSM 的有趣证明。

I don't know what academic papers you've already read but it really isn't that difficult to understand how to implement a finite state machine. There are some interesting mathematics but to idea is actually very trivial to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, that I won't describe here). A "state" is essentially just describing a set of input and outputs that have occurred and can occur from a certain point.

Finite state machines are easiest to understand via diagrams. For example:

alt text http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif

All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of an finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are NOT possible certain finite state machines; this would allow you to "match" expressions.

Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.

Let's say you start at q0 and get this input

0, 1, 0, 1, 1, 1

This means you have gone through states (no input for q0, you just start there):

q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3

Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.

Another way to say all this is "If you are in state q0 and you see a 0 go to q1 but if you see a 1 go to q2." If you make these conditions for each state you are nearly done defining your state machine. All you have to do is have a state variable and then a way to pump input in and that is basically what is there.

Ok, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult and also difficult to maintain modify or even for others to come back and understand. Also, in some cases it is more efficient.

Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.

回复收藏 0 原文

~没有更多了~