有限状态机解析器

发布于 2024-09-06 03:00:46 字数 1736 浏览 7 评论 0原文

我想用 C++ 中的类似于 FSM 的解析器来解析自己设计的文件格式（这是一个 teach-myself-c++-the-hard-way-by-doing-something -大而困难的项目:)）。我有一个带有换行符的标记化字符串，表示 euh... 行的结尾。请参阅此处获取输入示例。所有注释都会被过滤掉，所以我有一个像这样的 std::string：

global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...

语法解释：

{ } 是范围，大写单词表示后面是选项/文件列表。
\n 仅在选项/文件列表中很重要，表示列表的结尾。

所以我认为 FSM 足够简单/可扩展，足以满足我的需求/知识。据我所知（并希望我的文件设计如此），我不需要并发状态或类似的东西。一些设计/实现问题：

我应该为我的状态使用enum还是抽象class +导数？第一个可能更适合小型语法，但稍后可能会变得丑陋，而第二个则完全相反。我倾向于第一种，因为它简单。 enum 示例和类示例。编辑：这个建议对于goto怎么样，我认为他们在 C++ 中是邪恶的？
阅读列表时，我需要不要忽略 \n。我首选通过 stringstream 使用 string 的方式，默认情况下会忽略 \n。因此，我需要简单的方法来告诉（相同！）stringstream 在启用某种状态时不要忽略换行符。
简单的 enum 状态足以进行多级解析（作用域在 {...{...}...} 范围内）还是需要 hacky 实现？
这是我想到的草案状态：
- upper：读取全局、exe、lib+目标名称...
- 正常：在作用域内，可以读取 SOURCES...、创建用户变量...
- list：将项目添加到列表中，直到遇到换行符。

每个作用域都有一种条件（例如 win32:global { gcc:CFLAGS = ... }），并且需要在任何地方以完全相同的方式处理（即使在 list 状态下，每个物品）。

感谢您的任何意见。

原文

I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a euh... line. See here for an input example. All the comments will and junk is filtered out, so I have a std::string like this:

global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...

Syntax explanation:

{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.

So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:

Should I use an enum or an abstract class + derivatives for my states? The first is probably better for small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto, I thought they were evil in C++?
When reading a list, I need to NOT ignore \n. My preferred way of using the string via stringstream, will ignore \n by default. So I need simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
- upper: reads global, exe, lib+ target names...
- normal: inside a scope, can read SOURCES..., create user variables...
- list: adds items to a list until a newline is encountered.

Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion eveywhere (even in the list state, per item).

Thanks for any input.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

秋千易 2024-09-13 03:00:46

如果您有嵌套作用域，那么有限状态机不是正确的选择，您应该考虑上下文无关语法解析器。 LL(1) 解析器可以写成一组递归函数，或者一个 < href="http://en.wikipedia.org/wiki/LALR_parser" rel="noreferrer">LALR(1) 解析器可以使用解析器生成器（例如 Bison）编写。

如果您向 FSM 添加堆栈，那么您就进入了下推自动机领域。不确定性下推自动机相当于上下文无关语法（尽管确定性下推自动机严格来说是LALR(1) 解析器生成器实际上在内部生成确定性下推自动机。一本好的编译器设计教科书将涵盖从语法构造下推自动机的确切算法。（通过这种方式，添加堆栈并不是“hacky”。）这篇 Wikipedia 文章还描述了如何从语法构造 LR(1) 下推自动机，但在我看来，这篇文章并不那么清楚。

如果您的作用域嵌套深度有限（即您有 upper、normal 和 list 级别，但没有嵌套的 lists 或嵌套的normals），那么您可以使用没有堆栈的 FSM。

回复收藏 0 原文

天邊彩虹 2024-09-13 03:00:46

分析文本输入流进行解析有两个阶段：

词法分析：这是将输入流分解为词法单元的地方。它查看字符序列并生成标记（类似于口语或书面语言中的单词）。只要您对词法结构做出了良好的设计决策，有限状态机就非常擅长词法分析。从上面的数据来看，单个词位将是诸如关键字（例如“global”）、标识符（例如“bitwise”、“SOURCES”）、符号标记（例如“{”“}”、“.”、“/”之类的东西））、数值、转义值（例如“\n”）等。

语法/语法分析：在生成标记序列时（或者可能在您这样做时），您需要能够分析结构以确定标记的顺序是否与您的语言设计一致。为此，您通常需要某种解析器，但如果语言结构不是很复杂，您也许可以使用有限状态机来完成此操作。一般来说（并且因为您特别需要在您的情况下使用嵌套结构），您将需要使用 Ken Bloom 描述的技术之一。

因此，回答您的问题：

我应该为我的状态使用枚举还是抽象类+衍生物？

我发现对于小型分词器，状态/转换值的矩阵是合适的，例如next_state = state_transitions[current_state][current_input_char]。在这种情况下，next_state 和 current_state 是一些整数类型（可能包括枚举类型）。当您转换到无效状态时，会检测到输入错误。令牌的结束是基于有效结束状态的状态识别来识别的，在给定下一个输入字符的情况下，没有有效的转换可用于另一个状态。如果您担心空间，可以使用地图矢量。制作州课程是可能的，但我认为这可能会让事情变得比你需要的更困难。

在读取列表时，我不需要忽略 \n。

您可以创建一个名为“\n”的标记，或者创建一个更通用的转义标记（前面带有反斜杠的标识符。如果您是谈到识别源代码中的换行符，那么这些只是您需要在状态转换矩阵中创建转换的字符（但是请注意 Unix 和 Windows 换行符之间的差异；您可以创建一个在任一操作系统上运行的 FSM） .

简单的枚举状态足以进行多级解析（范围 {...{...}...} 内的范围）还是需要 hacky 实现？

这就是您需要的地方语法或下推自动机，除非您可以保证嵌套不会超过一定级别，否则它可能会使您的 FSM 非常复杂

这是我想到的草案状态：...

请参阅上面我对词汇和语法分析的评论。

There are two stages to analyzing a text input stream for parsing:

Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analagous to word in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decision about the lexical structure. From your data above, individal lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokesn (e.g. "{" "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.

Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.

So in response to your questions:

Should I use an enum or an abstract class + derivatives for my states?

I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, the next_state and current_state are some integer types (including possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of an token is identified based on the state identification of valid endstates with no valid transition available to another state given the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making thing more difficult than you need.

When reading a list, I need to NOT ignore \n.

You can either create a token called "\n", or a more generalize escape token (an identifier preceded by a backslash. If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the differnce between Unix and Windows line breaks, however; you could create a FSM that operates on either).

Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?

This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.

Here's the draft states I have in mind: ...

See my commments on lexical and grammatical analysis above.

回复收藏 0 原文