Class hierarchy of tokens and checking their type in the parser

Posted 2024-12-03 21:50:21

I'm attempting to write a reusable parsing library (for fun).

I wrote a Lexer class which generates a sequence of Tokens. Token is the base class of a hierarchy of subclasses, each representing a different token type with its own specific properties. For example, there is a subclass LiteralNumber (deriving from Literal and, through it, from Token), which has its own methods for dealing with the numeric value of its lexeme. Methods for dealing with lexemes in general (retrieving their character string representation, position in the source, etc.) are in the base class, Token, because they are common to all token types. Users of this class hierarchy can derive their own classes for specific token types that I did not anticipate.

Now I have a Parser class which reads the stream of tokens and tries to match them against its syntax definition. For example, it has a method matchExpression, which in turn calls matchTerm, which calls matchFactor, which has to test whether the current token is a Literal or a Name (both derived from the Token base class).

The problem is:
I now need to check the type of the current token in the stream and whether it matches the syntax. If it does not, throw an EParseError exception. If it does, act accordingly: get its value in the expression, generate machine code, or do whatever the parser needs to do when the syntax matches.

But I've read a lot saying that checking the type at runtime and making decisions based on it is bad design™, and that it should be refactored into polymorphic virtual methods. Of course, I agree with that.

So my first attempt was to put a virtual type method in the Token base class, which would be overridden by the derived classes and return an enum value with the type id.

But I already see a shortcoming of this approach: users deriving their own token classes from Token won't be able to add ids to the enum, which is in the library source! :-/ And the goal was to let them extend the hierarchy with new token types when they need to.

I could also return a string from the type method, which would make it easy to define new types.

But still, in both cases, the information about base types is lost (only the leaf type is returned from the type method), and the Parser class wouldn't be able to detect a Literal-derived type when someone derives from Literal and overrides type to return something other than "Literal".

And of course the Parser class, which is also meant to be extended by users (that is, to write their own parsers recognizing their own tokens and syntax), doesn't know what descendants of the Token class will exist in the future.

Many FAQs and books on design recommend, in this scenario, taking the behavior out of the code that needs to decide by type and putting it into a virtual method overridden in the derived classes. But I cannot imagine how I could put this behavior into the Token descendants, because it's not their business to, for example, generate machine code or evaluate expressions. Moreover, there are parts of the syntax which need to match more than one token, so there is no one particular token which I could put that behavior into. It's rather the responsibility of particular syntax rules, which could match more than one token as their terminal symbols.

Any ideas on how to improve this design?

生生漫 2024-12-10 21:50:21

RTTI is well supported by all major C++ compilers. This includes at least GCC, Intel and MSVC. The portability concerns are really a thing of the past.

If it is the syntax you don't like, then here is a nice way to pretty up RTTI:

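class Base {
public:
  // A virtual destructor makes Base polymorphic, which dynamic_cast requires.
  virtual ~Base() = default;

  // Shared virtual functions
  // ...

  // Returns this object as a T*, or nullptr if it is not a T.
  template <typename T>
  T *instance() { return dynamic_cast<T *>(this); }
};

class Derived : public Base {
  // ...
};

// Somewhere in your code (f() stands for whatever produces the object)
Base *x = f();

if (x->instance<Derived>()) {
  // Do something
}

// or
Derived *d = x->instance<Derived>();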
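
To connect this to the question's parser, here is a minimal sketch of how matchFactor might use such a helper. Only the class and method names come from the question; the class bodies, the const-qualified instance(), the shape of EParseError, and the string return value are assumptions made purely for illustration.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical token hierarchy modelled on the question's description.
class Token {
public:
  explicit Token(std::string lexeme) : lexeme_(std::move(lexeme)) {}
  virtual ~Token() = default;

  const std::string &lexeme() const { return lexeme_; }

  // Same idea as Base::instance(), const-qualified so it works on const tokens.
  template <typename T>
  const T *instance() const { return dynamic_cast<const T *>(this); }

private:
  std::string lexeme_;
};

class Literal : public Token {
public:
  explicit Literal(std::string lexeme) : Token(std::move(lexeme)) {}
};

class Name : public Token {
public:
  explicit Name(std::string lexeme) : Token(std::move(lexeme)) {}
};

class LiteralNumber : public Literal {
public:
  explicit LiteralNumber(std::string lexeme) : Literal(std::move(lexeme)) {}
  double value() const { return std::stod(lexeme()); }
};

// Assumed shape of the question's EParseError.
class EParseError : public std::runtime_error {
public:
  explicit EParseError(const std::string &msg) : std::runtime_error(msg) {}
};

// Returns "literal" or "name"; throws EParseError for any other token.
std::string matchFactor(const Token &tok) {
  if (tok.instance<Literal>()) return "literal";
  if (tok.instance<Name>()) return "name";
  throw EParseError("unexpected token: " + tok.lexeme());
}
```

Because dynamic_cast follows the whole inheritance chain, a LiteralNumber (or any user-defined class derived from Literal) is still recognized as a Literal, which is the base-type information that the enum and string approaches in the question lose.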

A common alternative to RTTI for a parser AST, one that relies on virtual functions and does not require maintaining your own type enumeration, is the visitor pattern, but in my experience that quickly becomes a PITA. You still have to maintain the visitor class, although it can be subclassed and extended. You end up with a lot of boilerplate code, all for the sake of avoiding RTTI.
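
For comparison, a bare-bones visitor over two token types might look roughly like the sketch below. The names TokenVisitor, accept, visit, and KindRecorder are hypothetical and do not come from the question or from any library; the token classes are stripped down to the dispatch machinery only.

```cpp
#include <string>

class LiteralNumber;
class Name;

// One visit() overload per concrete token type: this is the boilerplate.
class TokenVisitor {
public:
  virtual ~TokenVisitor() = default;
  virtual void visit(const LiteralNumber &) = 0;
  virtual void visit(const Name &) = 0;
};

class Token {
public:
  virtual ~Token() = default;
  // Each concrete token forwards itself to the matching visit() overload.
  virtual void accept(TokenVisitor &v) const = 0;
};

class LiteralNumber : public Token {
public:
  void accept(TokenVisitor &v) const override { v.visit(*this); }
};

class Name : public Token {
public:
  void accept(TokenVisitor &v) const override { v.visit(*this); }
};

// A concrete visitor that records which kind of token it was given.
class KindRecorder : public TokenVisitor {
public:
  std::string kind;
  void visit(const LiteralNumber &) override { kind = "number"; }
  void visit(const Name &) override { kind = "name"; }
};
```

In this sketch, every new token type needs its own accept() override plus a new visit() overload in TokenVisitor and in each concrete visitor, which is where the boilerplate accumulates.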

Another option is just to create virtual functions for the syntactic types you are interested in, such as isNumeric(), which returns false in the Token base class but is overridden ONLY in Numeric classes to return true. If you provide default implementations for your virtual functions and let the subclasses override them only when they need to, then many of your problems will disappear.
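
A minimal sketch of that idea, using the question's class names: isNumeric() is the function named above, while isLiteral() and the user-defined HexNumber class are hypothetical additions for illustration.

```cpp
class Token {
public:
  virtual ~Token() = default;
  // Default answers live in the base class; most tokens never override them.
  virtual bool isLiteral() const { return false; }
  virtual bool isNumeric() const { return false; }
};

class Literal : public Token {
public:
  bool isLiteral() const override { return true; }
};

class LiteralNumber : public Literal {
public:
  bool isNumeric() const override { return true; }
};

// A user-defined token type: it inherits both answers without any extra code.
class HexNumber : public LiteralNumber {};
```

With this layout the parser can ask tok.isLiteral() and still get true for a class it has never seen, as long as that class derives from Literal. The trade-off is that each new syntactic category needs a new virtual function in Token, so users cannot add categories without touching the base class.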

RTTI is not as bad™ as it once was. Check the dates on the articles you are reading. One could also argue that pointers are a very bad idea but then you end up with languages like Java.
