Complexity of parsing C++

Out of curiosity, I was wondering what were some "theoretical" results about parsing C++.

Let n be the size of my project (in LOC, for example, but since we'll deal with big-O it's not very important).

  • Is C++ parsed in O(n)? If not, what's the complexity?
  • Is C (or Java, or any language that is simpler in the sense of its grammar) parsed in O(n)?
  • Will C++1x introduce new features that will be even harder to parse?

References would be greatly appreciated!

3 Answers

耳钉梦 2024-10-09 10:16:00

I think the term "parsing" is being interpreted by different people in different ways for the purposes of the question.

In a narrow technical sense, parsing is merely verifying that the source code matches the grammar (or perhaps even building a tree).

There's a rather widespread folk theorem that says you cannot parse C++ (in this sense) at all because you must resolve the meaning of certain symbols to parse. That folk theorem is simply wrong.

It arises from the use of "weak" (LALR or backtracking recursive descent) parsers which, if they commit to the wrong one of several possible subparses of a locally ambiguous piece of text (this SO thread discusses an example), sometimes fail completely by virtue of having made that choice. The way those who use such parsers resolve the dilemma is to collect symbol-table data as parsing occurs and mash extra checking into the parsing process to force the parser to make the right choice whenever such a choice is encountered. This works at the cost of significantly tangling name and type resolution with parsing, which makes building such parsers really hard. But, at least for legacy GCC, they used LALR, which is linear time for parsing, and I don't think it is much more expensive if you include the name/type capture that the parser does (there's more to do after parsing, because I don't think they do it all there).
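
As a concrete illustration of that kind of local ambiguity (a minimal sketch of my own, not the example from the linked SO thread), the statement x * y; admits two parses, and only the earlier declaration of x decides which one is correct:

struct A {};

int main() {
    {
        using x = A;    // here x names a type...
        x * y;          // ...so this line declares y as a pointer to A
        (void)y;        // silence any unused-variable warning
    }
    {
        int x = 3, z = 4;
        x * z;          // here x names an int object, so the same token
                        // shape is a multiplication whose result is
                        // discarded (a compiler may warn it is unused)
    }
    return 0;
}

A weak parser with bounded lookahead has to guess which case it is in before it knows what x means; a GLR parser can keep both parses alive until name lookup settles the question.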

There are at least two implementations of C++ parsers done using "pure" GLR parsing technology, which simply admits that the parse may be locally ambiguous and captures the multiple parses without comment or significant overhead. GLR parsers are linear time where there are no local ambiguities. They are more expensive in the ambiguity regions, but as a practical matter, most of the source text in a standard C++ program falls into the "linear time" part. So the effective rate is linear, even capturing the ambiguities. Both of the implemented parsers do name and type resolution after parsing and use inconsistencies to eliminate the incorrect ambiguous parses.
(The two implementations are Elsa and our (SD's) C++ Front End. I can't speak for Elsa's current capability (I don't think it has been updated in years), but ours handles all of C++11 [EDIT Jan 2015: now full C++14; EDIT Oct 2017: now full C++17], including the GCC and Microsoft variants.)

If you take the hard computer-science definition that a language is extensionally defined as an arbitrary set of strings (grammars are supposed to be succinct ways to encode that intensionally) and treat parsing as "check that the syntax of the program is correct", then for C++ you have to expand the templates to verify that each one actually expands completely. There's a Turing machine hiding in the templates, so in theory checking that a C++ program is valid is impossible (with no time limits). Real compilers (honoring the standard) place fixed constraints on how much template unfolding they'll do, and so does real memory, so in practice C++ compilers finish. But they can take arbitrarily long to "parse" a program in this sense. And I think that's the answer most people care about.
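
A minimal sketch of that Turing-machine point (the Diverge template below is made up for illustration): nothing in the grammar prevents an instantiation chain from recursing forever, so deciding whether an arbitrary metaprogram ever reaches a base case is a halting-style question, and real compilers simply cut the chain off at a fixed depth (GCC exposes the limit as -ftemplate-depth).

template<int N>
struct Diverge {
    // No specialization ever terminates this recursion, so asking for
    // Diverge<0>::value asks the compiler to run an unbounded computation.
    static const int value = Diverge<N + 1>::value;
};

int main() {
    // Uncommenting the next line is not a parse error in any useful sense;
    // a real compiler gives up when its instantiation-depth limit is hit.
    // return Diverge<0>::value;
    return 0;
}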

As a practical matter, I'd guess most templates are actually pretty simple, so C++ compilers can finish as fast as other compilers on average. Only people crazy enough to write Turing machines in templates pay a serious price. (Opinion: the price is really the conceptual cost of shoehorning complicated things onto templates, not the compiler execution cost.)

人间不值得 2024-10-09 10:16:00

Depends what you mean by "parsed", but if your parsing is supposed to include template instantiation, then not in general:

[Shortcut if you want to avoid reading the example - templates provide a rich enough computational model that instantiating them is, in general, a halting-style problem]

template<int N>
struct Triangle {
    static const int value = Triangle<N-1>::value + N;
};

template<>
struct Triangle<0> {
    static const int value = 0;
};

int main() {
    return Triangle<127>::value;
}

Obviously, in this case the compiler could theoretically spot that triangle numbers have a simple generator function and calculate the return value using that. But otherwise, instantiating Triangle<k> is going to take O(k) time, and clearly k can go up pretty quickly with the size of the program, up to the limit of the int type.
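
To make "k can go up pretty quickly" concrete, here is a self-contained variation (my own, repeating the Triangle template so it compiles on its own): one extra line of source forces more than a hundred distinct instantiations, and one further nesting level would demand thousands and run into typical instantiation-depth limits.

template<int N>
struct Triangle {
    static const int value = Triangle<N - 1>::value + N;
};

template<>
struct Triangle<0> {
    static const int value = 0;
};

int main() {
    // Triangle<5>::value is 15, Triangle<15>::value is 120, and
    // Triangle<120>::value is 7260, so this single line instantiates
    // Triangle<0> through Triangle<120>: 121 class types generated by
    // a constant amount of extra source text.
    int deep = Triangle<Triangle<Triangle<5>::value>::value>::value;
    return deep == 7260 ? 0 : 1;
}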

[End of shortcut]

Now, in order to know whether Triangle<127>::value is an object or a type, the compiler in effect must instantiate Triangle<127> (again, maybe in this case it could take a shortcut since value is defined as an object in every template specialization, but not in general). Whether a symbol represents an object or a type is relevant to the grammar of C++, so I would probably argue that "parsing" C++ does require template instantiation. Other definitions might vary, though.
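
A small sketch of that grammar point (the Widget template here is invented for the illustration, not taken from the answer): the same token sequence is either a declaration or an expression depending on which specialization the compiler ends up looking inside.

// In one specialization member names a type; in the other, an object.
template<bool B>
struct Widget;

template<>
struct Widget<true> {
    using member = int;            // member is a type here
};

template<>
struct Widget<false> {
    static const int member = 7;   // member is an object here
};

int p = 2;

int main() {
    // With Widget<true>, the next line declares a local pointer named p;
    // with Widget<false>, the very same tokens would be the expression
    // member * p using the global p.  The parser cannot commit to either
    // reading until it has looked at the chosen specialization.
    Widget<true>::member * p;
    (void)p;   // refers to the freshly declared local pointer
    return 0;
}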

Actual implementations arbitrarily cap the depth of template instantiation, making a big-O analysis irrelevant, but I ignore that since in any case actual implementations have natural resource limits, also making big-O analysis irrelevant...

I expect you can produce similarly-difficult programs in C with recursive #include, although I'm not sure whether you intend to include the preprocessor as part of the parsing step.

Aside from that, C, in common with plenty of other languages, can have O(not very much) parsing. You may need symbol lookup and so on, which as David says in his answer cannot in general have a strict O(1) worst case bound, so O(n) might be a bit optimistic. Even an assembler might look up symbols for labels. But for example dynamic languages don't necessarily even need symbol lookup for parsing, since that might be done at runtime. If you pick a language where all the parser needs to do is establish which symbols are keywords, and do some kind of bracket-matching, then the Shunting Yard algorithm is O(n), so there's hope.
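
For contrast with C++, here is a minimal sketch of the shunting-yard idea mentioned above, restricted (my simplification) to single-digit operands, four binary operators and parentheses. Every character is pushed onto and popped from the operator stack at most once, which is what makes the whole pass O(n).

#include <cctype>
#include <iostream>
#include <stack>
#include <string>

// Precedence of a binary operator; higher binds tighter, 0 means "not an operator".
static int prec(char op) {
    if (op == '+' || op == '-') return 1;
    if (op == '*' || op == '/') return 2;
    return 0;
}

// Convert an infix expression over single-digit operands to postfix
// (reverse Polish) notation.
std::string to_postfix(const std::string& infix) {
    std::string output;
    std::stack<char> ops;
    for (char c : infix) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            output += c;                 // operands go straight to the output
        } else if (c == '(') {
            ops.push(c);
        } else if (c == ')') {
            while (!ops.empty() && ops.top() != '(') {
                output += ops.top();     // pop back to the matching '('
                ops.pop();
            }
            if (!ops.empty()) ops.pop(); // discard the '(' itself
        } else if (prec(c) > 0) {
            while (!ops.empty() && prec(ops.top()) >= prec(c)) {
                output += ops.top();     // left-associative operators
                ops.pop();
            }
            ops.push(c);
        }
        // anything else (e.g. whitespace) is ignored in this sketch
    }
    while (!ops.empty()) { output += ops.top(); ops.pop(); }
    return output;
}

int main() {
    std::cout << to_postfix("3+4*(2-1)") << "\n";  // prints 3421-*+
    return 0;
}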

鹤仙姿 2024-10-09 10:16:00

Hard to tell if C++ can be "just parsed", as - contrary to most languages - it cannot be analysed syntactically without performing semantic analysis at the same time.
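
One classic concrete case of that (my example, not the answerer's): whether a<b, c> d; declares a variable d or evaluates a comma expression containing two comparisons depends entirely on how a, b, c and d were declared, which is semantic information.

#include <iostream>

// Because a is declared as a class template below, "a<b, c> d;" in main
// is a declaration of d.  Had a, b, c and d all been ints, the identical
// token sequence would have parsed as the expression (a < b), (c > d).
template<int X, int Y>
struct a {
    int value = X + Y;
};

const int b = 1;
const int c = 2;

int main() {
    a<b, c> d;                      // declares d of type a<1, 2>
    std::cout << d.value << "\n";   // prints 3
    return 0;
}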
