How do I parse actual code, the way Stack Overflow, IntelliSense, etc. do?

Posted 2024-09-15 08:45:31, word count 428, 10 views, 0 comments

I was wondering how Stack Overflow parses all sorts of different code and identifies keywords, special characters, whitespace formatting, etc. I believe it does this for most code, and I've noticed it's even sophisticated enough to understand how the things it parses relate to each other, like so:

String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";

Many IDEs do this too. How is it done?

Edit: Further explanation: I am not asking about the parsing of the text itself. My question is, once I am past that part, is there something like a universal XML schema or cross-language format hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, etc.? Or must I become a syntax guru for every language I wish to parse accurately?

反目相谮 2024-09-22 08:45:31

To really have your IDE/compiler/interpreter "understand" and colorize code, you'll need to parse it and pull out the different syntactic parts. The classic reference for this is the Dragon Book, "Compilers: Principles, Techniques, and Tools." You can see some of the difficulty in constructs like these:

i+++++i;

list<list<hash<list<int>,hash<int,<list>>>>>;
//or just matching parens
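To make the first construct concrete: lexers for C-like languages usually follow the "maximal munch" rule, always taking the longest operator that matches at the current position. The sketch below is only an illustration with a deliberately tiny token set, and `tokenize` is a hypothetical helper, not part of any real tool:

```python
import re

# "++" is listed before "+" so the alternation prefers the longer operator.
TOKEN_RE = re.compile(r"\s*(\+\+|\+|;|[A-Za-z_]\w*)")

def tokenize(source):
    """Split source into identifiers, '+', '++' and ';' using maximal munch."""
    source = source.rstrip()  # avoid a spurious error on trailing whitespace
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

print(tokenize("i+++++i;"))  # ['i', '++', '++', '+', 'i', ';']
```

The lexer has no way of knowing that `i ++ + ++ i` would have been the grouping that makes sense; it commits to `i ++ ++ + i`, which a C compiler then rejects. Telling the two apart takes grammar and semantic knowledge, not just token patterns.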

Doing this properly is a hard problem. Some languages, like Java, make it easier than others, such as C and C++ (which both have standards) or Ruby (which has historically relied on its reference implementation rather than a formal spec). However, if you only want to do a bit of highlighting, you can skip large parts of the grammar and get an 80% solution much more easily. I suspect that the SO engine knows about strings and a few different types of comments, and that this does well enough for its purpose.
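As a rough illustration of that 80% approach (a guess at the general idea, not how Stack Overflow actually does it), a scanner that knows only about double-quoted strings and `//` line comments already handles the three lines from the question. The `classify` function and its category names are made up for this sketch:

```python
import re

# Both patterns are tried at each position and the scan moves left to right,
# so whichever construct starts first wins: "//" inside quotes stays part of
# the string, and quotes after "//" stay part of the comment.
SPAN_RE = re.compile(
    r'(?P<string>"(?:\\.|[^"\\\n])*")'  # double-quoted string with escapes
    r'|(?P<comment>//[^\n]*)'           # line comment up to end of line
)

def classify(source):
    """Return (kind, text) pairs that together cover the whole input."""
    spans = []
    pos = 0
    for match in SPAN_RE.finditer(source):
        if match.start() > pos:
            spans.append(("code", source[pos:match.start()]))
        spans.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(source):
        spans.append(("code", source[pos:]))
    return spans

for line in [
    'String mystring1 = "inquotes"; //incomment',
    'String mystring2 = "inquotes//incomment";',
    'String mystring3 = //incomment"inquotes";',
]:
    print(classify(line))
```

Keywords could be added the same way with one more alternative in the pattern. What this cannot do is anything that depends on real structure, such as block comments that nest, templates, or knowing what a name refers to.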

The gap between 80% and 100% is one reason most IDEs offer syntax highlighting for C++ while C++ refactoring support has been far slower to arrive (Visual C++ went without it for years). For highlighting, a few mistakes are probably OK. When you're refactoring, you need to really understand variable scope in different namespaces and all sorts of pointer stuff too.
