How do tools such as Stack Overflow or IntelliSense parse real code?

Posted on 2024-09-15 08:45:31

I was wondering how Stack Overflow parses all sorts of different code and identifies keywords, special characters, whitespace formatting, etc. It does this for most code, I believe, and I've noticed it's even sophisticated enough to understand the relationships between everything it parses, like so:

String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";

Many IDEs do this also. How is this done?

Edit: Further explanation - I am not asking about the parsing of the text. My question is, once I am past that part, is there something like a universal XML schema or cross-code format hierarchy that describes which strings are keywords, which characters denote comments, text strings, logic operators, etc.? Or must I become a syntax guru for any language I wish to parse accurately?

Comments (2)

反目相谮 2024-09-22 08:45:31

To really have your IDE/compiler/interpreter "understand" and colorize code, you'll need to parse it and pull out the different syntactical parts. The classic reference for this is the Dragon Book, "Compilers: Principles, Techniques, and Tools." You can see some of the difficulty in constructs like this:

i+++++i; 

or

list<list<hash<list<int>,hash<int,<list>>>>>;
//or just matching parens 
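
To illustrate why i+++++i; is awkward, here is a minimal sketch of a "maximal munch" scanner, the greedy rule C-family lexers use. The OPERATORS table and the tokenize function are made up for this illustration and cover only the characters in the example. Because the scanner always takes the longest operator it can, it produces i ++ ++ + i rather than the reading a human might expect (i++ + ++i), and it is the later parsing stage that has to reject the result.

```python
# Operators of a hypothetical C-like language, listed longest first so the
# scanner always prefers the longest match ("maximal munch").
OPERATORS = ["++", "+", ";"]


def tokenize(source):
    """Split source into identifiers and operators using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1
        elif ch.isalpha() or ch == "_":
            # Consume a whole identifier.
            start = pos
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            tokens.append(source[start:pos])
        else:
            # Try the operators in order; the first hit is the longest one.
            for op in OPERATORS:
                if source.startswith(op, pos):
                    tokens.append(op)
                    pos += len(op)
                    break
            else:
                raise ValueError(f"unexpected character {ch!r} at {pos}")
    return tokens


print(tokenize("i+++++i;"))  # ['i', '++', '++', '+', 'i', ';']
```

The nested-template example has the same flavor: a greedy scanner that knows a ">>" shift operator will grab it in the middle of closing angle brackets unless the parser gives it extra context.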

Properly doing this is a hard problem. Some languages, like Java, make this easier than others, such as C and C++ (which both have standards) or Ruby (which doesn't even have a spec and relies on the implementation as a spec). However, if you only want to do a few bits of highlighting, you can skip large parts of the grammar and get an 80% solution more easily. I suspect that the SO engine knows about strings and a few different types of comments, and this does well enough for their purpose.
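
As a rough idea of what such an 80% solution could look like, here is a small sketch that only knows about double-quoted strings and // line comments. It is not how the SO engine actually works (that is only a guess above); the classify function and its 'code'/'string'/'comment' labels are invented for the example, and block comments, character literals and multi-line strings are deliberately ignored.

```python
def classify(line):
    """Split one line into (kind, text) pieces: 'code', 'string' or 'comment'."""
    pieces = []
    pos = 0
    code_start = 0  # start of the plain-code run not yet emitted
    while pos < len(line):
        if line.startswith("//", pos):
            # Outside a string, // turns the rest of the line into a comment.
            if code_start < pos:
                pieces.append(("code", line[code_start:pos]))
            pieces.append(("comment", line[pos:]))
            return pieces
        if line[pos] == '"':
            if code_start < pos:
                pieces.append(("code", line[code_start:pos]))
            # Scan to the closing quote, skipping backslash-escaped characters.
            end = pos + 1
            while end < len(line) and line[end] != '"':
                end += 2 if line[end] == "\\" else 1
            end = min(end + 1, len(line))
            pieces.append(("string", line[pos:end]))
            pos = end
            code_start = pos
        else:
            pos += 1
    if code_start < len(line):
        pieces.append(("code", line[code_start:]))
    return pieces


for sample in [
    'String mystring1 = "inquotes"; //incomment',
    'String mystring2 = "inquotes//incomment";',
    'String mystring3 = //incomment"inquotes";',
]:
    print(classify(sample))
```

The whole trick is that the scanner remembers whether it is currently inside a string, so a // inside quotes stays part of the string and a quote inside a comment stays part of the comment. That is enough to get the three lines from the question right without knowing anything else about the language.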

The difficulty between 80% and 100% is one reason that most IDEs have syntax highlighting for C++ but Visual C++ still doesn't have C++ refactoring support. For highlighting, a few mistakes are probably OK. When you're refactoring, you need to really understand variable scope in different namespaces and all sorts of pointer stuff too.

手心的海 2024-09-22 08:45:31

In order to correctly highlight a language, you have to build a parse tree. This requires first tokenizing the string, and then performing either a top-down or a bottom-up parse. Afterwards, something walks the tree and highlights the portions of the original string corresponding to nodes of a certain sort.
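
The three steps can be sketched for a toy language of names, numbers, +, - and parentheses. Everything here (the token pattern, the Parser class, the tuple-shaped tree, highlight_spans) is an assumption made for the illustration, and the parser is a top-down recursive-descent one; a real language needs a far larger grammar.

```python
import re

# Step 1 helper: one regex alternative per token kind.
TOKEN_RE = re.compile(r"\s*(?:(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+()]))")


def tokenize(text):
    """Step 1: turn the string into (kind, value, start, end) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind), m.start(kind), m.end(kind)))
        pos = m.end()
    return tokens


class Parser:
    """Step 2: top-down (recursive-descent) parse into a tuple tree."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.i += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() and self.peek()[1] in ("+", "-"):
            op = self.take()
            node = ("binary", op, node, self.term())
        return node

    def term(self):
        # term := NUMBER | NAME | '(' expr ')'
        tok = self.take()
        if tok[0] in ("number", "name"):
            return ("leaf", tok)
        if tok[1] == "(":
            node = self.expr()
            if self.take()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok[1]!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return tree


def highlight_spans(node, wanted):
    """Step 3: walk the tree, collecting (start, end) spans of leaves of one kind."""
    if node[0] == "leaf":
        kind, _, start, end = node[1]
        return [(start, end)] if kind == wanted else []
    _, _, left, right = node
    return highlight_spans(left, wanted) + highlight_spans(right, wanted)


source = "total + (12 - x)"
tree = parse(source)
print([source[a:b] for a, b in highlight_spans(tree, "name")])
```

Each token carries its offsets in the original string, which is what lets the tree walk map nodes back to the text that should be colored.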

To really understand this, you're going to have to read a book on compiler design/programming language fundamentals. The relevant topics are tokenizers, parsing, and grammars.
