How do you parse actual code, the way stackoverflow/IntelliSense/etc. do?
I was wondering how Stack Overflow parses all sorts of different code and identifies keywords, special characters, whitespace formatting, etc. I believe it does this for most code, and I've noticed it's even sophisticated enough to understand the relationships between the things it parses, like so:
String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";
Many IDEs do this also. How is this done?
Edit: Further explanation: I am not asking about the parsing of the text itself. My question is, once I am past that part, is there something like a universal XML schema or cross-language format hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, etc.? Or must I become a syntax guru for every language I wish to parse accurately?
Comments (2)
To really have your IDE/compiler/interpreter "understand" and colorize code you'll need to parse it and pull out the different syntactical parts. The classic reference for this is the Dragon Book, "Compilers: Principles, Techniques, and Tools." You can see some of the difficulty in constructs like this
or
Doing this properly is a hard problem. Some languages, like Java, make it easier than others, such as C and C++ (which both have standards) or Ruby (which long had no formal spec and relied on the reference implementation as one). However, if you only want to do a bit of highlighting, you can skip large parts of the grammar and get an 80% solution much more easily. I suspect that the SO engine knows about strings and a few different kinds of comments, and that this works well enough for its purpose.
The gap between 80% and 100% is one reason that most IDEs have syntax highlighting for C++ but Visual C++ still doesn't have C++ refactoring support. For highlighting, a few mistakes are probably OK. When you're refactoring, you need to really understand variable scope across namespaces and all sorts of pointer details too.
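To make the "80% solution" concrete, here is a minimal sketch in Python of a scanner that knows only about double-quoted strings and `//` line comments. It is not how the SO engine works (that is unknown here); the names `TOKEN_RE` and `classify` are made up for illustration, and block comments, character literals, and keywords are deliberately ignored.

```python
import re

# A single alternation scanned left to right: whichever construct starts
# first in the text wins, so a quote inside a comment (or a // inside a
# string) is swallowed by the token that encloses it.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*)                # line comment: runs to end of line
  | (?P<string>"(?:\\.|[^"\\\n])*")      # double-quoted string with escapes
''', re.VERBOSE)


def classify(source):
    """Return (kind, text) pairs for every comment and string in source."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


if __name__ == "__main__":
    for line in ('String mystring1 = "inquotes"; //incomment',
                 'String mystring2 = "inquotes//incomment";',
                 'String mystring3 = //incomment"inquotes";'):
        print(classify(line))
```

Run on the three lines from the question, this reports a string followed by a comment for the first, a single string for the second, and a single comment for the third. That is enough for colouring, but the scanner has no idea what a declaration or a scope is, which is where the remaining 20% lives.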
To highlight a language correctly, you have to build a parse tree. This requires first tokenizing the string and then performing either a top-down or a bottom-up parse. Afterwards, something walks the tree and highlights the portions of the original string that correspond to nodes of a certain kind.
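The three steps can be sketched in Python for a toy expression grammar. Everything here is an assumption made for illustration: the grammar (numbers, names, `+`, `*`, parentheses, one-argument calls), the `Node` class, and the function names. The parser is a top-down (recursive descent) one; a bottom-up parser would produce the same tree by a different route.

```python
import re
from dataclasses import dataclass, field

TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+*()]))")


@dataclass
class Node:
    kind: str                      # "num", "var", "callee", "call" or "binop"
    start: int                     # span of the node in the original string
    end: int
    children: list = field(default_factory=list)


def tokenize(text):
    """Step 1: split the text into (kind, value, start, end) tokens."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind), m.start(kind), m.end(kind)))
        pos = m.end()
    return tokens


class Parser:
    """Step 2: top-down (recursive descent) parse into a tree of Nodes."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        if self.i < len(self.tokens):
            return self.tokens[self.i]
        return (None, None, -1, -1)          # sentinel for end of input

    def take(self, value=None):
        tok = self.peek()
        if tok[0] is None or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or 'a token'}")
        self.i += 1
        return tok

    def expr(self):                          # expr := term ('+' term)*
        return self.binary(self.term, "+")

    def term(self):                          # term := atom ('*' atom)*
        return self.binary(self.atom, "*")

    def binary(self, operand, op):
        left = operand()
        while self.peek()[1] == op:
            self.take(op)
            right = operand()
            left = Node("binop", left.start, right.end, [left, right])
        return left

    def atom(self):        # atom := NUM | NAME | NAME '(' expr ')' | '(' expr ')'
        kind, value, start, end = self.take()
        if kind == "num":
            return Node("num", start, end)
        if kind == "name":
            if self.peek()[1] == "(":        # a name followed by '(' is a call
                self.take("(")
                arg = self.expr()
                close = self.take(")")
                callee = Node("callee", start, end)
                return Node("call", start, close[3], [callee, arg])
            return Node("var", start, end)
        if value == "(":
            inner = self.expr()
            self.take(")")
            return inner
        raise SyntaxError(f"unexpected {value!r}")


def highlight(text):
    """Step 3: walk the tree and collect (kind, text) for each leaf node."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek()[0] is not None:
        raise SyntaxError("trailing input")
    spans = []

    def walk(node):
        if not node.children:
            spans.append((node.kind, text[node.start:node.end]))
        for child in node.children:
            walk(child)

    walk(tree)
    return spans
```

For the input `f(x) + x * 2`, `highlight` labels `f` as a callee and both occurrences of `x` as variables. The tokenizer alone sees all three as the same kind of token; the distinction comes from where each one sits in the tree, which is what lets a highlighter colour them differently.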
To really understand this, you're going to have to read a book on compiler design/programming language fundamentals. The relevant topics are tokenizers, parsing, and grammars.