Sources of parsers for programming languages?
I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are, using a very crude algorithm: traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited. However, there are many problems with this procedure, and even this "simple" way of detecting depth changes is not always effective.
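For concreteness, here is a minimal sketch of that heuristic (the function name and details are mine); its blind spots are exactly the problems mentioned above:

    def block_lengths(source):
        """Crude depth counter: track unquoted braces and record how many
        lines each top-level block spans.  Deliberately naive, like the
        heuristic described above: it ignores comments, escape sequences,
        heredocs, and preprocessor tricks, so it miscounts on real code."""
        depth = 0
        in_string = None              # quote character we are inside, if any
        start_line = None
        lengths = []
        for lineno, line in enumerate(source.splitlines(), 1):
            for ch in line:
                if in_string:
                    if ch == in_string:
                        in_string = None
                elif ch in ('"', "'"):
                    in_string = ch
                elif ch == '{':
                    if depth == 0:
                        start_line = lineno
                    depth += 1
                elif ch == '}':
                    depth -= 1
                    if depth == 0 and start_line is not None:
                        lengths.append(lineno - start_line + 1)
        return lengths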
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
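To illustrate the kind of parse-tree access I'm after: for Python source specifically, the standard library's ast module already exposes function and class definitions with their line spans, so the metric reduces to a short tree walk (a sketch; the file name is a placeholder, and end_lineno needs Python 3.8+):

    import ast

    def definition_lengths(source):
        """Report the line span of every function/class definition
        in a piece of Python source, via a real parse tree."""
        tree = ast.parse(source)
        spans = {}
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                spans[node.name] = node.end_lineno - node.lineno + 1
        return spans

    with open("some_module.py") as f:   # placeholder file name
        print(definition_lengths(f.read()))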
2 Answers
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets, IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to have been collected somewhere, but I've never seen one.
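For reference, here is roughly what consuming one of those ANTLR grammars looks like once it has been run through ANTLR4's Python target. The generated class names (JavaLexer, JavaParser, JavaParserListener) and the rule name methodDeclaration are placeholders that depend entirely on the grammar you pick; the runtime calls are standard ANTLR4 Python API:

    from antlr4 import FileStream, CommonTokenStream, ParseTreeWalker
    # Generated by: antlr4 -Dlanguage=Python3 <grammar files>
    # Class and rule names below are placeholders for your grammar.
    from JavaLexer import JavaLexer
    from JavaParser import JavaParser
    from JavaParserListener import JavaParserListener

    class MethodLengths(JavaParserListener):
        """Collect the line span of every method the parser finds."""
        def __init__(self):
            self.lengths = []
        def enterMethodDeclaration(self, ctx):   # one enter* hook per rule
            self.lengths.append(ctx.stop.line - ctx.start.line + 1)

    stream = FileStream("Example.java")          # placeholder input
    parser = JavaParser(CommonTokenStream(JavaLexer(stream)))
    listener = MethodLengths()
    ParseTreeWalker().walk(listener, parser.compilationUnit())
    print(listener.lengths)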
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production-quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine, but these are used daily with DMS to carry out mass-change tasks on large bodies of code.
I don't know of any other set of language definitions that is mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that is harder to get right than it looks in most cases (check out Python's, Perl's, and PHP's crazy string syntaxes). When all is said and done, even C takes a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
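As an illustration of nest-counting on top of an accurate lexer (my sketch, not DMS machinery): for Python itself, the standard tokenize module is such a lexer for free, and block depth falls out of its INDENT/DEDENT tokens, so string syntax never fools the count:

    import io
    import tokenize

    def max_nesting(source):
        """Ad hoc nest counting on top of a real lexer: tokenize already
        understands Python's string syntax (triple quotes, raw strings,
        f-strings), so text inside literals is never miscounted."""
        depth = peak = 0
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.INDENT:
                depth += 1
                peak = max(peak, depth)
            elif tok.type == tokenize.DEDENT:
                depth -= 1
        return peak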
Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually built a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.
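Consuming gcc-xml output amounts to walking one flat XML document of declarations. In the sketch below, the element and attribute names (Function, name, line) are written from memory of gcc-xml's schema, so verify them against your version's actual output:

    import xml.etree.ElementTree as ET

    # Produced by something like: gccxml input.cpp -fxml=out.xml
    # Element/attribute names here are my recollection of gcc-xml's
    # schema; check them against real output before relying on them.
    tree = ET.parse("out.xml")
    for fn in tree.getroot().iter("Function"):
        print(fn.get("name"), "declared at line", fn.get("line"))

As far as I recall, gcc-xml describes declarations rather than statement-level bodies, so it is better suited to counting and locating functions than to measuring exact block extents.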