Source code language analyzer

Posted on 2024-12-21 18:22:09

I want to detect the programming language of a piece of source code using Ruby.

For example:
(PHP)

$a = array("1","2","3");
print_r($a); 

(Ruby)

def index
end

etc.

What gem can do this?

Comments (3)

高速公鹿 2024-12-28 18:22:09

Linguist might do that for you (it's what GitHub uses to detect the primary languages in a project).

If you're looking to build your own, that would be a good place to start. Here are a few more notes on what else you might have to do in order to make one.

File extensions are a good cheat. For example:

  • .rb - almost always Ruby
  • .cpp - almost always C++
  • .h - could be C or C++

...and so on. After that, read the code line by line. Common keywords, or where those keywords appear in the code, usually tip you off quickly as to which language it is written in. Reviewing a few "getting started" tutorials for each language you want to support should give you a good summary of these markers without having to learn the languages themselves. All you really need is a handful of features unique to each language that mark a file as definitively one language or another.
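The extension-then-keywords approach described above could be sketched roughly as follows. This is a minimal illustration, not a complete detector: the EXTENSIONS and PATTERNS tables and the guess_language method are hypothetical names invented for this example, and the regexes are only a small, hand-picked sample of markers.

```ruby
# Hypothetical lookup tables for illustration; extend them as needed.
# Ambiguous extensions such as .h are deliberately left out.
EXTENSIONS = {
  ".rb"  => "Ruby",
  ".php" => "PHP",
  ".cpp" => "C++",
  ".py"  => "Python"
}.freeze

# A few marker regexes per language; each matching line adds to the score.
PATTERNS = {
  "PHP"    => [/<\?php/, /\$\w+\s*=/, /\bprint_r\s*\(/, /\barray\s*\(/],
  "Ruby"   => [/^\s*def\s+\w+/, /^\s*end\s*$/, /\bputs\b/, /\bdo\s*\|/],
  "Python" => [/^\s*def\s+\w+\(.*\):/, /^\s*import\s+\w+/, /\belif\b/],
  "C++"    => [/#include\s*</, /\bstd::/, /\bint\s+main\s*\(/]
}.freeze

def guess_language(source, filename = nil)
  # Step 1: trust an unambiguous file extension when one is available.
  if filename
    by_ext = EXTENSIONS[File.extname(filename).downcase]
    return by_ext if by_ext
  end

  # Step 2: read line by line and count marker hits per language.
  scores = Hash.new(0)
  source.each_line do |line|
    PATTERNS.each do |lang, regexes|
      scores[lang] += regexes.count { |re| line.match?(re) }
    end
  end

  # Return nil when nothing matched at all.
  best, hits = scores.max_by { |_, n| n }
  hits.to_i.zero? ? nil : best
end

puts guess_language(%($a = array("1","2","3");\nprint_r($a);\n))
puts guess_language("def index\nend\n")
```

With these tables, the two snippets from the question should come out as PHP and Ruby respectively. Hand-written markers like these are easy to fool, which is one reason to consider a trained classifier instead.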

You could also use a Bayesian learning filter (there is a module called Classifier in Ruby that appears to do this) to train a more flexible engine that learns to identify code by language on its own. Since programming languages are highly structured text, it shouldn't take much training for the classifier to become very good at identifying the language. If you wanted to go totally crazy, you could even train it to identify not only the language, but also the minimum version of the language that the code can be compiled against. For example, Java added generics at a particular point in the language's life cycle (Java 5). If you see generics used in the code, you know the source was written for at least that version of Java.
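To make the Bayesian idea concrete, here is a small hand-rolled naive Bayes classifier. It does not use the Classifier gem's API; the TinyBayes class, its tokenizer, and the tiny training samples are all assumptions made for illustration, and a real detector would need a far larger corpus per language.

```ruby
# A hand-rolled multinomial naive Bayes classifier (NOT the Classifier gem).
class TinyBayes
  def initialize
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => token => count
    @doc_counts   = Hash.new(0)                            # label => samples seen
    @vocabulary   = {}
  end

  # Split source into identifiers, numbers and single punctuation characters.
  def tokenize(text)
    text.scan(/[A-Za-z_]\w*|\d+|[^\s\w]/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |tok|
      @token_counts[label][tok] += 1
      @vocabulary[tok] = true
    end
  end

  # Return the label with the highest log-probability, or nil if untrained.
  def classify(text)
    return nil if @doc_counts.empty?

    total_docs = @doc_counts.values.sum.to_f
    vocab_size = [@vocabulary.size, 1].max
    tokens     = tokenize(text)

    scores = @doc_counts.keys.map do |label|
      counts = @token_counts[label]
      total  = counts.values.sum
      score  = Math.log(@doc_counts[label] / total_docs)
      tokens.each do |tok|
        # Add-one (Laplace) smoothing so unseen tokens do not zero the score.
        score += Math.log((counts[tok] + 1.0) / (total + vocab_size))
      end
      [label, score]
    end
    scores.max_by { |_, s| s }.first
  end
end

bayes = TinyBayes.new
bayes.train("PHP",  %(<?php $a = array("1","2","3"); print_r($a); echo $a[0]; ?>))
bayes.train("Ruby", %(def index\n  puts "hi"\nend\nclass Foo\n  def bar; end\nend))

puts bayes.classify(%($b = array("x"); print_r($b);))
puts bayes.classify("def show\nend")
```

Tokenizing punctuation separately is a deliberate choice here: characters such as $ and ; carry a lot of signal for telling languages apart, so dropping them (as a prose-oriented tokenizer might) would throw away useful evidence.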

A little more complex, but not much, are cases like .erb files. Do you call those "Embedded Ruby" or just "Ruby"? Do you count the lines of HTML vs. Ruby vs. JavaScript and label the file with the most numerous language, or do you tag the file with all the languages found? I suppose that's really more of a design decision.
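Both options for mixed files (pick the most numerous language, or tag the file with everything found) can be built on the same per-line count. The sketch below is a rough illustration under loud assumptions: the method names are hypothetical, any line containing an ERB tag is counted as Ruby, multi-line ERB tags are not tracked, and JavaScript is only recognised between script tags.

```ruby
# Classify each non-blank line of an ERB template as :ruby, :javascript or :html.
# Simplifying assumptions: a line with an ERB tag counts as Ruby, multi-line
# ERB tags are ignored, and JavaScript only appears inside <script> blocks.
def erb_line_counts(template)
  counts    = Hash.new(0)
  in_script = false

  template.each_line do |line|
    next if line.strip.empty?

    if line =~ /<%.*?%>/
      counts[:ruby] += 1
    elsif in_script && line !~ %r{</script>}i
      counts[:javascript] += 1
    else
      counts[:html] += 1
    end

    # Track whether the following lines sit inside a <script> block.
    in_script = true  if line =~ /<script\b/i
    in_script = false if line =~ %r{</script>}i
  end

  counts
end

# Option 1: label the file with the most numerous language.
def dominant_language(counts)
  counts.max_by { |_, n| n }&.first
end

# Option 2: tag the file with every language found.
def all_languages(counts)
  counts.keys
end
```

Which of the two helper methods to call is exactly the design decision mentioned above; the counting step stays the same either way.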

雨的味道风的声音 2024-12-28 18:22:09

Source classifier is a gem that should work for what you want to do. It identifies programming languages using a Bayesian classifier trained on a corpus generated from the "Computer Language Benchmarks Game" (http://shootout.alioth.debian.org/). It is written in Ruby and available as a gem. Out of the box, SourceClassifier recognises C, Java, JavaScript, Perl, Python and Ruby. A nice advantage of using a Bayesian classifier to identify the source code is that even false matches will still give some usable highlighting. To train the classifier to identify new languages, download the sources from GitHub.

━╋う一瞬間旳綻放 2024-12-28 18:22:09

The only thing I can think of is https://github.com/github/linguist. It's a wonderful gem, but I don't think it's exactly what you need.
