One simple solution I could think of is to identify the keywords used in the different languages. Each identified word scores +1; then calculate ratio = identified_words / total_words. The language with the highest ratio wins. Of course there are complications, such as keywords appearing inside comments, etc., but I think this is a very simple solution that should work in most cases.
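The scoring idea above might be sketched like this; the keyword sets are deliberately tiny, made-up samples, and a real implementation would need the full reserved-word lists:

```python
import re

# Hypothetical, heavily trimmed keyword sets -- real ones would be complete.
KEYWORDS = {
    "python": {"def", "elif", "import", "None", "self", "lambda"},
    "java":   {"public", "class", "void", "extends", "import", "new"},
    "c":      {"int", "char", "void", "struct", "printf", "include"},
}

def score_languages(source: str) -> dict:
    """Return identified_words / total_words for each candidate language."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    if not words:
        return {lang: 0.0 for lang in KEYWORDS}
    return {
        lang: sum(1 for w in words if w in kws) / len(words)
        for lang, kws in KEYWORDS.items()
    }

def guess_language(source: str) -> str:
    """The language with the highest keyword ratio is the winner."""
    scores = score_languages(source)
    return max(scores, key=scores.get)
```

Stripping comments and string literals before tokenizing would address the false positives mentioned above.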
If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.
If you can't use file extensions, the best way would be to find the features that differ most between languages and use those to determine the filetype. For example, for-loop syntax won't vary much between languages, but package-include statements should. If a file includes java.util.*, then you know it's a Java file.
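A sketch of that telltale-marker approach, using a few illustrative regexes (the patterns here are assumptions, not an exhaustive rule set):

```python
import re

# Hypothetical telltale patterns: constructs nearly unique to one language.
TELLTALES = {
    "java":   re.compile(r"^\s*import\s+java[a-z]*\.", re.M),
    "python": re.compile(r"^\s*(def\s+\w+\(|from\s+\w+\s+import\b)", re.M),
    "c":      re.compile(r"^\s*#\s*include\s*<\w+\.h>", re.M),
}

def detect_by_telltale(source: str):
    """Return the first language whose telltale pattern matches, else None."""
    for lang, pattern in TELLTALES.items():
        if pattern.search(source):
            return lang
    return None
```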
I'm sorry, but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over-engineer a simple problem, or put burdensome requirements on a simple task.
It sounds like you have thousands of files of source code and no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement.) I mean, one of the basics of software engineering that I can always rely on is that C++ code files have the .cpp extension, Java code files have the .java extension, C code files have the .c extension, etc. Is your company playing fast and loose with these standards? If so, I would be really worried.
As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility might be a great source of inspiration. Since it is written in C, I guess that it qualifies for C++. :) You do not get confidence percentages directly, though; maybe they are used internally?
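Shelling out to `file` is a quick way to experiment with its heuristics before porting them; a minimal sketch, which degrades gracefully when the utility is not installed:

```python
import shutil
import subprocess

def classify_with_file(path: str):
    """Ask the Unix `file` utility to describe `path`.

    Returns the description string, or None when `file` is unavailable
    or reports an error.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "-b", path],  # -b: omit the filename from the output
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None
```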
Take a look at nedit. It has a syntax-highlighting recognition system, under Syntax Highlighting->Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones. Here's a description of the highlighting system.
Since the list of languages is known upfront, you know the syntax/grammar for each of them. Hence you could, for example, write a function to extract the reserved words from the provided source code.
Build a binary tree that holds all the reserved words of all the languages you support, and then walk that tree with the reserved words extracted in the previous step.
If in the end you have only one possibility left, that is your language. If you reach the end of the program too soon, then (from where you stopped) you can analyse your position in the tree to work out which languages are still possibilities.
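The elimination step above might be sketched with sets standing in for the tree walk; the reserved-word table here is a made-up fragment, and a real one would cover each language fully:

```python
import re

# Hypothetical reserved-word table; real tables would be complete.
RESERVED = {
    "python": {"def", "elif", "None", "lambda", "import", "if", "else"},
    "java":   {"public", "final", "extends", "import", "if", "else"},
    "ruby":   {"def", "elsif", "nil", "end", "if", "else"},
}

def narrow_candidates(source: str):
    """Walk the tokens, keeping only languages that reserve each word seen."""
    candidates = set(RESERVED)
    for token in re.findall(r"[A-Za-z_]\w*", source):
        owners = {lang for lang in candidates if token in RESERVED[lang]}
        if owners:                      # ignore identifiers nobody reserves
            candidates = owners
        if len(candidates) == 1:        # unambiguous: stop early
            break
    return candidates
```

A set of size one is a definite answer; a larger set at end of input corresponds to the "analyse your position in the tree" case.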
You could try to think about the differences between languages and model them with a binary decision tree, like "is feature X found?" If yes, proceed in one direction; if not, proceed in the other.
By constructing this search tree efficiently you could end up with rather fast code.
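Such a decision tree can be hand-built as nested conditionals; the feature checks below are illustrative assumptions only:

```python
import re

def decide(source: str) -> str:
    """A tiny hand-built decision tree over example feature checks."""
    if re.search(r"^\s*#\s*include\b", source, re.M):   # feature: #include?
        if "std::" in source or "template" in source:   # feature: C++ marker?
            return "c++"
        return "c"
    if re.search(r"^\s*import\s+java", source, re.M):   # feature: Java import?
        return "java"
    if re.search(r"^\s*def\s+\w+\(", source, re.M):     # feature: def ...(
        return "python"
    return "unknown"
```

Ordering the cheapest and most discriminating checks first is what makes the tree fast.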
The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.
You have a document-classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles contain links to libraries which implement these algorithms, and many of them have C++ interfaces.
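To make the naive Bayes suggestion concrete, here is a minimal from-scratch sketch over source-code tokens; in practice an established library would be a sounder choice:

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesLanguageClassifier:
    """A minimal multinomial naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> sample count
        self.vocabulary = set()

    @staticmethod
    def tokenize(source):
        return re.findall(r"[A-Za-z_]\w*", source)

    def train(self, source, language):
        tokens = self.tokenize(source)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocabulary.update(tokens)

    def classify(self, source):
        tokens = self.tokenize(source)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for lang, counts in self.token_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[lang] / total_docs)
            denom = sum(counts.values()) + len(self.vocabulary)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = lang, score
        return best
```

Training it on a handful of labelled files per language already gives usable predictions, and the log-probability margin between the top two languages can serve as a rough confidence measure.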
This one is not fast and may not satisfy your requirements, but it's just an idea. It should be easy to implement and should give a definitive result.
You could try to compile/execute the input text with different compilers/interpreters (open source or free) and check for errors behind the scenes.
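A sketch of that compiler-probing idea; the checker table is an assumption (only two entries, and syntax-check-only flags where available), and tools missing from the system are simply skipped:

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical table: language -> (tool, syntax-check flags, file extension)
CHECKERS = {
    "c":      ("gcc", ["-fsyntax-only", "-x", "c"], ".c"),
    "python": ("python3", ["-m", "py_compile"], ".py"),
}

def languages_that_accept(source: str):
    """Return the languages whose checker accepts `source` without errors.

    Checkers that are not installed are skipped rather than reported.
    """
    accepted = []
    for lang, (tool, flags, ext) in CHECKERS.items():
        if shutil.which(tool) is None:
            continue
        with tempfile.NamedTemporaryFile("w", suffix=ext, delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run([tool, *flags, path], capture_output=True)
            if result.returncode == 0:
                accepted.append(lang)
        finally:
            os.unlink(path)
    return accepted
```

Note that actually *executing* untrusted input would be a security risk, so syntax-only checks are preferable; and some sources are valid in several languages, so more than one checker may accept them.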