Identifying keywords of a (programming) language
This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and works fairly well: quick and reasonably accurate.
The method I used is the following: I have a "learning" Perl script that identifies the most frequently used words in a language by building a word histogram over a set of sample files. These data are then loaded by a C++ program, which checks the given text, accumulates a score for each language based on the words it finds, and then simply reports which language accumulated the highest score.
Now I would like to make it even better and work a bit on the quality of identification. The problem is that I often get "unknown" as the result (many languages accumulate a small score, but none bigger than my threshold). After some debugging, research, etc., I found out that this is probably because all words are considered equal. This means that seeing a "#include", for example, has the same effect as seeing a "while": both indicate that it might be C/C++ (I'm ignoring for now the fact that "while" is used in many other languages). But of course a larger .cpp file might contain a ton of "while" and, most of the time, only a few "#include".
So the fact that a "#include" is more important is ignored, because I could not come up with a good way to determine whether one word is more important than another. Bear in mind that the script which creates the data is fairly stupid: it is only a word histogram, and it assigns every chosen word a score of 1. It does not even look at the words themselves (so if "#&|?/" occurs very often in a file, it might get chosen as a good word).
I would also like the data-creation part to be fully automated, so nobody should have to look at the data and alter them, change scores, change words, etc. All the "brainz" should be in the script and the C++ program.
Does somebody have a suggestion for how to identify keywords or, more generally, important words? Some things that might help: I have the number of occurrences of each word and the total number of words (so a ratio can be calculated). I have also thought about stripping characters like ";", since the histogram script often puts, for example, "continue;" in the result when the important word is "continue". Last note: all checks for equality are exact matches, no substrings, case sensitive. This is mainly for speed, but substrings might help (or hurt, I don't know)...
NOTE: thanks to all who bothered to answer; you helped me a lot.
My work on this is almost finished, so I will describe what I did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources, to avoid coding-style bias.
2) Write a Perl script that builds a word histogram. Implement a blacklist and a whitelist (more about them below).
3) Add bogus words like "license", "the", etc. to the blacklist. These are often found at the start of a file, in the license information.
4) Add the five or so most important words per language to the whitelist. These are words that are found in most source code of a given language but are not frequent enough to get into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef, and #endif in the whitelist.
5) Emphasize the start of a file by giving more points to words found in the first 50-100 lines.
6) When building the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); this should be OK for most languages, I think (it gives me the best results). For each language, keep about the 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words found in the blacklist and add all those found in the whitelist.
8) Write a program which processes a text file the same way the script does, tokenizing with the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram that correspond to only one language should add more points; those that belong to multiple languages should add fewer. A sketch of the training side follows this list.
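A minimal sketch of the training script described above; the samples/<language>/<file> layout and the blacklist/whitelist contents here are illustrative assumptions, not the exact data I used:

    #!/usr/bin/perl
    # Sketch of the histogram trainer (steps 2-7). Step 5 (extra weight
    # for the first 50-100 lines) is omitted for brevity.
    use strict;
    use warnings;

    my %blacklist = map { $_ => 1 } qw(license the this of and);
    my %whitelist = (
        'cpp' => [qw(#include #define #ifdef #ifndef #endif)],
    );

    my %histogram;    # $histogram{$language}{$word} = count

    for my $path (glob "samples/*/*") {
        my ($language) = $path =~ m{samples/([^/]+)/};
        open my $fh, '<', $path or die "cannot open $path: $!";
        while (my $line = <$fh>) {
            # The same tokenization rule as in step 6.
            my @words = split(/[\s\(\){}\[\];.,=]+/, $line);
            $histogram{$language}{$_}++ for grep { length } @words;
        }
        close $fh;
    }

    for my $language (sort keys %histogram) {
        my $counts = $histogram{$language};
        delete $counts->{$_} for grep { $blacklist{$_} } keys %$counts;

        # Keep the ~20 most frequent words (step 6), then force the
        # whitelisted words back in (step 7).
        my @top = (sort { $counts->{$b} <=> $counts->{$a} } keys %$counts)[0 .. 19];
        push @top, @{ $whitelist{$language} || [] };

        my %unique = map { $_ => 1 } grep { defined } @top;
        print "$language\t$_\n" for sort keys %unique;
    }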
Comments are welcome. Currently, on about 1000 text files, I get 80 unknowns (mostly on extremely short files, mainly JavaScript with just one or two lines). About 20 files are recognized incorrectly. The average file size is about 11 kB, ranging from 100 bytes to 100 kB (almost 11 MB in total). It takes one second to process them all, which is good enough for me.
Comments (4)
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion, because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it, and a Bayes classifier won't treat it as useful.

An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.
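A minimal sketch of such a classifier, assuming (hypothetically) that the training histograms are kept in a $counts->{$language}{$word} structure and that the unknown file is tokenized with the same rules as the training data:

    # Multinomial naive Bayes over word counts, with a uniform prior.
    use strict;
    use warnings;
    use List::Util qw(sum0);

    sub classify {
        my ($counts, @words) = @_;

        # Union vocabulary across all languages, needed for smoothing.
        my %vocab = map { %{ $counts->{$_} } } keys %$counts;
        my $vocab_size = keys %vocab;

        my ($best, $best_score);
        for my $language (keys %$counts) {
            my $total = sum0(values %{ $counts->{$language} });

            # Sum of log-probabilities avoids numeric underflow; the +1
            # (Laplace smoothing) keeps a single unseen word from sending
            # a language's score to minus infinity.
            my $score = 0;
            $score += log((($counts->{$language}{$_} || 0) + 1)
                          / ($total + $vocab_size)) for @words;

            ($best, $best_score) = ($language, $score)
                if !defined $best_score || $score > $best_score;
        }
        return $best;    # language with the highest posterior
    }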
You need to get some exclusiveness into your lookup data.

When teaching it the programming languages you expect, you should search for words that are typical of one or a few languages. If a word appears in several code files of the same language but in few or none of the other languages' files, that is a strong suggestion of that language.

So the score of a word could be calculated on the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words, take their intersection by adding up the scores, and you will have found your language.
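A sketch of how such exclusivity weights could be computed, again over a hypothetical $counts->{$language}{$word} histogram structure:

    # A word that occurs in the histogram of only one language gets
    # weight 1, a word shared by two languages gets 0.5, and so on.
    sub exclusivity_weights {
        my ($counts) = @_;
        my @languages = keys %$counts;

        my %weight;
        for my $language (@languages) {
            for my $word (keys %{ $counts->{$language} }) {
                # grep in scalar context counts the languages knowing the word
                my $shared_by = grep { exists $counts->{$_}{$word} } @languages;
                $weight{$language}{$word} = 1 / $shared_by;
            }
        }
        return \%weight;
    }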
Use Google Code Search to learn weights for the set of keywords: #include gets 672,000 hits in C++, but only ~5,000 in Python.

You can normalize the results by looking at the total number of results per language: C++ yields about 770,000 files, whereas Python returns 120,000.

Thus "#include" is extremely rare in Python files but exists in almost every C++ file. (Now you still have to learn to distinguish C++ from C, of course.) All that is left is correct reasoning about the probabilities.
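The arithmetic spelled out, using the hit counts quoted above:

    my %hits  = ('C++' => 672_000, 'Python' => 5_000);    # files matching "#include"
    my %total = ('C++' => 770_000, 'Python' => 120_000);  # indexed files per language

    for my $language (sort keys %hits) {
        printf "share of %s files containing #include: %.3f\n",
               $language, $hits{$language} / $total{$language};
    }
    # Roughly 0.873 for C++ versus 0.042 for Python, i.e. "#include" is
    # strong evidence for C++ and strong evidence against Python.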
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.

There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens of the different languages; the basic problem, of course, is identifying them automatically. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.

This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take context into account as well.
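A sketch of how these extra features could be extracted; the feature names are illustrative, and the resulting counts would be fed to the classifier alongside the keyword scores:

    # Punctuation and comment-token features, including the positional
    # "--" rule described above.
    sub punctuation_features {
        my ($text) = @_;
        my %feature;

        # The "= () =" idiom counts regex matches in scalar context.
        $feature{'brace'}     = () = $text =~ /\{/g;
        $feature{'semicolon'} = () = $text =~ /;/g;

        # Hardcoded comment tokens treated as pseudo-keywords.
        for my $token ('//', '/*', "'", '--', '#', '!') {
            my $pattern = quotemeta $token;
            $feature{"token:$token"} = () = $text =~ /$pattern/g;
        }

        # Context matters: "--" at the start of a line suggests SQL,
        # elsewhere it is more likely C-family code.
        $feature{'-- at line start'} = () = $text =~ /^[ \t]*--/mg;

        return \%feature;
    }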