How to obfuscate C++ variables and functions
I'm trying to compare algorithms to detect plagiarism. I've found many plagiarism checkers that compare plain TEXT.
But source code is very different. Let's say an algorithm uses a huge number of variables, functions and user-defined structures. If someone copies the source code from someone else, he'll at least change the variable and function names. With a simple text comparison algorithm, each renamed function or variable counts as a "difference", so the comparison returns "false" for plagiarism.
What I want to do is "generalize" (I don't know if that's the right word) all the variable, function and user-defined structure names in a C++ source file. So the variables will be named like "a", "b", and the same for functions: "... fa(...)", "... fb(...)".
I have the C++ sources to be compared in string variables in PHP.
I know that many other things should be analysed for an accurate source code comparison, but that will be enough for me.
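A rough sketch of the "generalize the names" idea described above, written in Python rather than PHP purely for illustration. It is regex-based, not a real C++ parser. The function name `normalize_identifiers`, the `id0`, `id1`, ... naming scheme and the short keyword list are all assumptions made for this example. It cannot tell variables from functions (so it does not produce the "a" / "fa" split), and it also renames standard library names such as `cout`.

```python
import re

# Assumption: a deliberately short keyword list. Any word not listed here is
# treated as a user-chosen name and gets replaced.
CPP_KEYWORDS = {
    "int", "char", "float", "double", "bool", "void", "long", "short",
    "unsigned", "signed", "const", "static", "struct", "class", "public",
    "private", "protected", "if", "else", "for", "while", "do", "switch",
    "case", "default", "break", "continue", "return", "new", "delete",
    "namespace", "using", "template", "typename", "true", "false",
}

# Order matters: comments, strings, preprocessor lines and numbers are
# matched first so that words inside them are never mistaken for identifiers.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<preproc>^[ \t]*\#[^\n]*)
  | (?P<number>\d[\w.]*)
  | (?P<ident>[A-Za-z_]\w*)
''', re.VERBOSE | re.DOTALL | re.MULTILINE)


def normalize_identifiers(source: str) -> str:
    """Replace each distinct non-keyword identifier with id0, id1, ...
    in order of first appearance, and blank out comments."""
    names = {}

    def replace(match):
        if match.group("comment") is not None:
            return " "                      # comments carry no code structure
        word = match.group("ident")
        if word is None:
            return match.group(0)           # string, preprocessor line, number
        if word in CPP_KEYWORDS:
            return word
        if word not in names:
            names[word] = f"id{len(names)}"
        return names[word]

    return TOKEN_RE.sub(replace, source)


original = """
int sum(int n) {
    int total = 0; // accumulate
    for (int i = 0; i < n; i++) total += i;
    return total;
}
"""

renamed = """
int add_up(int count) {
    int acc = 0; /* running total */
    for (int k = 0; k < count; k++) acc += k;
    return acc;
}
"""

same = normalize_identifiers(original) == normalize_identifiers(renamed)
print(same)
```

Because names are numbered by first appearance, two copies that differ only in naming and comments normalize to the same text, which can then be fed to an ordinary text comparison. Reordering declarations or statements changes the numbering, so this only catches the plain search-and-replace case.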
Comments (2)
It's an interesting question. Depending on how complex the algorithm is, however, it might be that the variable names are what gives the plagiarism away. How many ways can you really code up a tree traversal, for example?
I think there was a paper a few years ago on identifying coders through their style: looking at all the little things like whitespace, where the {} braces are placed, etc. Who knows, but maybe that is the way to go: look for a negative match to the student's previous style rather than a positive match to the known sources. That said, students aren't likely to have developed a very personal coding style at an early stage of learning.
One thought: what language are the examples written in? Can it be compiled? If you compile C and then do a binary comparison on the executables, will identical programs with different local variable names have the exact same binary? (Global variables and functions wouldn't, though.)
I've used MOSS (http://theory.stanford.edu/~aiken/moss/) in the past to detect plagiarized code. Since it works on a semantic level, it will detect the situations you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way in detecting code that has been modified through simple search-and-replace of variable and/or function names.
Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf
If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html
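To give a feel for why a token-level comparison is not fooled by search-and-replace renaming, here is a small Python sketch. It is not MOSS and makes no claim about how MOSS is implemented; it is a much simpler stand-in that collapses every identifier to the placeholder "ID" and compares sets of token k-grams with a Jaccard score. The names `token_stream`, `kgrams` and `similarity`, the keyword list and the default `k=4` are all assumptions for illustration, and comments are assumed to have been stripped beforehand.

```python
import re

# Assumption: a tiny keyword list, just enough for the demo inputs.
KEYWORDS = {"int", "void", "char", "float", "double", "bool", "struct",
            "class", "if", "else", "for", "while", "return"}

# Identifiers, numeric literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d[\w.]*|\S")


def token_stream(source):
    """Turn source text into tokens, hiding names and literal values."""
    out = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            out.append(tok if tok in KEYWORDS else "ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def kgrams(toks, k=4):
    """Set of all runs of k consecutive tokens."""
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of the two k-gram sets, between 0.0 and 1.0."""
    ga, gb = kgrams(token_stream(a), k), kgrams(token_stream(b), k)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


a = "int sum(int n) { int t = 0; for (int i = 0; i < n; i++) t += i; return t; }"
b = "int add(int c) { int acc = 0; for (int k = 0; k < c; k++) acc += k; return acc; }"
c = "void swap(int* x, int* y) { int tmp = *x; *x = *y; *y = tmp; }"

print(similarity(a, b))
print(similarity(a, c))
```

Unlike a whole-file text diff, a k-gram score degrades gradually when statements are moved around or a few lines are added, which is closer to what a plagiarism check needs.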