Can Git detect whether two source files are essentially copies of each other?
Sorry if this is off-topic, but here is your chance to reduce the number of "homework" questions on this site :-)
I'm teaching a C programming class in which the students work on a small library of numeric routines. This year, the source files from several groups of students contained significant amounts of duplicated code.
(Down to identically misspelled `printf` debug statements. I mean, how dumb can you be.)
I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.
Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable or function names.
Is there a way I can use Git to detect significant, literal code duplication (a.k.a. plagiarism)? Or is there some other tool you could recommend for that?
Why use git at all? A simple but effective technique would be to compare the sizes of the diffs between all of the different submissions, and then to manually inspect and compare those with the smallest differences.
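As a rough illustration of that idea, here is a minimal Python sketch that computes a line-based diff size for every pair of submissions and lists the most similar pairs first. It uses the standard `difflib` module rather than Git or the `diff` command, and the names `diff_size` and `rank_pairs` are made up for this example.

```python
import difflib
from itertools import combinations


def diff_size(a_text: str, b_text: str) -> int:
    """Number of lines a line-based diff would mark as removed or added."""
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    size = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines only in a (removed) plus lines only in b (added).
            size += (i2 - i1) + (j2 - j1)
    return size


def rank_pairs(submissions: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (diff_size, name_a, name_b) for every pair, smallest diff first."""
    pairs = [
        (diff_size(submissions[a], submissions[b]), a, b)
        for a, b in combinations(sorted(submissions), 2)
    ]
    return sorted(pairs)
```

The `submissions` dictionary (hypothetical: student or group name mapped to file contents) could be filled by reading each file, for example with `pathlib.Path.read_text`. Very short files produce small diffs no matter what, so the pairs at the top of the list still need the manual inspection the answer describes.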
Moss is a tool that was developed by a Stanford CS prof. I think they use it there as well. It's like diff for source code.
Adding to the other answers, you could use `diff`, but I don't think the results will be that useful by themselves. What you want is the number of lines that match, not counting blank lines, and to get that automatically you need to do a fair bit of magic with `wc -l` and `grep`: compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that `diff` included as matching. And even then you'll miss some cases where `diff` decided that identical lines didn't match because of different things inserted before them.
A much better option is one of the suggestions listed in https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or in https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, though the answers seem to overlap).
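The line counting described above can also be done without `wc -l` and `grep`. The following is a minimal Python sketch using the standard `difflib` module; it is a substitute for the shell pipeline, not a reproduction of it. The function names are hypothetical, and stripping surrounding whitespace before comparing is an extra assumption of this example.

```python
import difflib


def _stripped_lines(text: str) -> list[str]:
    # Assumption: surrounding whitespace is ignored when comparing lines.
    return [line.strip() for line in text.splitlines()]


def matching_nonblank_lines(a_text: str, b_text: str) -> int:
    """Count the lines a line-based diff treats as matching, skipping blank ones."""
    a = _stripped_lines(a_text)
    b = _stripped_lines(b_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    count = 0
    for block in matcher.get_matching_blocks():
        for line in a[block.a:block.a + block.size]:
            if line:  # blank lines are not evidence of copying
                count += 1
    return count


def match_ratio(a_text: str, b_text: str) -> float:
    """Matching non-blank lines as a fraction of the shorter file's non-blank lines."""
    nonblank_a = sum(1 for line in _stripped_lines(a_text) if line)
    nonblank_b = sum(1 for line in _stripped_lines(b_text) if line)
    shorter = min(nonblank_a, nonblank_b)
    if shorter == 0:
        return 0.0
    return matching_nonblank_lines(a_text, b_text) / shorter
```

Like `diff`, this only matches lines that appear in the same relative order, so the caveat about lines that are identical but end up unmatched still applies.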
You could use `diff` and check whether the two files seem similar, with options that tell `diff` to ignore whitespace changes and produce a `git`-like diff file. Try it out on two samples.
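The exact `diff` options are not shown in the answer, so the sketch below does not try to reproduce them. It only illustrates the same idea (ignore whitespace changes, emit a unified, `git`-style diff) using Python's standard `difflib`. The helper names and the default file labels `a.c` and `b.c` are hypothetical.

```python
import difflib


def squeeze(line: str) -> str:
    """Collapse runs of whitespace and drop leading/trailing whitespace."""
    return " ".join(line.split())


def whitespace_insensitive_diff(
    a_text: str, b_text: str, a_name: str = "a.c", b_name: str = "b.c"
) -> list[str]:
    """Unified diff of two texts after normalizing whitespace on every line."""
    a = [squeeze(line) for line in a_text.splitlines()]
    b = [squeeze(line) for line in b_text.splitlines()]
    return list(
        difflib.unified_diff(a, b, fromfile=a_name, tofile=b_name, lineterm="")
    )
```

An empty result means the two files differ only in whitespace. Otherwise the returned lines can be printed one per line and read like ordinary unified diff output.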
Using `diff` is absolutely not a good idea unless you want to venture into the realm of combinatorial hell: comparing every pair among n submissions takes n(n-1)/2 diffs!
On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams of each document. Each fingerprint is a hash used to classify documents, and possible plagiarism is detected when two documents end up sorted into the same bucket.
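To make the fingerprinting idea concrete, here is a small Python sketch of k-gram hashing with a winnowing-style selection step (keep the smallest hash in each sliding window). It is an illustration under stated assumptions, not Moss's actual implementation: the normalization, the hash function, the default `k` and `w`, and the Jaccard score are all choices made for this example.

```python
import hashlib


def normalize(text: str) -> str:
    """Assumption: drop all whitespace and lowercase before fingerprinting."""
    return "".join(text.split()).lower()


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every run of k consecutive characters of the normalized text."""
    s = normalize(text)
    return [
        int.from_bytes(
            hashlib.blake2b(s[i:i + k].encode("utf-8"), digest_size=8).digest(),
            "big",
        )
        for i in range(len(s) - k + 1)
    ]


def winnow(hashes: list[int], w: int) -> set[int]:
    """Keep the minimum hash from each window of w consecutive hashes."""
    selected = {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}
    if not selected and hashes:
        selected.add(min(hashes))  # input shorter than one window
    return selected


def fingerprint(text: str, k: int = 5, w: int = 4) -> set[int]:
    return winnow(kgram_hashes(text, k), w)


def similarity(a_text: str, b_text: str, k: int = 5, w: int = 4) -> float:
    """Jaccard overlap of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = fingerprint(a_text, k, w), fingerprint(b_text, k, w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because each document is reduced to a set of hashes once, many submissions can be compared by indexing the hashes (the "buckets" mentioned above) instead of diffing every pair of files. This sketch skips that index and simply compares two fingerprint sets directly.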