Can Git detect whether two source files are essentially copies of each other?

Posted 2024-12-28 07:48:46

Sorry if this is off-topic, but here is your chance to reduce the amount of "homework" questions on this site :-)

I'm teaching a class of C programming where the students work on a small library of numeric routines in C. This year, the source files from several groups of students had significant amounts of code duplication in them.

(Down to identically misspelled printf debug statements. I mean, how dumb can you be.)

I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.

Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable/function names.

Is there a way I can use Git to detect significant and literal code duplication, a.k.a. plagiarism? Or is there some other tool you could recommend for that?


Comments (5)

爱殇璃 2025-01-04 07:48:46

Why use git at all? A simple but effective technique would be to compare the sizes of the diffs between all of the different submissions, and then to manually inspect and compare those with the smallest differences.
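
As a rough sketch of that idea (not a vetted tool), the Python below scores every pair of submissions by the number of lines a line-based diff would mark as added or removed, then sorts so the most similar pairs come first. It assumes the submissions have already been read into a dict of name to text; the names diff_size and rank_pairs are made up for this illustration.

```python
import difflib
from itertools import combinations


def diff_size(a_lines, b_lines):
    """Number of lines a line-based diff would report as removed or added."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def rank_pairs(submissions):
    """Return (diff_size, name_a, name_b) tuples, smallest diff first.

    `submissions` is assumed to map a submission name to its source text.
    """
    scored = []
    for (name_a, text_a), (name_b, text_b) in combinations(
        sorted(submissions.items()), 2
    ):
        size = diff_size(text_a.splitlines(), text_b.splitlines())
        scored.append((size, name_a, name_b))
    return sorted(scored)


if __name__ == "__main__":
    demo = {
        "alice": 'int x;\nprintf("helo");\nreturn 0;\n',
        "bob": 'int y;\nprintf("helo");\nreturn 0;\n',
        "carol": "double sum(double *v, int n) {\n  double s = 0;\n  return s;\n}\n",
    }
    for size, a, b in rank_pairs(demo):
        print(size, a, b)
```

The pairs at the top of the list are the ones worth inspecting by hand. Raw diff size favours short files, so dividing by file length may be worth considering for submissions of very different sizes.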

痴情换悲伤 2025-01-04 07:48:46

Moss is a tool that was developed by a Stanford CS prof. I think they use it there as well. It's like diff for source code.

回眸一遍 2025-01-04 07:48:46

Adding to the other answers, you could use diff, but I don't think the output will be that useful by itself. What you want is the number of lines that match, not counting blank lines, and to get that automatically you need to do a fair bit of magic with wc -l and grep to compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff included as matching. And even then you'll miss some cases where diff decided that identical lines didn't match because of different things inserted before them.
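
For illustration only, here is a sketch of that count in Python rather than with wc -l and grep. It swaps in difflib.SequenceMatcher from the standard library for the diff command, and the function names are invented for this example. Lines are stripped before comparison, so indentation changes do not hide a match.

```python
import difflib


def matching_nonblank_lines(text_a, text_b):
    """Count non-blank lines that a line-based diff treats as common."""
    a = [line.strip() for line in text_a.splitlines()]
    b = [line.strip() for line in text_b.splitlines()]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    count = 0
    for block in matcher.get_matching_blocks():
        # Each block says a[block.a : block.a + block.size] equals the
        # corresponding slice of b; skip blank lines inside it.
        count += sum(1 for line in a[block.a:block.a + block.size] if line)
    return count


def shared_fraction(text_a, text_b):
    """Matching non-blank lines as a fraction of the shorter file."""
    nonblank_a = sum(1 for line in text_a.splitlines() if line.strip())
    nonblank_b = sum(1 for line in text_b.splitlines() if line.strip())
    shorter = min(nonblank_a, nonblank_b)
    if shorter == 0:
        return 0.0
    return matching_nonblank_lines(text_a, text_b) / shorter
```

This has the same limitation described above: the matcher aligns lines in order, so identical lines that were moved around can go uncounted.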

A much better option is one of the suggestions listed in https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or in https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, though the answers seem to duplicate).

缺⑴份安定 2025-01-04 07:48:46

You could use diff and check whether the two files seem similar:

diff -iEZbwB -U 0 file1.cpp file2.cpp

Those options (GNU diff) tell diff to ignore case, whitespace changes, and blank lines, and to produce a git-like unified diff with no context lines. Try it out on two samples.

绳情 2025-01-04 07:48:46

Using diff is absolutely not a good idea unless you want to venture into the realm of combinatorial hell:

  • If you have 2 submissions, you have to perform 1 diff to check for plagiarism,
  • If you have 3 submissions, you have to perform 3 diffs to check for plagiarism,
  • If you have 4 submissions, you have to perform 6 diffs to check for plagiarism,
  • ...
  • If you have n submissions, you have to perform n(n-1)/2 diffs!

On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams of each document. The fingerprint is in fact a hash used to classify documents, and a possible plagiarism is detected when two documents end up being sorted in the same bucket.
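
To make the fingerprinting idea concrete, here is a heavily simplified sketch in Python. It is not Moss's actual implementation: the normalization, the CRC32 hash, the parameters k=5 and window=4, and all function names are assumptions chosen for this illustration. Each document is reduced to a set of k-gram hashes (keeping the minimum hash of each sliding window), and documents that land in the same fingerprint bucket are reported as candidate pairs.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def normalize(text):
    """Drop all whitespace and lowercase, so layout changes do not matter."""
    return "".join(text.split()).lower()


def fingerprints(text, k=5, window=4):
    """Select a subset of k-gram hashes: the minimum of each sliding window."""
    s = normalize(text)
    hashes = [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def shared_buckets(submissions, k=5, window=4):
    """Map (name_a, name_b) to the number of fingerprints the two share.

    `submissions` is assumed to map a submission name to its source text.
    """
    buckets = defaultdict(set)
    for name, text in submissions.items():
        for fp in fingerprints(text, k, window):
            buckets[fp].add(name)
    pair_counts = defaultdict(int)
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)
```

Each document is hashed once, and only documents that actually share a bucket are ever paired up, so unrelated submissions never need to be compared against each other directly.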
