Searching for copied files while ignoring file names

Posted on 2024-12-21 23:40:08

Sometimes my students try to submit identical files for their homework. If they did their homework themselves, it would be impossible for any two files to be exactly the same.

I put the homework in folders arranged like this: /section/id/

In this way, each section of the course has its own folder, each student has their own folder, and all of the files are within that last level. The student files come in a variety of formats.

  • How can I check if there are any exactly identical files (ignoring file names) within any sub-folder?

Answers (7)

和影子一齐双人舞 2024-12-28 23:40:08

This can help you identify identical files from your students, using the following for loop and awk one-liner:

Step: 1 - for i in path/to/files/*; do cksum "$i"; done > cksum.txt
Step: 2 - awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt

Test:

Some sample files, in which student 2 has submitted a file identical to student 1's:

[jaypal:~/Temp/homework] ls -lrt
total 32
-rw-r--r--  1 jaypalsingh  staff  10 17 Dec 17:58 student1
-rw-r--r--  1 jaypalsingh  staff  10 17 Dec 17:58 student2
-rw-r--r--  1 jaypalsingh  staff  10 17 Dec 17:58 student3
-rw-r--r--  1 jaypalsingh  staff  10 17 Dec 17:58 student4
[jaypal:~/Temp/homework] cat student1 
homework1
[jaypal:~/Temp/homework] cat student2 
homework1
[jaypal:~/Temp/homework] cat student3 
homework3
[jaypal:~/Temp/homework] cat student4 
homework4

Step 1:

Create a cksum.txt file using the cksum utility

[jaypal:~/Temp/homework] for i in *; do cksum "$i"; done > cksum.txt
[jaypal:~/Temp/homework] cat cksum.txt 
4294967295 0 cksum.txt
1271506813 10 student1
1271506813 10 student2
1215889011 10 student3
1299429862 10 student4

Step 2:

Use the awk one-liner to identify all identical files

[jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt 
1271506813 10 student1
1271506813 10 student2 

Test 2:

[jaypal:~/Temp/homework] for i in stu*; do cksum "$i"; done > cksum.txt
[jaypal:~/Temp/homework] awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt 
1271506813 10 student1
1271506813 10 student2
1271506813 10 student5
[jaypal:~/Temp/homework] cat student5
homework1
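The same two-step idea can be made recursive for the /section/id/ layout in the question. A sketch, assuming GNU find; the checksum file is written outside the tree being scanned so it doesn't checksum itself, and /path/to/section is a placeholder for the real root folder:

```shell
# Checksum every file in every sub-folder in one pass, then reuse the
# same two-pass awk trick to print the lines whose checksum (field 1)
# occurs more than once.
find /path/to/section -type f -exec cksum {} + > /tmp/cksum.txt
awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' /tmp/cksum.txt /tmp/cksum.txt
```

Because the checksum ignores the name, this catches a student who renamed a copied file as well.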
清引 2024-12-28 23:40:08

Create an md5 of all the files and insert them into a dictionary.
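A minimal bash sketch of that dictionary idea, using an associative array keyed by digest (md5sum is assumed available, and the optional directory argument defaults to the current folder):

```shell
#!/bin/bash
# Map each md5 digest to the first file that produced it; any later file
# hashing to the same digest is reported as a duplicate of that file.
root=${1:-.}
declare -A seen
while IFS= read -r -d '' f; do
    sum=$(md5sum "$f" | awk '{print $1}')
    if [[ -n ${seen[$sum]} ]]; then
        echo "DUPLICATE: $f == ${seen[$sum]}"
    else
        seen[$sum]=$f
    fi
done < <(find "$root" -type f -print0)
```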

箹锭⒈辈孓 2024-12-28 23:40:08

To list those files that have at least one duplicate:

md5sum * | sort | uniq -w32 --all-repeated=separate | awk '{print $2}'

Of course, this only finds files that are completely identical.

To deal with things in subfolders, you'll want to modify it to work with find.
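A sketch of that find-based variant, assuming GNU coreutils: -w32 tells uniq to compare only the 32-character md5 digest at the start of each line, and --all-repeated=separate prints each group of duplicates separated by a blank line:

```shell
# Hash every file in every sub-folder, sort so identical digests become
# adjacent, then keep only lines whose digest repeats.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```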

疏忽 2024-12-28 23:40:08

This is a whole field of study:

The thing with the mentioned approaches is, that changes in the tab size/settings and stuff like that will make a difference. Most homework assignments even require the student's name at the top. That will make all identical submissions look different.

I suggest running the submissions through the preprocessor (stripping comments, for one thing) and through some (very strict) code indenter (astyle, bcpp, cindent...?) to remove any 'superficial differences'.

You might even want to consider ignoring case - if you allow some false positives. This would even be able to spot the plagiarizer with a taste for naming conventions (renaming FindSpork() to findSpork()?).

There is a number of heuristics I could think of to add. This should set you off in the right direction, though.

Edit P.S. of course after anything else, you can still run it through a checksum. So e.g. you could do

cat submission.cpp | astyle -bj | cpp - | md5sum

to get something of a fingerprint that is far less sensitive to accidental/superficial changes (like comments or whitespace).

奶茶白久 2024-12-28 23:40:08

If you are really interested in exact copies, group files by size. If a group has more than one member, run md5sum on the files and then sort | uniq -c to see whether there are duplicates.
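A sketch of that size-first filtering; -printf and xargs -d are GNU extensions, and the two-pass awk keeps only files whose size occurs more than once, so the relatively expensive hashing runs on a shortlist:

```shell
# Pass 1 over the size list counts each size; pass 2 prints only files
# whose size repeats. md5sum hashes that shortlist, and uniq -c shows how
# many files share each digest (counts > 1 are duplicates).
find . -type f -printf '%s\t%p\n' > /tmp/sizes.txt
awk -F'\t' 'NR==FNR { c[$1]++; next } c[$1] > 1 { print $2 }' /tmp/sizes.txt /tmp/sizes.txt \
    | xargs -r -d '\n' md5sum | sort | uniq -w32 -c | sort -rn
```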

只是在用心讲痛 2024-12-28 23:40:08

fdupes works well for this task; for example, fdupes -r . lists groups of duplicate files across all sub-folders, ignoring file names.

我喜欢麦丽素 2024-12-28 23:40:08

The following will detect if they've simply renamed a bunch of variables and changed things like whitespace and tabs.

#!/bin/bash
separate=$(for f; do gzip -9 -c < "$f"; done | wc -c)
together=$(cat "$@" | gzip -9 -c | wc -c)
echo "$separate / $together = $((separate*100/together))%"

This technique is referred to as "Compression-Based Dissimilarity". See Text Comparison Using Data Compression:

Similarity detection is very important in the field of spam detection, plagiarism detection or topic detection. The main algorithm for comparison of text document is based on the Kolmogorov Complexity, which is one of the perfect measures for computation of the similarity of two strings in defined alphabet. Unfortunately, this measure is incomputable and we must define several approximations which are not metric at all, but in some circumstances are close to this behaviour and may be used in practice.

(PDF) Text comparison using data compression. Available from: https://www.researchgate.net/publication/289189224_Text_comparison_using_data_compression [accessed Sep 13 2018].
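A quick demonstration of the ratio the script computes (file names here are illustrative): two identical files compress together almost as well as one copy alone, so the ratio approaches 200%, while unrelated files stay near 100%:

```shell
# Build two identical files, then compare the size of compressing them
# separately against compressing their concatenation -- the same
# arithmetic as the script above.
tmp=$(mktemp -d)
seq 1 200 > "$tmp/copy1"
cp "$tmp/copy1" "$tmp/copy2"
separate=$(for f in "$tmp/copy1" "$tmp/copy2"; do gzip -9 -c < "$f"; done | wc -c)
together=$(cat "$tmp/copy1" "$tmp/copy2" | gzip -9 -c | wc -c)
echo "$separate / $together = $((separate * 100 / together))%"
rm -rf "$tmp"
```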
