Searching for duplicate files while ignoring file names
Sometimes my students try to submit identical files for their homework. If they did their homework themselves, it would be impossible for any two files to be exactly the same.
I put the homework in folders arranged like this: /section/id/
In this way, each section of the course has its own folder, each student has their own folder, and all of the files are within that last level. The student files come in a variety of formats.
- How can I check if there are any exactly identical files (ignoring file names) within any sub-folder?
This can help you identify exactly identical files from your students, using the following for loop and awk one-liner:
Step 1:
    for i in path/to/files/*; do cksum "$i"; done > cksum.txt
Step 2:
    awk 'NR==FNR && a[$1]++ { b[$1]; next } $1 in b' cksum.txt cksum.txt
Test:
Some sample files in which student 2 has used a file identical to student 1's
Step 1: Create a cksum.txt file using the cksum utility
Step 2: Use the awk one-liner to identify all files that are the same
Test 2:
Create an md5 of all the files and insert them into a dictionary.
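As a rough sketch of the dictionary idea, assuming GNU `md5sum` is installed: the hypothetical helper `find_duplicates` below hashes every file under a root folder and uses an awk array as the dictionary, keyed by digest. The function name and the output wording are illustrative only.

```shell
# Sketch: report every file whose MD5 digest was already seen for an earlier file.
# Assumes GNU md5sum; paths containing newlines or backslashes are not handled.
find_duplicates() {
    find "$1" -type f -exec md5sum {} + |
        LC_ALL=C sort -k 2 |
        awk '{
            digest = $1
            path = substr($0, 35)   # 32 hex characters, two separator characters, then the path
            if (digest in first)
                print path " duplicates " first[digest]
            else
                first[digest] = path
        }'
}
```

With the layout from the question, something like `find_duplicates /section` would print one line per file whose content matches an earlier file, whatever the files are called.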
To list those files that have at least one duplicate:
Of course, this only finds files that are completely identical.
To deal with things in subfolders, you'll want to modify it to work with find.
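One way such a listing might look once it is adapted to `find`, as a sketch that assumes GNU `md5sum` and GNU `uniq` (the `-w` and `--all-repeated` options are GNU extensions). The function name `list_duplicates` is made up for illustration.

```shell
# Sketch: print the md5sum line of every file whose digest occurs more than once.
# Groups of identical files are separated by a blank line.
list_duplicates() {
    find "$1" -type f -exec md5sum {} + |
        LC_ALL=C sort |
        uniq -w 32 --all-repeated=separate
}
```

Sorting puts equal digests next to each other, and `-w 32` makes `uniq` compare only the 32-character digest, so the differing file names are ignored.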
This is a whole field of study:
The problem with the approaches mentioned is that changes in tab size/settings and the like will make a difference. Most homework assignments even require the student's name at the top. That will make all identical submissions look different.
I suggest running the submissions through the preprocessor (stripping comments, for one thing) and through some (very strict) code indenter (astyle, bcpp, cindent...?) to remove any 'superficial differences'.
You might even want to consider ignoring case, if you allow some false positives. This would even be able to spot the plagiarizer with a taste for naming conventions (renaming FindSpork() to findSpork()?).
There are a number of heuristics I could think of to add. This should set you off in the right direction, though.
Edit P.S. Of course, after anything else, you can still run it through a checksum. So e.g. you could do
to get something of a fingerprint that is far less sensitive to accidental/superficial changes (like comments or whitespace).
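A very rough sketch of such a fingerprint, under the assumption that deleting all whitespace and folding case is an acceptable stand-in for a real preprocessor and indenter. It does not strip comments, and the helper name `fingerprint` is hypothetical.

```shell
# Sketch: checksum of a file after removing all whitespace and lowercasing letters.
# Assumes GNU md5sum; comments and a name line at the top would still change the result.
fingerprint() {
    tr -d '[:space:]' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        md5sum |
        cut -d ' ' -f 1
}
```

Two submissions that differ only in indentation, line breaks, or the case of identifiers would then get the same fingerprint, at the cost of some false positives.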
If you are really interested in exact copies, group files by size. If a group has more than one member, run md5sum on the files and then sort | uniq -c to see whether there are duplicates.
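The size-first idea could be sketched roughly as follows, assuming GNU `find`, `xargs`, `md5sum` and `uniq`, and paths without tabs or newlines. The function name `dupes_by_size` is made up; `uniq` is given `-w 32` so that it counts by digest rather than by the whole line.

```shell
# Sketch: hash only files that share their byte size with another file,
# then count how often each digest occurs (a count above 1 means duplicates).
dupes_by_size() {
    find "$1" -type f -printf '%s\t%p\n' |
        awk -F '\t' '
            { count[$1]++; size[NR] = $1; path[NR] = $2 }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print path[i]
            }' |
        tr '\n' '\0' |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        uniq -c -w 32
}
```

Files with a unique size are never hashed, which saves time on large collections; each output line shows a count, the digest, and the first path in that group.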
fdupes works well here for this task
The following will detect if they've simply renamed a bunch of variables and changed things like whitespace and tabs.
This technique is referred to as "Compression-Based Dissimilarity". See Text Comparison Using Data Compression:
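A minimal sketch of the idea, assuming one common formulation of the measure, C(xy) / (C(x) + C(y)), where C is the compressed size in bytes, and using `gzip` as the compressor. Both choices are assumptions rather than something this answer specifies, and the helper names `csize` and `cbd` are hypothetical.

```shell
# Sketch: compression-based dissimilarity of two files.
# Values near 0.5 suggest near-identical content, values near 1 unrelated content.
csize() { gzip -c | wc -c; }

cbd() (
    cx=$(csize < "$1")
    cy=$(csize < "$2")
    cxy=$(cat "$1" "$2" | csize)
    awk -v x="$cx" -v y="$cy" -v xy="$cxy" 'BEGIN { printf "%.3f\n", xy / (x + y) }'
)
```

Running `cbd` on every pair of submissions and sorting by the result would put the most suspicious pairs first. Note that gzip only looks back about 32 KB, so for larger files a compressor with a bigger window would probably be a better fit.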