Compare n files (binary)
I want to compare a number of files and find out which of them are identical, but they are not necessarily text files (so please don't suggest diff).
The files can be in any format (i.e. binary files).
I found out that I can run md5sum to get the hash of each file and then compare the hashes manually to check whether they are the same. But how can I automate this process?
PS: I also found that I can store the md5sums in a file using
md5sum <file-names> | cat >md5sum.txt
but I am stuck on how to automate this process.
I would prefer this to be done via a script (any language is fine).
Comments (1)
If you can use a language like Perl or Python with built-in support for hashes/dictionaries, it's really easy.
Loop over the file names, compute the signature of each file, and build a hash with the md5sum as key and the list of files having that md5 as value.
Then loop over the content of the hash and show the entries with more than one item. These files are likely to be identical (you can't be completely sure with a signature-based approach).
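Since matching digests only make files likely to be identical, a candidate group can be confirmed byte for byte afterwards. The sketch below is one way to do that with Python's standard filecmp module; the function name confirm_identical is made up for this illustration, and it assumes it receives the list of paths that shared one md5sum.

```python
import filecmp


def confirm_identical(paths):
    """Split one candidate group into lists of files that are byte-identical."""
    confirmed = []  # each inner list holds files with identical content
    for path in paths:
        for group in confirmed:
            # shallow=False forces a comparison of the actual file contents
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one
            confirmed.append([path])
    # Only groups with at least two members are real duplicates
    return [group for group in confirmed if len(group) > 1]
```

This only costs extra reads for files whose digests already collided, so it stays cheap compared with comparing every pair of files directly.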
Since people are asking for code, something like the Perl implementation below may do. I may add an equivalent Python sample later if it is wanted.
Say you put that in a file same.pl, you call it like:
perl same.pl
Example of use:
Below is a possible Python version (working with both Python 2 and Python 3).
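The listing itself is not reproduced here, so what follows is only a minimal sketch of how such a script might look, not the original code. It assumes the file names arrive as command-line arguments; the names file_md5 and group_by_md5 are invented for this illustration. Files are opened in binary mode and hashed in chunks, so the format and size of the files do not matter.

```python
import hashlib
import sys
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths):
    """Map each digest to the list of files that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(path)
    return groups


def main(argv):
    # Print only the digests shared by more than one file
    for digest, names in group_by_md5(argv[1:]).items():
        if len(names) > 1:
            print("%s: %s" % (digest, " ".join(names)))


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as same.py, it could be called as python same.py followed by the files to compare; each output line would then list one digest and the files sharing it.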
Note that if you are comparing a really large number of files, providing the file names on the command line as in the examples above may not work, because the shell command line will overflow. In that case you should use some more elaborate way of supplying the names (or put some glob inside the script).
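One way to avoid the command-line limit is to let the script collect the file names itself. The following is a hedged sketch that walks a directory tree with os.walk instead of taking names as arguments; iter_files and find_duplicates are hypothetical names, and scanning a whole tree is an assumption about what is wanted.

```python
import hashlib
import os
from collections import defaultdict


def iter_files(root):
    """Yield the path of every regular file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                yield path


def find_duplicates(root):
    """Return sorted lists of paths under root that share an MD5 digest."""
    groups = defaultdict(list)
    for path in iter_files(root):
        with open(path, "rb") as handle:
            # Reads the whole file at once for brevity; very large files
            # would be better hashed chunk by chunk.
            digest = hashlib.md5(handle.read()).hexdigest()
        groups[digest].append(path)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling find_duplicates(".") would scan the current directory and everything below it, however many files it contains.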