Comparing n files (binary)

Posted on 2024-12-10 21:39:05

I want to compare a number of files and find out which of them are identical. They are not necessarily text files, so please don't suggest diff.

The files can be in any format (i.e. binary files).

I found out that I can run md5sum to get the hash of each file and then compare the hashes manually to check whether they are the same. But how can I automate this process?

PS: I also found that I can store the md5sums in a file using

md5sum <file-names> | cat >md5sum.txt

but I am stuck on how to automate the comparison.

I would prefer this to be done via a script (any language is fine).

白芷 2024-12-17 21:39:05

If you can use a language like Perl or Python with built-in support for hashes/dictionaries, it's really easy.

Loop over the file names, compute the signature of each file, and build a hash with the md5sum as key and the list of files with that md5 as value.

Then loop over the contents of the hash and show the entries with more than one item. These files are likely to be identical (a signature-based approach cannot guarantee it, because two different files can in principle share the same MD5).

As people are asking for code, maybe something like below. This is a Perl implementation; an equivalent Python version follows further down.

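#!perl
use strict;
use warnings;

# Map each MD5 signature to the list of files that produce it.
my %same;
for my $file (@ARGV) {
    # Run md5sum without a shell so that names with spaces are passed intact.
    open(my $fh, '-|', 'md5sum', '--', $file) or die "Cannot run md5sum: $!";
    my $line = <$fh>;
    close($fh);
    next unless defined $line;    # md5sum could not read this file
    my ($sig) = split(/\s+/, $line);
    push @{ $same{$sig} }, $file;
}

# Report every signature shared by more than one file.
for my $sig (keys %same) {
    if (@{ $same{$sig} } > 1) {
        print "Files with same MD5 : " . join('-', @{ $same{$sig} }) . "\n";
    }
}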

Say you put that in a file same.pl; you then call it with the files to compare as arguments:

perl same.pl <file-names>

Example of use:

$ md5sum F*
c9904273735f3141c1dd61533e02246a  F1
c9904273735f3141c1dd61533e02246a  F2
c9904273735f3141c1dd61533e02246a  F3
d41d8cd98f00b204e9800998ecf8427e  F4

$ perl same.pl F1 F2 F3 F4
Files with same MD5 : F1-F2-F3

Below is a possible python version (working with both python2 and python3).

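#!python

import hashlib

def md5sum(filename, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary chunks."""
    d = hashlib.md5()
    # The with block closes the file even if reading fails.
    with open(filename, mode='rb') as f:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for buf in iter(lambda: f.read(chunk_size), b''):
            d.update(buf)
    return d.hexdigest()


if __name__ == "__main__":
    import sys
    # Map each MD5 signature to the list of files that produce it.
    same = {}
    for name in sys.argv[1:]:
        sig = md5sum(name)
        same.setdefault(sig, []).append(name)

    # Report every signature shared by more than one file.
    for k in same:
        if len(same[k]) > 1:
            print("Files with same MD5: {l}".format(l="-".join(same[k])))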
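
Since a matching signature only makes files likely to be identical, a group of candidates can be confirmed with a byte-by-byte comparison. Here is a minimal sketch using the standard filecmp module; the helper name confirm_identical is made up for illustration and is not part of the scripts above.

```python
import filecmp


def confirm_identical(paths):
    """Split paths that share a signature into groups of byte-identical files.

    Hypothetical helper: `paths` is one list of candidate duplicates,
    for example one value of the `same` dictionary.
    """
    groups = []
    for path in paths:
        for group in groups:
            # shallow=False forces a content comparison instead of
            # trusting the os.stat() signature of the two files.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append([path])
    # Only groups with at least two members are confirmed duplicates.
    return [g for g in groups if len(g) > 1]
```

Each list in the result contains files whose contents really are equal, so it could be printed the same way as the signature groups.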

Note that if you are comparing a really large number of files, providing the file names on the command line as in the above examples may not be enough, because the shell command line will overflow. In that case you should use some more elaborate way to supply the names (or put some glob inside the script).
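
One way to avoid a long command line is to let the script collect the file names itself. The sketch below assumes the files live under a single directory tree and walks it with os.walk; the function name find_duplicates and the root argument are illustrative choices, not something the question specifies.

```python
import hashlib
import os


def md5sum(filename, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary chunks."""
    d = hashlib.md5()
    with open(filename, 'rb') as f:
        for buf in iter(lambda: f.read(chunk_size), b''):
            d.update(buf)
    return d.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths under root that share an MD5 signature.

    Hypothetical helper: `root` is the directory to scan recursively.
    """
    same = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip entries that are not readable as files (e.g. broken symlinks).
            if os.path.isfile(path):
                same.setdefault(md5sum(path), []).append(path)
    return [sorted(paths) for paths in same.values() if len(paths) > 1]
```

Calling find_duplicates('.') scans the current directory and everything below it, so no file names have to pass through the shell at all.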
