How can I calculate an MD5 checksum of a directory?

Posted on 2024-08-09 19:34:08

I need to calculate a summary MD5 checksum for all files of a particular type (*.py for example) placed under a directory and all sub-directories.

What is the best way to do that?


The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution to get a single summary checksum which will uniquely identify the directory as a whole - including content of all its subdirectories.

Comments (16)

墨小墨 2024-08-16 19:34:08

Create a tar archive file on the fly and pipe that to md5sum:

tar c dir | md5sum

This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.

薄荷港 2024-08-16 19:34:08
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum

The find command lists all the files that end in .py.
The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique).
The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.

I've tested this by copying a test directory:

rsync -a ~/pybin/ ~/pybin2/

I renamed some of the files in ~/pybin2.

The find...md5sum command returns the same output for both directories.

2bcf49a4d19ef9abd284311108d626f1  -

To take into account the file layout (paths), so the checksum changes if a file is renamed or moved, the command can be simplified:

find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | md5sum

On macOS with md5:

find /path/to/dir/ -type f -name "*.py" -exec md5 {} + | md5
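
The two variants above can be wrapped as small shell functions. This is only a sketch, assuming GNU md5sum; the names py_md5_content and py_md5_paths are hypothetical and not part of the original answer. Both functions sort under LC_ALL=C so the result does not depend on the locale or on the order in which find visits files, and the path-aware one hashes paths relative to the directory, so the same tree stored in two different places gives the same value.

```shell
# Sketch only; function names are hypothetical. Assumes GNU md5sum.

# Content-only checksum: renaming or moving a file leaves the result unchanged.
py_md5_content() {
  find "$1" -type f -name '*.py' -exec md5sum {} + \
    | awk '{print $1}' | LC_ALL=C sort | md5sum | awk '{print $1}'
}

# Path-aware checksum: paths are relative to the directory, so a rename or
# move changes the result, but the location of the tree itself does not.
py_md5_paths() {
  ( cd "$1" && find . -type f -name '*.py' -exec md5sum {} + ) \
    | LC_ALL=C sort | md5sum | awk '{print $1}'
}
```

Usage would look like `py_md5_content ~/pybin` or `py_md5_paths ~/pybin`; each prints a single 32-character hash.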
窝囊感情。 2024-08-16 19:34:08

ire_and_curses's suggestion of using tar c <dir> has some issues:

  • tar processes directory entries in the order which they are stored in the filesystem, and there is no way to change this order. This effectively can yield completely different results if you have the "same" directory on different places, and I know no way to fix this (tar cannot "sort" its input files in a particular order).
  • I usually care about whether groupid and ownerid numbers are the same, not necessarily whether the string representation of group/owner are the same. This is in line with what for example rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar
  • tar will include the filename of the directory you're checking itself, just something to be aware of.

As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
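
For what it's worth, newer GNU tar releases (1.28 and later, to my knowledge) accept a --sort=name option that addresses the ordering problem. The sketch below assumes GNU tar; dir_md5_tar is a hypothetical name. Archiving `.` from inside the directory via -C also keeps the directory's own name out of the archive. File metadata such as modification times and permissions still goes into the hash.

```shell
# Sketch only; assumes GNU tar >= 1.28 for --sort=name.
# dir_md5_tar is a hypothetical helper name.
dir_md5_tar() {
  # --sort=name      : archive members in a stable, name-sorted order
  # --numeric-owner  : record numeric uid/gid instead of user/group names
  # -C "$1" ... .    : archive the directory contents, not its own name
  tar --sort=name --numeric-owner -C "$1" -cf - . | md5sum | awk '{print $1}'
}
```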

The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should take empty directories into account.

Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.

This is the solution I came up with:

dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum

Notes about this solution:

  • The LC_ALL=C is to ensure reliable sorting order across systems
  • This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occurring seems very unlikely. One usually fixes this with a -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.

PS: one of my systems uses a limited busybox find which does not support -exec nor -print0 flags, and also it appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:

dir=<mydir>; (find "$dir" -type f | while IFS= read -r f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum

Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.
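
Where names containing newlines do matter, one possible NUL-safe variant is sketched below. It assumes GNU find, sort and xargs (for -print0, -z and -0 -r), and dir_md5_full is a hypothetical name. File names are sorted NUL-separated before md5sum sees them, the directory list is fed to the final hash still NUL-separated, and find runs from inside the directory so the absolute path does not leak into the result.

```shell
# Sketch only; assumes GNU find/sort/xargs. dir_md5_full is a hypothetical name.
dir_md5_full() {
  [ -d "$1" ] || return 1
  (
    cd "$1" || exit 1
    # Regular files: hash the contents; md5sum prints "hash  ./path" per file.
    find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum
    # Directories: emit the names (NUL-separated) so empty directories count.
    find . -type d -print0 | LC_ALL=C sort -z
  ) | md5sum | awk '{print $1}'
}
```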

挽手叙旧 2024-08-16 19:34:08

If you only care about files and not empty directories, this works nicely:

find /path -type f | sort -u | xargs cat | md5sum
戏舞 2024-08-16 19:34:08

A solution which worked best for me:

find "$path" -type f -print0 | sort -z | xargs -r0 md5sum | md5sum

Reason why it worked best for me:

  1. Handles file names containing spaces
  2. Ignores filesystem metadata
  3. Detects if a file has been renamed

Issues with other answers:

Filesystem meta-data is not ignored for:

tar cf - "$path" | md5sum

The following neither handles file names containing spaces nor detects whether a file has been renamed:

find /path -type f | sort -u | xargs cat | md5sum
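
The rename limitation of the cat-based pipeline can be checked with a small experiment. The sketch below is a NUL-safe version of that pipeline (concat_md5 is a hypothetical name; -print0, sort -z and xargs -0 -r are GNU extensions). Because only the concatenated bytes are hashed, neither file names nor the boundaries between files affect the result.

```shell
# Sketch only; concat_md5 is a hypothetical name. Assumes GNU find/sort/xargs.
concat_md5() {
  # Concatenate all files in sorted-name order and hash the combined stream.
  ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r cat ) \
    | md5sum | awk '{print $1}'
}
```

For example, a directory whose files 1 and 2 contain ab and c hashes the same as one whose files contain a and bc.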
尝蛊 2024-08-16 19:34:08

For the sake of completeness, there's md5deep(1); it's not directly applicable due to *.py filter requirement but should do fine together with find(1).

各自安好 2024-08-16 19:34:08

If you want one MD5 hash value spanning the whole directory, I would do something like

cat *.py | md5sum
北城孤痞 2024-08-16 19:34:08

Checksum all files, including both content and their filenames

grep -ar -e . /your/dir | md5sum | cut -c-32

Same as above, but only including .py files

grep -ar -e . --include="*.py" /your/dir | md5sum | cut -c-32

You can also follow symlinks if you want

grep -aR -e . /your/dir | md5sum | cut -c-32

Other options you could consider using with grep

-s, --no-messages         suppress error messages
-D, --devices=ACTION      how to handle devices, FIFOs and sockets;
-Z, --null                print 0 byte after FILE name
-U, --binary              do not strip CR characters at EOL (MSDOS/Windows)
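
Two caveats seem worth checking before relying on this. The pattern `.` only matches non-empty lines, so blank lines and empty files contribute nothing to the hash, and grep -r emits files in directory order rather than in sorted order. The sketch below (grep_md5 is a hypothetical name, GNU grep assumed) runs grep from inside the directory so the hash does not depend on where the tree lives; it can be used to observe the first caveat.

```shell
# Sketch only; grep_md5 is a hypothetical name. Assumes GNU grep.
grep_md5() {
  # Output lines look like "./file:content"; empty lines are never printed.
  ( cd "$1" && grep -ar -e . . ) | md5sum | cut -c-32
}
```

With this helper, a file containing the lines a and b and a copy with an extra blank line between them produce the same hash.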
世界和平 2024-08-16 19:34:08

GNU find

find /path -type f -name "*.py" -exec md5sum "{}" +;
极度宠爱 2024-08-16 19:34:08

Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files and touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if the file has changed. My unix-foo is weak so you might need some more command line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).
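
The same metadata-only idea can be sketched with GNU find's -printf instead of parsing ls output. dir_meta_md5 is a hypothetical name, and the format string (relative path, size, permission bits, modification time) is just one plausible choice. Like the ls approach, it never reads file contents, so it is fast but misses edits that preserve both size and timestamp.

```shell
# Sketch only; dir_meta_md5 is a hypothetical name. Assumes GNU find (-printf).
dir_meta_md5() {
  # %P = path relative to the start dir, %s = size in bytes,
  # %m = permission bits (octal), %T@ = modification time (epoch seconds)
  ( cd "$1" && find . -type f -name '*.py' -printf '%P\t%s\t%m\t%T@\n' ) \
    | LC_ALL=C sort | md5sum | awk '{print $1}'
}
```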

◇流星雨 2024-08-16 19:34:08

Using md5deep:

md5deep -r FOLDER | awk '{print $1}' | sort | md5sum

狠疯拽 2024-08-16 19:34:08

I want to add that if you are trying to do this for files/directories in a Git repository to track if they have changed, then this is the best approach:

git log -1 --format=format:%H --full-diff <file_or_dir_name>

And if it's not a Git directory/repository, then the answer by ire_and_curses is probably the best bet:

tar c <dir_name> | md5sum

However, note that the tar command can produce a different hash when it is run on a different OS or in a different environment. If you want to be immune to that, this is the best approach, even though it doesn't look very elegant at first sight:

find <dir_name> -type f -print0 | sort -z | xargs -0 md5sum | md5sum | awk '{ print $1 }'
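
A related option inside a Git repository, offered as an assumption rather than as part of the original answer: the ID of the tree object for a directory depends only on the committed contents below it, whereas the git log command above returns the last commit that touched the path. git_tree_id is a hypothetical name; the result is a Git object ID (SHA-1 by default), not an MD5, and uncommitted changes are not reflected.

```shell
# Sketch only; git_tree_id is a hypothetical name.
# $1 = path to the repository, $2 = directory relative to the repository root.
git_tree_id() {
  # "HEAD:<path>" names the tree object for <path> in the current commit.
  git -C "$1" rev-parse "HEAD:$2"
}
```

Usage would look like `git_tree_id /path/to/repo src`.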
樱娆 2024-08-16 19:34:08

md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.


find . -type f -print0 \
| xargs -r0 md5sum \
| grep -v ".env" \
| grep -v "vendor/autoload.php" \
| grep -v "vendor/composer/" \
| sort -d \
| md5sum
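
The exclusions can also be expressed in find itself, which avoids a subtlety of the grep filters: the pattern ".env" is a regular expression, so it also drops any path that contains some character followed by env. The following is a sketch, assuming GNU find, sort and xargs; dir_md5_excluding is a hypothetical name and the excluded paths are the ones from the answer. Unlike the answer, it sorts the file names under LC_ALL=C (which may or may not avoid the sorting issue mentioned) rather than sorting the md5sum output.

```shell
# Sketch only; dir_md5_excluding is a hypothetical name.
# Assumes GNU find/sort/xargs (-print0, -z, -0 -r).
dir_md5_excluding() {
  (
    cd "$1" || exit 1
    # Skip files named exactly ".env", vendor/autoload.php, and everything
    # under vendor/composer/; hash the rest in a locale-independent order.
    find . -type f \
      ! -name '.env' \
      ! -path './vendor/autoload.php' \
      ! -path './vendor/composer/*' \
      -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum
  ) | md5sum | awk '{print $1}'
}
```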

腹黑女流氓 2024-08-16 19:34:08

If you really want independence from the file system attributes and from the bit-level differences of some tar versions, you could use cpio:

find theDirname | cpio -o | md5sum
终陌 2024-08-16 19:34:08

I had the same problem, so I came up with this script. It lists the MD5 hash value of each file in a directory and recurses into every subdirectory it finds. It works on the current directory, or on the directory passed as $1.

#!/bin/bash
# List the MD5 hash of every regular file under a directory, recursing into
# subdirectories. The directory defaults to the current one.

list_md5() {
  local dir=$1 entry
  for entry in "$dir"/*; do
    if [ -f "$entry" ]; then
      md5sum "$entry"
    elif [ -d "$entry" ]; then
      list_md5 "$entry"   # recurse into the subdirectory
    fi
  done
}

# Use the directory given in $1, or the current directory if none is given.
list_md5 "${1:-$PWD}"
猫弦 2024-08-16 19:34:08

There are two more solutions:

Create:

du -csxb /path | md5sum > file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum > /tmp/file

Check:

du -csxb /path | md5sum -c file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum -c /tmp/file