How do I calculate the MD5 checksum of a directory?
I need to calculate a summary MD5 checksum for all files of a particular type (*.py, for example) placed under a directory and all its sub-directories.
What is the best way to do that?
The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution that gives a single summary checksum which uniquely identifies the directory as a whole, including the content of all its subdirectories.
Create a tar archive file on the fly and pipe that to md5sum:
This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.
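The command itself did not survive on this page, so here is a minimal sketch of the idea rather than the answerer's exact line. The helper name dir_md5_tar is made up for illustration, and the sketch assumes a tar that accepts -C and writes to stdout with "f -".

```shell
# Hash a directory by streaming a tar archive of it into md5sum.
# dir_md5_tar is a hypothetical helper name, not taken from the original answer.
dir_md5_tar() {
    # -C enters the directory first, so the leading path is not part of the archive.
    # "-f -" sends the archive to stdout, so nothing is written to disk.
    tar -C "$1" -cf - . | md5sum | awk '{print $1}'
}

# Usage: dir_md5_tar /path/to/dir
```

Keep in mind that a tar stream also records timestamps, ownership and permissions, so the hash changes when that metadata changes, even if file contents do not.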
The find command lists all the files that end in .py.
The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique).
The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.
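The three steps described above can be sketched as one pipeline. This is a reconstruction from the description, not necessarily the exact command the answer used; the function name py_content_md5 is hypothetical.

```shell
# Content-only digest of all *.py files under a directory.
# File names are dropped, so renaming a file does not change the result.
py_content_md5() {
    find "$1" -type f -name '*.py' -exec md5sum {} + |
        awk '{print $1}' |   # keep the per-file hashes, drop the paths
        LC_ALL=C sort |      # sort the hashes so traversal order does not matter
        md5sum | awk '{print $1}'
}

# Usage: py_content_md5 ~/pybin
```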
I've tested this by copying a test directory:
I renamed some of the files in ~/pybin2.
The find...md5sum command returns the same output for both directories.
To take into account the file layout (paths), so the checksum changes if a file is renamed or moved, the command can be simplified:
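A sketch of a path-aware variant, assuming the simplification meant here is dropping the awk step so that each path stays next to its hash. The name py_layout_md5 is hypothetical, and the sort is kept (my addition) so the result does not depend on traversal order.

```shell
# Digest of all *.py files that also depends on their relative paths.
py_layout_md5() {
    # Work from inside the directory so the hashed paths are relative (./a.py),
    # which lets two copies of the same tree produce the same digest.
    (cd "$1" && find . -type f -name '*.py' -exec md5sum {} + |
        LC_ALL=C sort | md5sum | awk '{print $1}')
}

# Usage: py_layout_md5 ~/pybin
```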
On macOS with md5:
ire_and_curses's suggestion of using tar c <dir> has some issues:
[...] rsync -a --delete does: it synchronizes virtually everything (minus xattrs and ACLs), but it syncs owner and group based on their numeric IDs, not on their string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar.
As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should take empty directories into account.
Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.
This is the solution I came up with:
Notes about this solution:
LC_ALL=C is there to ensure a reliable sorting order across systems.
[...] -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
PS: one of my systems uses a limited busybox find which supports neither the -exec nor the -print0 flag, and it also appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:
Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.
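The command for this answer is missing from the page. Going only by the points made above (hash the files, list the directories too, sort under LC_ALL=C), a sketch could look like this. The function name dir_md5_with_dirs is hypothetical, and like the answer it does not cope with newlines in names.

```shell
# Digest that covers file contents, file paths and directory names,
# so adding or removing an empty directory changes the result.
dir_md5_with_dirs() {
    (cd "$1" && {
        find . -type f -exec md5sum {} +   # "hash  ./path" for every regular file
        find . -type d                     # directory names, including empty ones
    } | LC_ALL=C sort | md5sum | awk '{print $1}')
}

# Usage: dir_md5_with_dirs /path/to/dir
```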
If you only care about files and not empty directories, this works nicely:
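The one-liner is not preserved here. One way to get a files-only digest, offered as an assumption about what was meant, is to concatenate every file in a stable order and hash the stream. The name files_cat_md5 is hypothetical, and the -print0 / sort -z / xargs -0 -r options assume GNU tools.

```shell
# Hash the concatenated contents of all regular files, visited in sorted path order.
files_cat_md5() {
    (cd "$1" && find . -type f -print0 |
        LC_ALL=C sort -z |     # stable, locale-independent order
        xargs -0 -r cat |      # NUL-separated names survive spaces
        md5sum | awk '{print $1}')
}

# Usage: files_cat_md5 /path/to/dir
```

Because only the byte stream is hashed, empty directories and most renames go unnoticed; a rename matters only when it changes the sort order of the files.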
A solution which worked best for me:
Reason why it worked best for me:
Issues with other answers:
Filesystem metadata is not ignored by:
tar c - "$path" | md5sum
The following neither handles file names containing spaces nor detects that a file has been renamed:
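The answer's own command and its list of reasons are missing from this page. A sketch that avoids the three issues named above (metadata, spaces in names, undetected renames) might look like the following; tree_md5 is a hypothetical name, and GNU find, sort and xargs are assumed.

```shell
# Digest of per-file MD5 lines ("hash  ./path"), sorted by path.
# Only content and relative paths are hashed, never timestamps or ownership.
tree_md5() {
    (cd "$1" && find . -type f -print0 |
        LC_ALL=C sort -z |     # NUL-separated, so spaces in names are safe
        xargs -r0 md5sum |     # each output line carries the path, so renames show up
        md5sum | awk '{print $1}')
}

# Usage: tree_md5 "$path"
```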
For the sake of completeness, there's md5deep(1); it's not directly applicable due to *.py filter requirement but should do fine together with find(1).
If you want one MD5 hash value spanning the whole directory, I would do something like
Checksum all files, including both content and their filenames
Same as above, but only including .py files
You can also follow symlinks if you want
Other options you could consider using with grep
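The commands behind these captions are not preserved. Since the last caption mentions grep, here is a sketch of how a recursive grep can feed a checksum, assuming GNU grep. The function name grep_tree_md5 and its two parameters are hypothetical, and the sort step is my addition to keep the result independent of traversal order.

```shell
# Recursive grep prints "./path:line" for every line of every file,
# so hashing that output covers both file names and contents.
grep_tree_md5() {
    # $1 = directory, $2 = optional glob such as '*.py' (hypothetical parameters)
    # -a treats binary files as text; -r recurses (use -R to also follow symlinks).
    (cd "$1" && grep -ar ${2:+"--include=$2"} -e '' . |
        LC_ALL=C sort | md5sum | awk '{print $1}')
}

# Usage: grep_tree_md5 /path/to/dir          # all files
#        grep_tree_md5 /path/to/dir '*.py'   # only .py files
```

Empty files produce no grep output, so they do not influence this digest.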
GNU find
Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files and touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if the file has changed. My unix-fu is weak, so you might need some more command-line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).
Using md5deep:
md5deep -r FOLDER | awk '{print $1}' | sort | md5sum
I want to add that if you are trying to do this for files/directories in a Git repository to track if they have changed, then this is the best approach:
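The Git command is missing from this page. One way to fingerprint a tracked directory, which may or may not be what the answer used, is to hash the index listing for that path. The name git_dir_fingerprint and its parameters are hypothetical; git ls-files -s and git hash-object --stdin are standard Git commands.

```shell
# Fingerprint of the tracked files under a path, based on the Git index.
git_dir_fingerprint() {
    # $1 = repository root, $2 = path inside the repository (hypothetical parameters)
    # ls-files -s prints mode, blob hash, stage and path for each tracked file.
    git -C "$1" ls-files -s -- "$2" | git hash-object --stdin
}

# Usage: git_dir_fingerprint /path/to/repo some/subdir
```

The result is a Git object hash rather than an MD5, and it reflects what is staged in the index, so unstaged edits in the working tree are not seen.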
And if it's not a Git directory/repository, then the answer by ire_and_curses is probably the best bet:
However, please note that the tar command will produce a different output hash if you run it on a different OS and so on. If you want to be immune to that, this is the best approach, even though it doesn't look very elegant at first sight:
md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.
find . -type f -print0 \
| xargs -r0 md5sum \
| grep -v ".env" \
| grep -v "vendor/autoload.php" \
| grep -v "vendor/composer/" \
| sort -d \
| md5sum
If you really want independence from the file system attributes and from the bit-level differences of some tar versions, you could use cpio:
I had the same problem, so I came up with this script. It lists the MD5 hash values of the files in a directory, and if it finds a subdirectory it runs again from there. For this to work, the script has to be able to run on the current directory or on a subdirectory, when one is passed as $1.
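The script itself is not included on this page. A minimal sketch of the behaviour described, written as a shell function with the hypothetical name list_md5, could be:

```shell
# Print the MD5 of every file in a directory and recurse into subdirectories.
# With no argument it starts from the current directory, otherwise from $1.
list_md5() {
    local dir entry
    dir=${1:-.}
    for entry in "$dir"/*; do
        if [ -d "$entry" ]; then
            list_md5 "$entry"      # descend into the subdirectory
        elif [ -f "$entry" ]; then
            md5sum "$entry"        # prints "hash  path"
        fi
    done
}

# Usage: list_md5            # current directory
#        list_md5 some/dir   # a specific directory
```

The plain * glob skips hidden files, and the output is a list of hashes rather than a single value; pipe it through md5sum if one summary checksum is wanted.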
There are two more solutions:
Create:
Check:
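The commands for both steps are missing here, so the following is only an assumed reading of "create" and "check": write a manifest of per-file checksums once, then verify the tree against it later with md5sum -c. Both function names are hypothetical, and GNU md5sum, find, sort and xargs are assumed.

```shell
# Create: write "hash  ./path" lines for every file under $1 into manifest $2.
# $2 should be an absolute path outside the directory being hashed.
create_manifest() {
    (cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -r0 md5sum) > "$2"
}

# Check: re-hash the files listed in manifest $2 and compare; non-zero exit on mismatch.
check_manifest() {
    (cd "$1" && md5sum -c --quiet "$2")
}

# Usage: create_manifest /path/to/dir /tmp/dir.md5
#        check_manifest  /path/to/dir /tmp/dir.md5
```

A check like this catches modified or missing files, but not files added after the manifest was written.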