Suggestions needed for a faster Linux fingerprint/hash comparison

Posted 2024-10-28 21:52:49

I'm currently using the Linux md5sum command in a bash script on a very lightweight (low-processor, low-memory) Linux device to compute and record the checksums of thousands of similarly named 32MB files in a single directory.

md5sum ./file* >fingerprint.txt

The next day, I repeat the process on the same set of files and programmatically compare the results against the prior day's hashes. When I find that the fingerprint of a file has changed between day 1 and day 2, I take action on that specific file. If the file remained unchanged, I take no action and continue my comparison.

The problem that I'm running into is that the md5 method takes a LONG time to process on each file. The comparison needs to be completed within a certain time-frame and I'm starting to bump into incidents where the entire process simply takes too long.

Is there some other method/tool I could use to reliably perform this kind of comparison? (Note: a date comparison on the files is not adequate, and the file sizes remain a constant 32MB.)
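For the comparison step itself, md5sum's own check mode can replace hand-rolled diff logic. A minimal sketch of the day-over-day flow, assuming GNU coreutils md5sum; the directory and file names here are made up for illustration:

```shell
#!/bin/sh
dir=/tmp/fpdemo; mkdir -p "$dir"; cd "$dir" || exit 1
printf 'aaa' > file1; printf 'bbb' > file2

# Day 1: record fingerprints exactly as in the question
md5sum ./file* > fingerprint.txt

# Day 2: simulate one modified file, then let md5sum -c find it.
printf 'changed' > file2
# --quiet prints only mismatching files, as "name: FAILED"
md5sum -c --quiet fingerprint.txt 2>/dev/null \
  | sed 's/: FAILED$//' > changed.txt

cat changed.txt   # ./file2
```

This doesn't make the hashing itself any faster, but it removes the separate comparison pass: only the names of changed files come out.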


2 Answers

最好是你 2024-11-04 21:52:49

MD5 is supposed to be fast among cryptographic hash functions, but any given implementation may make choices which, on a specific machine, imply suboptimal performance. What kind of hardware do you use? Processor type and L1 cache size are quite important.

You may want to have a look at sphlib: a library implementing many cryptographic hash functions in C (optimized, but portable) and Java. The C code can be compiled with an additional "small footprint" flag which helps on small embedded platforms (mainly due to L1 cache size issues). The code also comes with an md5sum-like command-line utility and a speed benchmark tool.

Among hash functions, MD4 is usually the fastest, but on some platforms Panama, Radiogatun[32] and Radiogatun[64] can achieve similar or better performance. You may also want to look at some of the SHA-3 candidates, in particular Shabal, which is quite fast on small 32-bit systems.

Important note: some hash functions are "broken", in that it is possible to create collisions: two distinct input files which hash to the same value (exactly what you want to avoid). MD4 and MD5 are thus "broken". However, a collision must be produced on purpose; you will not hit one out of (bad) luck (the probability is smaller than that of a "collision" caused by a hardware error during the computation). If you are in a security-related situation (someone may want to actively provoke a collision), things are more difficult. Among the functions I cite, Radiogatun and Shabal are currently unbroken.
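Since per-machine performance varies as the answer stresses, the cheapest next step is to time a few digests on the device itself. A rough sketch, assuming bash and openssl are installed (which algorithms are available depends on the OpenSSL build; the file path and algorithm list are illustrative):

```shell
#!/bin/bash
# Time several digests on one 32MB file, matching the size in the question.
dd if=/dev/zero of=/tmp/bench.bin bs=1M count=32 2>/dev/null

for alg in md5 sha1 sha256; do
  printf '%-7s' "$alg"
  # Wall-clock ("real") time is what matters for the nightly window
  { time openssl dgst -"$alg" /tmp/bench.bin >/dev/null; } 2>&1 | grep real
done
```

Multiplying the per-file time by the file count gives a quick estimate of whether a faster digest alone would fit the time window, or whether disk throughput is the real ceiling.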

短暂陪伴 2024-11-04 21:52:49

Ways to speed it up:

  • If you have multiple cores you could run more than one md5sum process at a time. But I suspect your problem is disk access, in which case this may not help.
  • Do you really need an MD5 hash? Check the modification date/time, size and inode instead of the hash for a quick check.
  • Consider performing the quick check daily and the slow MD5 check weekly.
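The first bullet can be sketched with GNU xargs, which fans the hashing out over several processes. The directory, file names and core count below are illustrative:

```shell
#!/bin/sh
dir=/tmp/pardemo; mkdir -p "$dir"; cd "$dir" || exit 1
for i in 1 2 3 4 5 6 7 8; do printf 'data%s' "$i" > "file$i"; done

# -P 4 runs up to four md5sum processes at once; sort keeps the
# output order stable, since completion order varies between runs.
find . -maxdepth 1 -name 'file*' -print0 \
  | xargs -0 -P 4 -n 2 md5sum \
  | sort -k 2 > fingerprint.txt

wc -l < fingerprint.txt   # 8
```

As the answer notes, this only helps if the CPU is the bottleneck; on a single slow disk the parallel readers will just contend for I/O.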

I suspect you don't really need to do an MD5 hash of every file every time; you might be better off carefully considering your actual requirements and what minimal solution will meet them.
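The quick check suggested above can be a metadata snapshot that is diffed day over day. A sketch assuming GNU coreutils stat (paths and file names are made up):

```shell
#!/bin/sh
dir=/tmp/quickcheck; mkdir -p "$dir"; cd "$dir" || exit 1
printf 'data' > file1; printf 'more' > file2

# Day 1 snapshot: name, mtime, size, inode (GNU stat format sequences)
stat --format '%n %Y %s %i' ./file* > meta_day1.txt

touch -d '2099-01-01' file2            # simulate a later modification
stat --format '%n %Y %s %i' ./file* > meta_day2.txt

# Lines that differ name the files worth a full (slow) re-hash
diff meta_day1.txt meta_day2.txt | awk '/^>/ { print $2 }'   # ./file2
```

This reads only inode metadata, not 32MB of data per file, so it is orders of magnitude cheaper; the expensive hash can then be limited to the files the snapshot flags.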
