Combining MD5 hash values
When calculating a single MD5 checksum on a large file, what technique is generally used to combine the various MD5 values into a single value? Do you just add them together? I'm not really interested in any particular language, library or API which will do this; rather I'm just interested in the technique behind it. Can someone explain how it is done?
Given the following algorithm in pseudo-code:
    MD5Digest X
    for each file segment F
        MD5Digest Y = CalculateMD5(F)
        Combine(X, Y)
But what exactly would Combine do? Does it add the two MD5 digests together, or what?
7 Answers
With that in mind, you don't want to "combine" two MD5 hashes. With any MD5 implementation, you have an object that keeps the current checksum state. So you can extract the MD5 checksum at any time, which is very handy when hashing two files that share the same beginning. For big files, you just keep feeding in data - there's no difference whether you hash the file at once or in blocks, as the state is remembered. In both cases you will get the same hash.
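For instance, a minimal Java sketch of that property (the input data and the split point are arbitrary):

    import java.security.MessageDigest;
    import java.util.Arrays;

    public class SameHash {
        public static void main(String[] args) throws Exception {
            byte[] data = "pretend this is a large file".getBytes("UTF-8");

            // Hash everything in one call...
            byte[] atOnce = MessageDigest.getInstance("MD5").digest(data);

            // ...or feed the same bytes in two chunks; the state object remembers.
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(data, 0, 10);
            md.update(data, 10, data.length - 10);
            byte[] inChunks = md.digest();

            System.out.println(Arrays.equals(atOnce, inChunks)); // prints true
        }
    }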
MD5 is an iterative algorithm. You don't need to calculate a ton of small MD5s and then combine them somehow. You just read small chunks of the file and add them to the digest as you're going, so you never have to have the entire file in memory at once. Here's a Java implementation.
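A minimal sketch of what such an implementation might look like, using java.security.MessageDigest (the file name and buffer size are placeholders):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    public class FileMd5 {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new FileInputStream("bigfile.bin")) { // placeholder path
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n); // feed each chunk into the running digest
                }
            }
            byte[] digest = md.digest(); // finalize once all chunks are in
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }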
Et voilà. You have the MD5 of an entire file without ever having the whole file in memory at once.
It's worth noting that if for some reason you do want MD5 hashes of subsections of the file as you go along (this is sometimes useful for doing interim checks on a large file being transferred over a low-bandwidth connection), then you can get them by cloning the digest object at any time, like so:
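A sketch of that trick, assuming the provider's MessageDigest supports clone() (the default JDK MD5 implementation does):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class InterimMd5 {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update("first segment".getBytes("UTF-8")); // data received so far

            // Clone the running state; finalizing the clone leaves the original intact.
            MessageDigest snapshot = (MessageDigest) md.clone();
            byte[] interim = snapshot.digest(); // MD5 of the data fed in so far
            System.out.printf("interim: %032x%n", new BigInteger(1, interim));

            md.update("second segment".getBytes("UTF-8")); // keep feeding the original
            byte[] full = md.digest(); // MD5 of the whole stream
            System.out.printf("full:    %032x%n", new BigInteger(1, full));
        }
    }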
This does not affect the actual digest object, so you can continue to work with the overall MD5 hash.
It's also worth noting that MD5 is an outdated hash for cryptographic purposes (such as verifying file authenticity from an untrusted source) and should be replaced with something better in most circumstances, such as SHA-1. For non-cryptographic purposes, such as verifying file integrity between two trusted sources, MD5 is still adequate.
A Python 2.7 example for AndiDog's answer. File 123.txt has multiple lines.
For large files that can't fit in memory, the data can be read line by line or chunk by chunk. One use of this MD5 is comparing two large files when the diff command fails.
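A minimal sketch along those lines, using hashlib (the file name 123.txt comes from the answer; the code runs under Python 2.7 as well as 3):

    import hashlib

    md5 = hashlib.md5()
    with open('123.txt', 'rb') as f:
        for line in f:            # read line by line instead of all at once
            md5.update(line)
    print(md5.hexdigest())

    # For binary data, fixed-size chunks work just as well:
    # for chunk in iter(lambda: f.read(8192), b''):
    #     md5.update(chunk)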
The OpenSSL library allows you to add blocks of data to an ongoing hash (SHA-1/MD5), and when you have finished adding all the data you call the Final method, which will output the final hash. You don't calculate MD5 on each individual block and then add it; rather, you add the data to the ongoing hash method from the OpenSSL library. This will then give you an MD5 hash of all the individual data blocks, with no limit on the input data size.
http://www.openssl.org/docs/crypto/md5.html#
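A minimal sketch of that pattern with OpenSSL's MD5_Init/MD5_Update/MD5_Final API (the file name is a placeholder; these functions are deprecated since OpenSSL 3.0 in favor of the EVP interface):

    #include <stdio.h>
    #include <openssl/md5.h>

    int main(void) {
        MD5_CTX ctx;
        MD5_Init(&ctx);                       /* start the ongoing hash */

        FILE *f = fopen("bigfile.bin", "rb"); /* placeholder file name */
        if (!f) return 1;

        unsigned char buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            MD5_Update(&ctx, buf, n);         /* add each block of data */
        fclose(f);

        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5_Final(digest, &ctx);              /* output the final hash */

        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        putchar('\n');
        return 0;
    }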
This question doesn't make much sense, as the MD5 algorithm takes input of any length. A decent library should have functions so that you don't have to add the entire message at a single time: the message is broken down into blocks and hashed sequentially, with each block's processing depending only on the resulting hash from the previous round.
The pseudocode in the Wikipedia article should give an overview of how the algorithm works.
Most digest calculation implementations allow you to feed them the data in smaller blocks. You can't combine multiple MD5 digests in a way that makes the result equal to the MD5 of the entire input. MD5 does some padding and uses the number of processed bytes in the final stage, which makes the original engine state unrecoverable from the final digest value.
Here is a C# way to combine hashes. Let's make extension methods to simplify the user code.
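A plausible reconstruction using System.Security.Cryptography (the method names AppendBlock and FinishWith are assumptions; the real work is done by TransformBlock and TransformFinalBlock):

    using System.Security.Cryptography;

    public static class Md5Extensions
    {
        // Feed an intermediate chunk into the ongoing hash state.
        public static void AppendBlock(this MD5 md5, byte[] chunk)
        {
            md5.TransformBlock(chunk, 0, chunk.Length, null, 0);
        }

        // Feed the final chunk, finish the computation, and return the digest.
        public static byte[] FinishWith(this MD5 md5, byte[] lastChunk)
        {
            md5.TransformFinalBlock(lastChunk, 0, lastChunk.Length);
            return md5.Hash;
        }
    }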
Usage:
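(A hypothetical example; the data and the two-chunk split are arbitrary.)

    using System;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    class Demo
    {
        static void Main()
        {
            byte[] part1 = Encoding.UTF8.GetBytes("hello ");
            byte[] part2 = Encoding.UTF8.GetBytes("world");

            // h1: hash the whole message in one call
            byte[] h1;
            using (var md5 = MD5.Create())
                h1 = md5.ComputeHash(part1.Concat(part2).ToArray());

            // h2: feed the same message in two chunks through the ongoing state
            byte[] h2;
            using (var md5 = MD5.Create())
            {
                md5.AppendBlock(part1);
                h2 = md5.FinishWith(part2);
            }

            Console.WriteLine(h1.SequenceEqual(h2)); // True
        }
    }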
h1 and h2 are the same. That's it.