我应该使用什么校验和算法?
我正在构建一个系统,该系统需要能够查找字节块是否已更新。 我认为我应该计算它的校验和,存储它并稍后计算相同的校验和,以查看该 blob 是否已更新,而不是存储整个 blob(它们最多可达 5MB)。
目标是最小化以下内容(按顺序):
- 冲突可能性的校验和时间的大小
- 计算
- (即使内容已被修改,也会发生 2 个相同的校验和)。
我们的系统的碰撞次数不超过 1/1,000,000 是可以接受的。关心的不是安全性,而只是更新/错误检测,因此很少发生冲突是可以的。 (这就是为什么我把它放在要最小化的事情的最后)。
此外,我们无法自己修改文本块。
当然,我会想到 md5
、crc
或 sha1
,如果我想要一个快速的解决方案,我会这样做。然而,我不仅仅是在寻找快速解决方案,而是在寻找不同方法及其优缺点的比较。
I'm building a system which needs to be able to find if blobs of bytes have been updated.
Rather than storing the whole blob (they can be up to 5MBs), I'm thinking I should compute a checksum of it, store this and compute the same checksum a little bit later, to see whether the blob has been updated.
The goal is to minimize the following (in that order) :
- size of the checksum
- time to compute
- likeliness of collisions (2 identical checksums happening even if the content has been modified).
It is acceptable for our system to have collision not more than 1/1,000,000. The concern is not security, but simply update/error detection, so rare collisions are ok. (Which is why I put it last in the things to minimize).
Also, we cannot modify the blobs of text ourselves.
Of course, md5
, crc
or sha1
come to mind, and if I wanted a quick solution, I'd go for it. However, more than a quick solution, I'm looking for what could be a comparison of different methods as well as the pros and cons.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
Blake2 是您可以使用的最快的哈希函数,主要采用:
SHA-3 竞赛的获胜者是 Keccak算法但尚未流行的实现,在 GNU/Linux 发行版中默认不采用。相反,作为 SHA-3 竞赛候选者的 Blake2 比 Keccak 更快,并且是 GNU coreutils。因此,在 GNU/Linux 发行版上,您可以使用
b2sum
来使用 Blake2 哈希算法。Blake2 is the fastest hash function you can use and that is mainly adopted:
Winner of SHA-3 contest was Keccak algorithm but is not yet has a popular implementation is not adopted by default in GNU/Linux distributions. Instead, Blake2 that was a SHA-3 contest candidate is faster than Keccak and is part of GNU coreutils. So on you GNU/Linux distribution you can use
b2sum
to use Blake2 hash algorithm.我建议您查看此页面,CRC 与 MD5/SHA1。
这个其他线程中讨论了速度和碰撞。
一如既往,维基百科是您的朋友。
如果我必须选择,有一个重要的问题需要回答:你是否希望在任何情况下都不会发生碰撞——或者至少希望这种可能性如此之低,以至于接近月球与地球相撞的可能性在接下来的 5 分钟内?
如果是,请选择 SHA 系列。
对于您的情况,我会更改更新检查的完成方式。
例如,增量编号可以与 blob 关联,并代替哈希发送,如果另一个编号不同,则需要更新请求边。在这种情况下,碰撞概率从 ~10^-18 到 ~0(基本上是 0 + bug 概率)...
编辑以下评论
找到了这个算法,Adler-32 ,这对于 CRC 为 32 位的长消息 (MB) 很有用,即大约 ~1/10^9(MD5 的长度为 128 位)。
计算速度很快。
Adler-32。底部有一些示例(链接)。
I suggest you have a look to this SO page, CRC vs MD5/SHA1.
Speed and collisions are discussed in this other thread.
And as always Wikipedia is your friend.
If I had to choose, there is an important question to answer: do you want that in any case there is no collision - or, at least, that the probability is so low that it is close to the chance that the Moon collides with Earth within the next 5 minutes?
If yes, choose the SHA family.
In your case I would change the way the update check is being done.
For instance, an incremental number could be associated with the blob, and be sent instead of the hash, the request for update would be required if the number is different on the other side. The collision probability in this case goes from ~10^-18 to ~0 (basically 0 + bug probability )...
Edit following comments
Found this algorithm, Adler-32, which is good for long messages (MB) with a CRC of 32 bits, i.e. about ~1/10^9 (MD5 is 128 bits long).
It is fast to calculate.
Adler-32. There is some come sample (link) at the bottom.