Say I'm using a hash to identify files, so I don't need it to be secure, I just need to minimize collisions. I was thinking that I could speed the hash up by running four hashes in parallel using SIMD and then hashing the final result. If the hash is designed to take a 512-bit block, I just step through the file taking 4×512-bit blocks in one go and generate four hashes out of that; then at the end of the file I hash the four resulting digests together.
I'm pretty sure that this method would produce poorer hashes... but how much poorer? Any back-of-the-envelope calculations?
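For concreteness, here is a rough sketch of the scheme described above. It is not a real SIMD implementation; Python's `hashlib` stands in for the underlying compression function, and the lane count (4) and block size (512 bits) are the ones assumed in the question. The function name `parallel_file_hash` is my own.

```python
import hashlib

LANES = 4
BLOCK = 64  # 512 bits, the block size assumed in the question

def parallel_file_hash(data: bytes) -> bytes:
    """Hash `data` with four interleaved lanes, then hash the four digests.

    A SIMD implementation would run the four compression functions in
    one pass; here the lanes run sequentially purely for illustration.
    """
    lanes = [hashlib.sha256() for _ in range(LANES)]
    stride = LANES * BLOCK
    for off in range(0, len(data), stride):
        chunk = data[off:off + stride]
        # Feed each 512-bit sub-block to its own lane; short or empty
        # slices at the end of the file are handled transparently.
        for i in range(LANES):
            lanes[i].update(chunk[i * BLOCK:(i + 1) * BLOCK])
    # Combine: hash the concatenation of the four lane digests.
    final = hashlib.sha256()
    for lane in lanes:
        final.update(lane.digest())
    return final.digest()
```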
Comments (1)
The idea that you can read blocks of the file from disk quicker than you can hash them is, well, an untested assumption? Disk IO - even SSD - is many orders of magnitude slower than the RAM that the hashing is going through.
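One way to sanity-check that assumption is to measure hash throughput directly and compare it against your disk's sequential read speed. A quick-and-dirty sketch (the function name, algorithm choice, and buffer size are my own):

```python
import hashlib
import time

def hash_throughput_mb_s(algo: str = "sha256", size: int = 64 * 1024 * 1024) -> float:
    """Hash a zero-filled in-memory buffer and return throughput in MB/s."""
    data = b"\x00" * size
    t0 = time.perf_counter()
    hashlib.new(algo).update(data)
    elapsed = time.perf_counter() - t0
    return size / elapsed / 1e6
```

If the number this prints is well above your drive's read bandwidth, hashing is not the bottleneck and the parallel scheme buys nothing.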
Ensuring low collisions is a design criterion for all hashes, and all mainstream hashes do a good job of it - just use a mainstream hash, e.g. MD5.
Specific to the solution the poster is considering, it's not a given that parallel hashing weakens the hash. There are hashes specifically designed for parallel hashing of blocks and combining the results, just as the poster describes, although perhaps not yet in widespread adoption (e.g. MD6, which was withdrawn, unbroken, from the SHA-3 competition).
More generally, there are mainstream implementations of hash functions that do use SIMD. Hash implementers are very performance-aware and do take the time to optimise their implementations; you'd have a hard job matching their effort. The best software for strong hashing runs at around 6 to 10 cycles/byte. Hardware-accelerated hashing is also available if hashing really is the bottleneck.