这是 python 内置哈希函数的适当使用吗？

发布于 2024-12-08 14:12:45 字数 1152 浏览 3 评论 0原文

我需要比较大块数据的相等性，并且需要每秒比较许多对，快。每个对象都保证具有相同的长度，有可能并且很可能只在未知位置存在细微的差异。

下面的计时显示，如果数据开头附近存在差异，则使用 == 运算符会非常快，而如果差异位于数据末尾，则使用 == 运算符会非常慢。

>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

在我的用例中，差异可能位于字节的中间或末尾（上下文：它是未压缩的图像数据）。我寻找一种使用散列或校验和来加快速度的方法。使用 md5 速度较慢，但 Python 的内置 hash 实际上确实加快了速度。

>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

我对这个哈希的技术细节感兴趣，它是否足够像哈希一样，当 hash(a) == hash(b) 时，a == b 是 < em>很有可能？如果哈希冲突相当罕见，则误报是可以接受的，其目的是在平均情况下加速比较的快速路径。

原文

I need to compare large chunks of data for equality, and I need to compare many pairs per second, fast. Each object is guaranteed to be the same length, it is possible and likely that there may only be slight differences located at unknown positions.

Timings below show that using == operator is very fast if there is a difference near the start of the data, and significantly slower if differences are located towards the end.

>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my use case, differences may be located towards the middle or the end of the bytes (context: it is uncompressed image data). I looked for a way to speed things up using a hash or checksum. Using md5 was slower but Python's built-in hash did actually speed things up.

>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I'm interested in technical detail of this hash, is it sufficiently hash-like that when hash(a) == hash(b) then a == b is very likely? False positives are acceptable if a hash collision is reasonably rare, the intention is a fast-path to speed up comparisons in the average case.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

邮友 2024-12-15 14:12:45

Python 的哈希函数专为提高速度而设计，并映射到 64 位空间。由于生日悖论，这意味着您可能会在大约 50 亿个条目时发生冲突（可能更早，因为哈希函数不是加密的）。此外，hash 的精确定义取决于 Python 实现，并且可能是特定于体系结构甚至特定于机器的。如果您希望在多台机器上获得相同的结果，请不要使用它。

md5 被设计为加密哈希函数；即使输入中的轻微扰动也会完全改变输出。它还映射到 128 位空间，这使得您根本不可能遇到碰撞，除非您专门寻找碰撞。

如果您可以处理冲突（即测试存储桶中所有成员之间的相等性，可能通过使用 MD5 或 SHA2 等加密算法），Python 的哈希函数就完全没问题。

还有一件事：为了节省空间，如果将数据写入磁盘，则应该以二进制形式存储数据。（即 struct.pack('!q', hash('abc')) / hashlib.md5('abc').digest()）。

作为旁注：是< /a> 不等同于 Python 中的 ==。你的意思是==。