Fast, large-width, non-cryptographic string hashing in Python



I have a need for a high-performance string hashing function in python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those, every accepted/upvoted answer I could find fell into one of a few categories, none of which applies (for the reasons given).

  • Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
  • Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
  • Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!

An ideal solution would have the following properties, in a relative, loose order of importance.

  1. Have an output range extending at least 34 bits long, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
  2. Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
  3. High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
  4. Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
  5. Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.

'Perturbed' hash example, where the hash value is changed drastically by a small integer value n

def perturb_hash(key,n):
    return hash((key,n))
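As a rough illustration of what items 1, 2, 4 and 5 might look like together, here is a minimal pure-Python sketch of a seedable 64-bit string hash using an FNV-1a-style mix (the function name and the seed handling are my own assumptions, not the built-in algorithm; FNV-1a is portable and easy to seed, though its avalanche quality is weaker than MurmurHash-class functions):

def fnv1a_64(key, seed=0):
    # 64-bit FNV-1a: start from the offset basis XORed with the seed,
    # then mix each byte in with the 64-bit FNV prime, masking to 64 bits.
    h = 14695981039346656037 ^ seed
    for byte in bytearray(key.encode('utf-8')):
        h ^= byte
        h = (h * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h

print(fnv1a_64('ACGTACGT'))        # identical on any machine or run
print(fnv1a_64('ACGTACGT', 123))   # 'perturbed' by a small seed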

Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)

Thanks!

Answers (6)

五里雾 2024-11-01 03:19:17


Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, either pure or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)

If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

Update: I wrote a quick Python wrapper for the Murmur3 hash function. The GitHub project is here and you can find it on the Python Package Index as well; it just needs a C++ compiler to build; no Boost required.

Usage example and timing comparison:

import murmur3
import timeit

# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)

# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()

t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()

Output:

15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
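As a side note (an assumption beyond the original answer, not the wrapper described above), the widely used mmh3 package exposes the same MurmurHash3 x64 128-bit variant with a seed parameter, so the "throw away the bits you don't need" approach looks roughly like this:

import mmh3

# 128-bit MurmurHash3 with a seed; keep only the low 64 bits
h128 = mmh3.hash128('samplebias', seed=123)
print(h128 & 0xFFFFFFFFFFFFFFFF)

# or take one half of the 128-bit result directly (two signed 64-bit ints)
lo, hi = mmh3.hash64('samplebias', seed=123)
print(lo)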
嘴硬脾气大 2024-11-01 03:19:17


BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!

Since Python 3, it's fed a different seed every time the interpreter starts (I don't know more details), so it generates different values every time -- but not for native numeric types.

$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-1756730906053498061 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4556027264747844925 322818021289917443
$ python3 -c 'print(hash("Hello!"), hash(3.14))'
-4403217265550417031 322818021289917443
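If repeatability across runs on the same build is all that is needed, the documented PYTHONHASHSEED environment variable pins the seed (setting it to 0 disables hash randomization entirely); note the underlying algorithm is still not specified across Python versions, so this is not a substitute for a truly portable hash:

$ PYTHONHASHSEED=0 python3 -c 'print(hash("Hello!"))'
$ PYTHONHASHSEED=0 python3 -c 'print(hash("Hello!"))'
# both runs print the same value on a given interpreter build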
痴情 2024-11-01 03:19:17


Have a look at xxHash; there's also a pip package.

xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical across all platforms (little / big endian).

I've been using xxHash for a long time (my typical use case is hashing strings -- not for security purposes) and I'm really satisfied with its performance.
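For example, with the xxhash pip package (assuming a reasonably recent version is installed), a seedable 64-bit digest looks roughly like this:

import xxhash

# 64-bit hash; the seed plays the role of the 'perturbation' input
print(xxhash.xxh64('ACGTACGT', seed=123).intdigest())
print(xxhash.xxh64_intdigest('ACGTACGT', seed=123))  # same value, one-shot form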

真心难拥有 2024-11-01 03:19:17


Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.

That's not true. The built-in hash function will generate a 64-bit hash on a 64-bit system.

This is the python str hashing function from Objects/stringobject.c (Python version 2.7):

static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;      /* Notice the 64-bit hash, at least on a 64-bit system */

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
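To connect this with the question's idea of using str.__hash__ as a prototype, here is a hedged pure-Python rendering of the same loop, explicitly masked to 64 bits so the width no longer depends on the platform's long size (hypothetical helper name; the caching and the -1 sentinel from the C version are omitted, and the masking is my addition):

def py27_style_hash64(s):
    # Pure-Python port of the C loop above, masked to 64 bits.
    data = bytearray(s.encode('utf-8'))
    if not data:
        return 0  # CPython hashes the empty string to 0
    x = (data[0] << 7) & 0xFFFFFFFFFFFFFFFF
    for byte in data:
        x = ((1000003 * x) ^ byte) & 0xFFFFFFFFFFFFFFFF
    x ^= len(data)
    return x

print(py27_style_hash64('hello'))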
林空鹿饮溪 2024-11-01 03:19:17


"strings": I'm presuming you wish to hash Python 2.x str objects and/or Python3.x bytes and/or bytearray objects.

This may violate your first constraint, but: consider using something like

(zlib.adler32(strg, perturber) << N) ^ hash(strg)

to get a (32+N)-bit hash.
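A small runnable sketch of that idea on Python 3 (note that zlib.adler32 wants bytes there, and hash() is randomized per process unless PYTHONHASHSEED is pinned, which limits portability; the masking to n bits is my addition so the result really is (32+n) bits wide):

import zlib

def wide_hash(strg, perturber=1, n=32):
    data = strg.encode('utf-8')
    # adler32 provides the high 32 bits, built-in hash() fills the low n bits
    return (zlib.adler32(data, perturber) << n) ^ (hash(strg) & ((1 << n) - 1))

print(wide_hash('ACGTACGT', perturber=123))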

魂牵梦绕锁你心扉 2024-11-01 03:19:17


If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.
