字符串的哈希函数

柠北森屋 2024-12-15 12:22:21

我使用 Dan Bernstein 的 djb2 取得了不错的效果。

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

I've had nice results with djb2 by Dan Bernstein.

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

回复收藏 0 原文

画骨成沙 2024-12-15 12:22:21

首先，您通常不想对哈希表使用加密哈希。按照加密标准来看非常的算法，按照哈希表标准来看仍然慢得令人难以忍受。

其次，您要确保输入的每一位都可以/将会影响结果。一种简单的方法是将当前结果旋转一定位数，然后将当前哈希码与当前字节进行异或。重复直到到达字符串的末尾。请注意，您通常也不希望旋转是字节大小的偶数倍。

例如，假设 8 位字节的常见情况，您可能会旋转 5 位：

int hash(char const *input) { 
    int result = 0x55555555;

    while (*input) { 
        result ^= *input++;
        result = rol(result, 5);
    }
}

编辑：另请注意，对于哈希表大小来说，10000 个槽很少是一个好的选择。您通常需要以下两件事之一：您要么需要质数作为大小（确保某些类型的哈希解析的正确性所需），要么需要 2 的幂（因此可以通过简单的方法将值减少到正确的范围）位掩码）。

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:

int hash(char const *input) { 
    int result = 0x55555555;

    while (*input) { 
        result ^= *input++;
        result = rol(result, 5);
    }
}

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).

回复收藏 0 原文

余生共白头 2024-12-15 12:22:21

我想验证卞晓宁的回答，可惜他没有贴出他的代码。因此，我实现了一个小测试套件，并在列表上运行了不同的小哈希函数466K 英语单词查看每个单词的冲突数量：

Hash function      |     Collisions | Time (words) | Time (file) | Avalanching
===============================================================================
CRC32              |    23 (0.005%) |     96.1 ms  |   120.9 ms  |      99.8%
MurmurOAAT         |    26 (0.006%) |     17.2 ms  |    19.4 ms  |     100.0%
FNV hash           |    32 (0.007%) |     16.4 ms  |    14.0 ms  |      98.6%
Jenkins OAAT       |    36 (0.008%) |     19.9 ms  |    18.5 ms  |     100.0%
DJB2 hash          |   344 (0.074%) |     18.2 ms  |    12.7 ms  |      92.9%
K&R V2             |   356 (0.076%) |     18.0 ms  |    11.9 ms  |      92.8%
Coffin             |   763 (0.164%) |     16.3 ms  |    11.7 ms  |      91.4%
x17 hash           |  2242 (0.481%) |     18.6 ms  |    20.0 ms  |      91.5%
-------------------------------------------------------------------------------
large_prime        |    25 (0.005%) |     16.5 ms  |    18.6 ms  |      96.6%
MurmurHash3_x86_32 |    19 (0.004%) |     20.5 ms  |     4.4 ms  |     100.0%

我包括了两者的时间：分别对所有单词进行哈希处理，并对所有英语单词的整个文件进行一次哈希处理。我还在我的测试中添加了一个更复杂的 MurmurHash3_x86_32 以供参考。此外，我还计算了“雪崩”，它是衡量连续单词的哈希值的不可预测性的指标。任何低于 100% 的值都意味着“相当可预测”（这对于哈希表可能没问题，但对于其他用途来说很糟糕）。

结论：

在 Intel x86-64（或 AArch64）架构上使用流行的 DJB2 哈希函数来处理字符串几乎没有意义。因为它比类似的函数有更多的冲突（FNV, 詹金斯OAAT 和 MurmurOAAT），同时具有可比的吞吐量。 Bernstein 的 DJB2 在短弦上表现尤其糟糕。冲突示例：Liz/MHz、Bon/COM、Rey/性别。
如果必须使用“乘以奇数”方法（例如 DJB2 = x33、K&R V2 = x31 或 SDBM = x65599），请选择较大的 32 位素数而不是较小的 8 位素数。它对于减少碰撞次数有很大的影响。例如，hash = 17000069 * hash + s[i] 仅产生 25 次冲突，而 DJB2 则产生 344 次冲突。

测试代码（使用 gcc -O2 编译）：

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define MODE     1          // 1 = hash each word individually; 2 = hash the entire file
#define MAXLEN   50         // maximum length of a word
#define MAXWORDS 500000     // maximum number of words in a file
#define SEED     0x12345678

#if MODE == 1
char words[MAXWORDS][MAXLEN];
#else
char file[MAXWORDS * MAXLEN];
#endif

uint32_t DJB2_hash(const uint8_t *str)
{
    uint32_t hash = 5381;
    uint8_t c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}

uint32_t FNV(const void* key, uint32_t h)
{
    // See: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    h ^= 2166136261UL;
    const uint8_t* data = (const uint8_t*)key;
    for(int i = 0; data[i] != '\0'; i++)
    {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
    // One-byte-at-a-time hash based on Murmur's mix
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    for (; *str; ++str) {
        h ^= *str;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const char *s)
{
    // Source: https://stackoverflow.com/a/45641002/5407270
    // a.k.a. Java String hashCode()
    uint32_t hashval = 0;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const char *str)
{
    uint32_t hash, i;
    for (hash = i = 0; str[i] != '\0'; ++i)
    {
        hash += str[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

uint32_t crc32b(const uint8_t *str) {
    // Source: https://stackoverflow.com/a/21001712
    unsigned int byte, crc, mask;
    int i = 0, j;
    crc = 0xFFFFFFFF;
    while (str[i] != 0) {
        byte = str[i];
        crc = crc ^ byte;
        for (j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
        }
        i = i + 1;
    }
    return ~crc;
}

inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
    return x<<bits | x>>(32-bits);      // C idiom: will be optimized to a single operation
}

uint32_t Coffin_hash(char const *input) { 
    // Source: https://stackoverflow.com/a/7666668/5407270
    uint32_t result = 0x55555555;
    while (*input) { 
        result ^= *input++;
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t x17(const void * key, uint32_t h)
{
    // See: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    const uint8_t * data = (const uint8_t*)key;
    for (int i = 0; data[i] != '\0'; ++i)
    {
        h = 17 * h + (data[i] - ' ');
    }
    return h ^ (h >> 16);
}

uint32_t large_prime(const uint8_t *s, uint32_t hash)
{
    // Better alternative to x33: multiply by a large co-prime instead of a small one
    while (*s != '\0')
        hash = 17000069*hash + *s++;
    return hash;
}

void readfile(FILE* fp, char* buffer) {
    fseek(fp, 0, SEEK_END);
    size_t size = ftell(fp);
    rewind(fp);
    fread(buffer, size, 1, fp);
    buffer[size] = '\0';
}

struct timeval stop, start;
void timing_start() {
    gettimeofday(&start, NULL);
}

void timing_end() {
    gettimeofday(&stop, NULL);
    printf("took %lu us\n", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}

uint32_t apply_hash(int hash, const char* line)
{
    uint32_t result;
#if MODE == 2
    timing_start();
#endif
    switch (hash) {
        case 1: result = crc32b((const uint8_t*)line); break;
        case 2: result = large_prime((const uint8_t *)line, SEED); break;
        case 3: result = MurmurOAAT_32(line, SEED); break;
        case 4: result = FNV(line, SEED); break;
        case 5: result = Jenkins_one_at_a_time_hash(line); break;
        case 6: result = DJB2_hash((const uint8_t*)line); break;
        case 7: result = KR_v2_hash(line); break;
        case 8: result = Coffin_hash(line); break;
        case 9: result = x17(line, SEED); break;
        default: break;
    }
#if MODE == 2
    timing_end();
#endif
    return result;
}

int main(int argc, char* argv[])
{
    // Read arguments
    const int hash_choice = atoi(argv[1]);
    char const* const fname = argv[2];

    printf("Algorithm: %d\n", hash_choice);
    printf("File name: %s\n", fname);

    // Open file
    FILE* f = fopen(fname, "r");

#if MODE == 1
    char line[MAXLEN];
    size_t word_count = 0;
    uint32_t hashes[MAXWORDS];

    // Read file line by line
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';   // strip newline
        strcpy(words[word_count], line);
        word_count++;
    }
    fclose(f);

    // Calculate hashes
    timing_start();
    for (size_t i = 0; i < word_count; i++) {
        hashes[i] = apply_hash(hash_choice, words[i]);
    }
    timing_end();

    // Output hashes
    for (size_t i = 0; i < word_count; i++) {
        printf("%08x\n", hashes[i]);
    }

#else 
    // Read entire file
    readfile(f, file);
    fclose(f);
    printf("Len: %lu\n", strlen(file)); 

    // Calculate hash
    uint32_t hash = apply_hash(hash_choice, file);
    printf("%08x\n", hash);
#endif

    return 0;
}

PS 对现代哈希函数的速度和质量的更全面的审查可以在 SMHasher 存储库。请注意表中的“质量问题”列。

I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a little test suite and ran different little hashing functions on the list of 466K English words to see number of collisions for each:

Hash function      |     Collisions | Time (words) | Time (file) | Avalanching
===============================================================================
CRC32              |    23 (0.005%) |     96.1 ms  |   120.9 ms  |      99.8%
MurmurOAAT         |    26 (0.006%) |     17.2 ms  |    19.4 ms  |     100.0%
FNV hash           |    32 (0.007%) |     16.4 ms  |    14.0 ms  |      98.6%
Jenkins OAAT       |    36 (0.008%) |     19.9 ms  |    18.5 ms  |     100.0%
DJB2 hash          |   344 (0.074%) |     18.2 ms  |    12.7 ms  |      92.9%
K&R V2             |   356 (0.076%) |     18.0 ms  |    11.9 ms  |      92.8%
Coffin             |   763 (0.164%) |     16.3 ms  |    11.7 ms  |      91.4%
x17 hash           |  2242 (0.481%) |     18.6 ms  |    20.0 ms  |      91.5%
-------------------------------------------------------------------------------
large_prime        |    25 (0.005%) |     16.5 ms  |    18.6 ms  |      96.6%
MurmurHash3_x86_32 |    19 (0.004%) |     20.5 ms  |     4.4 ms  |     100.0%

I included time for both: hashing all words individually and hashing the entire file of all English words once. I also included a more complex MurmurHash3_x86_32 into my test for reference. Additionally, I also calculated "avalanching", which is a measure of how unpredictable the hashes of consecutive words are. Anything below 100% means "quite predictable" (which might be fine for hashing tables, but is bad for other uses).

Conclusions:

there is almost no point in using the popular DJB2 hash function for strings on Intel x86-64 (or AArch64 for that matter) architecture. Because it has much more collisions than similar functions (FNV, Jenkins OAAT and MurmurOAAT) while having comparable throughput. Bernstein's DJB2 performs especially bad on short strings. Example collisions: Liz/MHz, Bon/COM, Rey/SEX.
if you have to use "multiply by an odd number" method (such as DJB2 = x33, K&R V2 = x31 or SDBM = x65599), choose a large 32-bit prime number instead of a small 8-bit number. It makes a lot of difference for reducing the number of collisions. For example, hash = 17000069 * hash + s[i] produces only 25 collisions compared to DJB2's 344 collisions.

Test code (compiled with gcc -O2):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define MODE     1          // 1 = hash each word individually; 2 = hash the entire file
#define MAXLEN   50         // maximum length of a word
#define MAXWORDS 500000     // maximum number of words in a file
#define SEED     0x12345678

#if MODE == 1
char words[MAXWORDS][MAXLEN];
#else
char file[MAXWORDS * MAXLEN];
#endif

uint32_t DJB2_hash(const uint8_t *str)
{
    uint32_t hash = 5381;
    uint8_t c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}

uint32_t FNV(const void* key, uint32_t h)
{
    // See: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    h ^= 2166136261UL;
    const uint8_t* data = (const uint8_t*)key;
    for(int i = 0; data[i] != '\0'; i++)
    {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint32_t MurmurOAAT_32(const char* str, uint32_t h)
{
    // One-byte-at-a-time hash based on Murmur's mix
    // Source: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    for (; *str; ++str) {
        h ^= *str;
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const char *s)
{
    // Source: https://stackoverflow.com/a/45641002/5407270
    // a.k.a. Java String hashCode()
    uint32_t hashval = 0;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const char *str)
{
    uint32_t hash, i;
    for (hash = i = 0; str[i] != '\0'; ++i)
    {
        hash += str[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

uint32_t crc32b(const uint8_t *str) {
    // Source: https://stackoverflow.com/a/21001712
    unsigned int byte, crc, mask;
    int i = 0, j;
    crc = 0xFFFFFFFF;
    while (str[i] != 0) {
        byte = str[i];
        crc = crc ^ byte;
        for (j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
        }
        i = i + 1;
    }
    return ~crc;
}

inline uint32_t _rotl32(uint32_t x, int32_t bits)
{
    return x<<bits | x>>(32-bits);      // C idiom: will be optimized to a single operation
}

uint32_t Coffin_hash(char const *input) { 
    // Source: https://stackoverflow.com/a/7666668/5407270
    uint32_t result = 0x55555555;
    while (*input) { 
        result ^= *input++;
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t x17(const void * key, uint32_t h)
{
    // See: https://github.com/aappleby/smhasher/blob/master/src/Hashes.cpp
    const uint8_t * data = (const uint8_t*)key;
    for (int i = 0; data[i] != '\0'; ++i)
    {
        h = 17 * h + (data[i] - ' ');
    }
    return h ^ (h >> 16);
}

uint32_t large_prime(const uint8_t *s, uint32_t hash)
{
    // Better alternative to x33: multiply by a large co-prime instead of a small one
    while (*s != '\0')
        hash = 17000069*hash + *s++;
    return hash;
}

void readfile(FILE* fp, char* buffer) {
    fseek(fp, 0, SEEK_END);
    size_t size = ftell(fp);
    rewind(fp);
    fread(buffer, size, 1, fp);
    buffer[size] = '\0';
}

struct timeval stop, start;
void timing_start() {
    gettimeofday(&start, NULL);
}

void timing_end() {
    gettimeofday(&stop, NULL);
    printf("took %lu us\n", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}

uint32_t apply_hash(int hash, const char* line)
{
    uint32_t result;
#if MODE == 2
    timing_start();
#endif
    switch (hash) {
        case 1: result = crc32b((const uint8_t*)line); break;
        case 2: result = large_prime((const uint8_t *)line, SEED); break;
        case 3: result = MurmurOAAT_32(line, SEED); break;
        case 4: result = FNV(line, SEED); break;
        case 5: result = Jenkins_one_at_a_time_hash(line); break;
        case 6: result = DJB2_hash((const uint8_t*)line); break;
        case 7: result = KR_v2_hash(line); break;
        case 8: result = Coffin_hash(line); break;
        case 9: result = x17(line, SEED); break;
        default: break;
    }
#if MODE == 2
    timing_end();
#endif
    return result;
}

int main(int argc, char* argv[])
{
    // Read arguments
    const int hash_choice = atoi(argv[1]);
    char const* const fname = argv[2];

    printf("Algorithm: %d\n", hash_choice);
    printf("File name: %s\n", fname);

    // Open file
    FILE* f = fopen(fname, "r");

#if MODE == 1
    char line[MAXLEN];
    size_t word_count = 0;
    uint32_t hashes[MAXWORDS];

    // Read file line by line
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';   // strip newline
        strcpy(words[word_count], line);
        word_count++;
    }
    fclose(f);

    // Calculate hashes
    timing_start();
    for (size_t i = 0; i < word_count; i++) {
        hashes[i] = apply_hash(hash_choice, words[i]);
    }
    timing_end();

    // Output hashes
    for (size_t i = 0; i < word_count; i++) {
        printf("%08x\n", hashes[i]);
    }

#else 
    // Read entire file
    readfile(f, file);
    fclose(f);
    printf("Len: %lu\n", strlen(file)); 

    // Calculate hash
    uint32_t hash = apply_hash(hash_choice, file);
    printf("%08x\n", hash);
#endif

    return 0;
}

P.S. A more comprehensive review of speed and quality of modern hash functions can be found in SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.

回复收藏 0 原文

枫以 2024-12-15 12:22:21

维基百科展示了一个很好的字符串哈希函数，称为 Jenkins One At A Time Hash。它还引用了该哈希的改进版本。

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

回复收藏 0 原文

栖竹 2024-12-15 12:22:21

`djb2` 很好

虽然 djb2 ，正如 cnicutar 在 stackoverflow 上介绍的那样，几乎可以肯定更好，我认为也值得显示 K&R 哈希值：

K&R 哈希值之一是很糟糕，有一个可能相当不错：

显然是一种糟糕的哈希算法，如 K&R 第一版中所介绍的那样。 这只是字符串中所有字节的总和（源):

无符号长哈希(unsigned char *str)
{
    无符号整型哈希 = 0；
    整数c;

    而 (c = *str++)
        散列+=c；

    返回哈希值；
}

可能是一个相当不错的哈希算法，如 K&R 版本 2 中所示（由我在本书第 144 页验证）；注意：如果您打算在哈希算法之外对数组长度进行模数调整，请务必从返回语句中删除 % HASHSIZE 。另外，我建议您将返回和“hashval”类型设置为unsigned long，甚至更好：uint32_t或uint64_t，而不是简单的<代码>无符号（int）。 这是一个简单的算法，通过执行以下算法风格来考虑字符串中每个字节的字节顺序：hashvalue = new_byte + 31*hashvalue，例如字符串中的所有字节：
```
无符号哈希(char *s)
{
    无符号哈希值；

    for (hashval = 0; *s != '\0'; s++)
        哈希值 = *s + 31*哈希值；
    返回 hashval % HASHSIZE;
}
```

请注意，从这两种算法可以清楚地看出，第一版哈希如此糟糕的原因之一是它没有考虑字符串字符顺序，因此hash(“ab”) 因此将返回与 hash("ba") 相同的值。然而，对于第二版哈希来说，情况并非如此，它会（更好！）为这些字符串返回两个不同的值。

`std::unordered_map` 模板容器哈希表使用的 GCC C++11 哈希函数非常优秀。

用于 unordered_map 的 GCC C++11 哈希函数（哈希表模板）和 unordered_set（哈希集模板）如下所示。

这是对使用的 GCC C++11 哈希函数是什么问题的部分答案，指出 GCC 使用“MurmurHashUnaligned2”的实现，作者：Austin艾波比 (http://murmurhash.googlepages.com/)。
在文件“gcc/libstdc++-v3/libsupc++/hash_bytes.cc”中，此处 (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc），我找到了实现。例如，以下是“32 位 size_t”返回值（2017 年 8 月 11 日提取）：

代码：

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

Austin Appleby 的 MurmerHash3 是最好！这甚至比上面使用的 gcc C++11 `std::unordered_map` 哈希有所改进。

Austin 不仅是所有这些中最好的，而且还将 MurmerHash3 发布到公共领域。在这里查看我的其他答案：C++ std::unordered_map 中使用的默认哈希函数是什么？。

另请参阅

尝试和测试的其他哈希表算法：http://www.cse .yorku.ca/~oz/hash.html。那里提到的哈希算法：
1. djb2
2. sdbm
3. 输就输（K&R 第一版）

`djb2` is good

Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

One of the K&R hashes is terrible, one is probably pretty good:

Apparently a terrible hash algorithm, as presented in K&R 1st edition. This is simply a summation of all bytes in the string (source):

unsigned long hash(unsigned char *str)
{
    unsigned int hash = 0;
    int c;

    while (c = *str++)
        hash += c;

    return hash;
}

Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book); NB: be sure to remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long, or even better: uint32_t or uint64_t, instead of the simple unsigned (int). This is a simple algorithm which takes into account byte order of each byte in the string by doing this style of algorithm: hashvalue = new_byte + 31*hashvalue, for all bytes in the string:
```
unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31*hashval;
    return hashval % HASHSIZE;
}
```

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is because it does NOT take into consideration string character order, so hash("ab") would therefore return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.

The GCC C++11 hashing function used by the `std::unordered_map<>` template container hash table is excellent.

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

This is a partial answer to the question of what are the GCC C++11 hash functions used, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (http://murmurhash.googlepages.com/).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

MurmerHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 `std::unordered_map<>` hash used above.

Not only is is the best of all of these, but Austin released MurmerHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.

字符串的哈希函数

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（11）

`djb2` 很好

K&R 哈希值之一是很糟糕，有一个可能相当不错：

`std::unordered_map` 模板容器哈希表使用的 GCC C++11 哈希函数非常优秀。

Austin Appleby 的 MurmerHash3 是最好！这甚至比上面使用的 gcc C++11 `std::unordered_map` 哈希有所改进。

另请参阅

`djb2` is good

One of the K&R hashes is terrible, one is probably pretty good:

The GCC C++11 hashing function used by the `std::unordered_map<>` template container hash table is excellent.

MurmerHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 `std::unordered_map<>` hash used above.

See also

关于作者

相关话题

热门标签

推荐作者

此刻的回忆

leejubao

不甘平庸

南巷近海

未蓝澄海的烟

gitee_v1qxdSBNo

友情链接

字符串的哈希函数

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（11）

djb2 很好

K&R 哈希值之一是很糟糕，有一个可能相当不错：

std::unordered_map 模板容器哈希表使用的 GCC C++11 哈希函数非常优秀。

Austin Appleby 的 MurmerHash3 是最好！这甚至比上面使用的 gcc C++11 std::unordered_map 哈希有所改进。

另请参阅

djb2 is good

One of the K&R hashes is terrible, one is probably pretty good:

The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.

MurmerHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 std::unordered_map<> hash used above.

See also

关于作者

相关话题

热门标签

推荐作者

此刻的回忆

leejubao

不甘平庸

南巷近海

未蓝澄海的烟

gitee_v1qxdSBNo

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

`djb2` 很好

`std::unordered_map` 模板容器哈希表使用的 GCC C++11 哈希函数非常优秀。

Austin Appleby 的 MurmerHash3 是最好！这甚至比上面使用的 gcc C++11 `std::unordered_map` 哈希有所改进。

`djb2` is good

The GCC C++11 hashing function used by the `std::unordered_map<>` template container hash table is excellent.

MurmerHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 `std::unordered_map<>` hash used above.