What's the best hashing algorithm to use on a stl string when using hash_map?
I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high performance look ups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
11 Answers
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
From some old code of mine:
It's fast. Really freaking fast.
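The code itself did not survive extraction; a minimal reconstruction of the multiply-by-101-and-add loop described above (the function name and the `unsigned` accumulator are my assumptions):

```cpp
// Larson-style string hash: multiply the accumulator by 101 and add
// each character, letting the unsigned value wrap on overflow.
unsigned larson_hash(const char* s) {
    unsigned h = 0;
    while (*s)
        h = h * 101 + static_cast<unsigned char>(*s++);
    return h;
}
```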
That always depends on your data set.
I, for one, had surprisingly good results by using the CRC32 of the string. It works very well with a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:
http://smallcode.weblogs.us/ <-- further down the page.
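For illustration, a self-contained bitwise CRC-32 over a string (zlib/IEEE polynomial 0xEDB88320; the function name is mine, and a real lookup path would use a table-driven or library implementation such as zlib's crc32 for speed):

```cpp
#include <cstdint>
#include <string>

// Bit-at-a-time CRC-32, reflected form, polynomial 0xEDB88320.
std::uint32_t crc32_str(const std::string& s) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : s) {
        crc ^= c;
        for (int i = 0; i < 8; ++i)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```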
Boost has a boost::hash library which provides some basic hash functions for the most common types.
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common use of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector. If you're not hashing a fixed set of words, then obviously this doesn't apply.
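As a sketch, a gperf input file for a hypothetical keyword set might look like this (the keyword list is invented; gperf generates a C lookup function from it):

```
%%
if
else
while
return
break
%%
```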
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow).
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster.
If the value gets calculated too often, you might try caching the hash value by creating a new subclass of basic_string that remembers its hash value. hash_map should be doing that internally, though.
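A minimal sketch of that caching idea, using a wrapper struct rather than a basic_string subclass (all names and the mixing prime are my assumptions):

```cpp
#include <cstddef>
#include <string>

// Computes the hash once at construction so repeated lookups
// don't have to re-scan the characters.
struct CachedHashString {
    std::string value;
    std::size_t hash;

    explicit CachedHashString(const std::string& v)
        : value(v), hash(compute(v)) {}

    // The multiply-by-prime accumulator loop described above.
    static std::size_t compute(const std::string& s) {
        std::size_t h = 0;
        for (unsigned char c : s)
            h = h * 31 + c;
        return h;
    }
};
```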
I did a little searching, and funny thing, Paul Larson's little algorithm showed up here
http://www.strchr.com/hash_functions
as having the least collisions of any hash tested under a number of conditions, and it's very fast for one that isn't unrolled or table-driven.
Larson's is the simple multiply-by-101-and-add loop above.
Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.
From Hash Functions all the way down:
Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.
djb2
FNV-1
FNV-1A
A note about security and availability
Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.
Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
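For reference, the two simple hashes named above, as self-contained sketches (32-bit variants; the function names are mine):

```cpp
#include <cstdint>
#include <string>

// djb2 (Bernstein): h = h * 33 + c, seeded with 5381.
std::uint32_t djb2(const std::string& s) {
    std::uint32_t h = 5381;
    for (unsigned char c : s)
        h = h * 33 + c;
    return h;
}

// FNV-1a (32-bit): XOR in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;              // FNV prime
    }
    return h;
}
```

Note that neither takes a seed, which is exactly the denial-of-service concern raised above; seeded functions like MurmurHash or SipHash address that.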
If your strings are on average longer than a single cache line, but their length+prefix are reasonably unique, consider hashing just the length plus the first 8/16 characters. (The length is contained in the std::string object itself and is therefore cheap to read.)
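A sketch of that idea (the function name and mixing constant are mine; strings sharing both length and prefix deliberately collide under this scheme):

```cpp
#include <cstddef>
#include <string>

// Hash only the length plus the first 8 characters, assuming
// length + prefix is nearly unique across the input strings.
std::size_t length_prefix_hash(const std::string& s) {
    std::size_t h = s.size();  // stored in std::string, cheap to read
    std::size_t n = s.size() < 8 ? s.size() : 8;
    for (std::size_t i = 0; i < n; ++i)
        h = h * 101 + static_cast<unsigned char>(s[i]);
    return h;
}
```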