What's the best hashing algorithm to use on an STL string when using hash_map?

I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high performance lookups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?

Comments (11)

燃情 2024-07-13 04:21:42

I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.

unsigned int
hash(
    const char* s,
    unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;   /* multiply by the prime 101 and add the next character */
    }
    return hash;
}
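
For completeness (my addition, not part of the original answer), here is a rough sketch of plugging a hash like this into VS2005's stdext::hash_map. The traits class follows the Dinkumware hash_compare convention (bucket_size, min_buckets, plus a hashing and an ordering operator()) as I recall it, so verify the member names against your <hash_map> header:

#include <hash_map>
#include <string>

// Hypothetical hash_compare-style traits wrapping Larson's multiply-by-101 loop.
// Member names follow the Dinkumware convention used by stdext::hash_map (assumption).
struct larson_hash_compare
{
    enum { bucket_size = 4, min_buckets = 8 };

    size_t operator()(const std::string& key) const
    {
        unsigned int h = 0;
        for (const char* s = key.c_str(); *s; ++s)
            h = h * 101 + (unsigned char)*s;   // Larson's multiply-and-add loop
        return h;
    }

    bool operator()(const std::string& lhs, const std::string& rhs) const
    {
        return lhs < rhs;                      // ordering predicate required by the traits
    }
};

stdext::hash_map<std::string, int, larson_hash_compare> lookup_table;
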
旧街凉风 2024-07-13 04:21:42

From some old code of mine:

#include <string>
using std::string;

/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ (32-bit FNV parameters) */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV-1a) Hash */
size_t myhash(const string &s)
{
    size_t hash = InitialFNV;
    for(size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ (s[i]);       /* xor the low 8 bits */
        hash = hash * FNVMultiple;  /* multiply by the magic number */
    }
    return hash;
}

It's fast. Really freaking fast.
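
A side note from me (not part of the original answer): those magic numbers are the 32-bit FNV parameters, so on a 64-bit build where size_t is 8 bytes you may want the published 64-bit constants instead. A minimal sketch:

#include <string>

/* 64-bit FNV-1a parameters, also listed on the isthe.com/chongo FNV page */
static const unsigned long long InitialFNV64  = 14695981039346656037ULL;
static const unsigned long long FNVMultiple64 = 1099511628211ULL;

unsigned long long myhash64(const std::string &s)
{
    unsigned long long hash = InitialFNV64;
    for (size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ (unsigned char)s[i];  /* xor in the next byte */
        hash = hash * FNVMultiple64;        /* multiply by the 64-bit FNV prime */
    }
    return hash;
}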

奈何桥上唱咆哮 2024-07-13 04:21:42

That always depends on your data-set.

I for one had surprisingly good results by using the CRC32 of the string. It works very well with a wide range of different input sets.

Lots of good CRC32 implementations are easy to find on the net.

Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:

http://smallcode.weblogs.us/ <-- further down the page.
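
For illustration (my addition, not from the original answer), a minimal table-driven CRC-32 using the common reflected polynomial 0xEDB88320 can serve as a string hash; for real code, prefer a vetted implementation or hardware CRC where available:

#include <string>

// Minimal table-driven CRC-32 (reflected polynomial 0xEDB88320), usable as a string hash.
// Note: the lazy table initialisation below is not thread-safe.
unsigned int crc32_hash(const std::string& s)
{
    static unsigned int table[256];
    static bool table_built = false;
    if (!table_built)
    {
        for (unsigned int i = 0; i < 256; ++i)
        {
            unsigned int c = i;
            for (int bit = 0; bit < 8; ++bit)
                c = (c & 1) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
            table[i] = c;
        }
        table_built = true;
    }

    unsigned int crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < s.size(); ++i)
        crc = table[(crc ^ (unsigned char)s[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}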

挽清梦 2024-07-13 04:21:42

Boost has a boost::hash library which provides some basic hash functions for the most common types.
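
A minimal usage sketch (my addition, not part of the original answer); boost::hash lives in <boost/functional/hash.hpp>, and newer Boost releases also ship boost::unordered_map, which defaults its hash parameter to boost::hash<Key>:

#include <boost/functional/hash.hpp>
#include <boost/unordered_map.hpp>
#include <string>

void example()
{
    boost::hash<std::string> string_hasher;          // hashes a std::string directly
    std::size_t h = string_hasher(std::string("example"));
    (void)h;

    // boost::unordered_map defaults its hash parameter to boost::hash<Key>
    boost::unordered_map<std::string, int> counts;
    counts["example"] = 1;
}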

三生路 2024-07-13 04:21:42

I've used the Jenkins hash to write a Bloom filter library; it has great performance.

Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c

This is what Perl uses for its hashing operation, fwiw.
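
If you don't want to pull in the full lookup3.c, a simpler relative from the same author is Jenkins' one-at-a-time hash (my addition, not from the original answer):

#include <string>

// Bob Jenkins' one-at-a-time hash, a simpler relative of lookup3.
unsigned int one_at_a_time(const std::string& key)
{
    unsigned int hash = 0;
    for (size_t i = 0; i < key.size(); ++i)
    {
        hash += (unsigned char)key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}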

羁绊已千年 2024-07-13 04:21:42

If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector.

If you're not hashing a fixed set of words, then obviously this doesn't apply.
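
To make that concrete (my addition, not from the original answer), a gperf input file is essentially a list of keywords between %% markers; you then generate the lookup code with something like `gperf -L C++ keywords.gperf > keyword_hash.hpp` (flags from memory, so check the gperf manual):

%{
/* keywords.gperf -- the fixed keyword set goes between the %% markers */
%}
%%
if
else
while
return
%%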

热情消退 2024-07-13 04:21:42

One classic suggestion for a string hash is to step through the letters one by one adding their ascii/unicode values to an accumulator, each time multiplying the accumulator by a prime number. (allowing overflow on the hash value)

  template <class T> struct myhash;   // primary template, specialized for string below

  template <> struct myhash<string>
    {
    size_t operator()(const string &to_hash) const
      {
      const char * in = to_hash.c_str();
      size_t out = 0;
      while('\0' != *in)
        {
        out *= 53; //just a prime number
        out += *in;
        ++in;
        }
      return out;
      }
    };

  hash_map<string, int, myhash<string> > my_hash_map;

It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster.

You might try caching the hash value better by creating a new subclass of basic_string that remembers its hash value, if the value gets calculated too often. hash_map should be doing that internally, though.
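
A rough sketch of that caching idea (my addition, not part of the original answer), using a small wrapper struct rather than an actual basic_string subclass; the names here are hypothetical:

#include <string>

// Hypothetical wrapper that computes the hash once and remembers it.
// (Keys are assumed immutable after construction.)
struct hashed_string
{
    std::string value;
    size_t hash;

    explicit hashed_string(const std::string& s)
        : value(s), hash(compute(s)) {}

    static size_t compute(const std::string& s)
    {
        size_t out = 0;
        for (size_t i = 0; i < s.size(); ++i)
            out = out * 53 + (unsigned char)s[i];   // same multiply-by-prime loop as above
        return out;
    }
};

struct hashed_string_hash
{
    size_t operator()(const hashed_string& hs) const { return hs.hash; } // return the cached value
};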

桃扇骨 2024-07-13 04:21:42

I did a little searching, and funny thing, Paul Larson's little algorithm showed up here
http://www.strchr.com/hash_functions
as having the least collisions of any tested in a number of conditions, and it's very fast for one that isn't unrolled or table-driven.

Larson's being the simple multiply by 101 and add loop above.

衣神在巴黎 2024-07-13 04:21:42

Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.

揽清风入怀 2024-07-13 04:21:42

From Hash Functions all the way down:

MurmurHash got quite popular, at least in game developer circles, as a “general hash function”.

It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!

Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.

Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 wins in performance.

djb2

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

FNV-1

hash = FNV_offset_basis
for each byte_of_data to be hashed
     hash = hash × FNV_prime
     hash = hash XOR byte_of_data
return hash

FNV-1A

hash = FNV_offset_basis
for each byte_of_data to be hashed
     hash = hash XOR byte_of_data
     hash = hash × FNV_prime
return hash

A note about security and availability

Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.

Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
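
As a rough illustration of that seeding idea (my addition, not from the quoted text), even a simple hash such as Larson's above can start from a per-process random value instead of a fixed constant; the time-based seed here is only a placeholder for a real entropy source:

#include <ctime>
#include <string>

// Hypothetical per-process seed chosen once at startup.
// (Time-based seeding is only for illustration; use a real entropy source in practice.)
static unsigned int hash_seed()
{
    static const unsigned int seed = (unsigned int)std::time(0) * 2654435761u;
    return seed;
}

unsigned int seeded_hash(const std::string& s)
{
    unsigned int h = hash_seed();               // start from the random seed instead of 0
    for (size_t i = 0; i < s.size(); ++i)
        h = h * 101 + (unsigned char)s[i];      // Larson's multiply-and-add loop
    return h;
}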

清风夜微凉 2024-07-13 04:21:42

If your strings are on average longer than a single cache line, but their length+prefix are reasonably unique, consider hashing just the length plus the first 8/16 characters. (The length is contained in the std::string object itself and is therefore cheap to read.)
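
A minimal sketch of that idea (my addition, not part of the original answer), mixing the length with at most the first 16 characters:

#include <string>

// Hash only the length plus a short prefix; assumes length+prefix is
// reasonably unique for your data set (otherwise collisions go up).
size_t prefix_hash(const std::string& s)
{
    size_t h = s.size();                        // the length is cheap to read
    size_t n = s.size() < 16 ? s.size() : 16;   // cap the work at the first 16 characters
    for (size_t i = 0; i < n; ++i)
        h = h * 101 + (unsigned char)s[i];
    return h;
}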
