Hash table optimized for full iteration + key replacement

Posted 2024-11-06 15:08:23

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:

  • Iterate through all key/value pairs. (The speed of this operation is critical.)
  • Modify keys (i.e. remove a key/value pair & add another with the same value but a different key; detect duplicate keys & combine values if necessary). This is done in a loop, affecting many thousands of keys, but with no other operations intervening. (See the sketch below.)
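
To pin down those semantics, here is a minimal sketch of the bulk rekey pass, written against `std::unordered_map` purely as a stand-in. The names `Key`, `Value`, `rekey_all`, and the two callbacks are hypothetical, not from the post:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

using Key   = std::uint64_t;   // stand-in for the 13-byte key
using Value = std::uint64_t;

// Rekey every entry: derive a replacement key for each pair, and when
// two pairs collide on the new key, combine their values.
template <class NewKey, class Merge>
void rekey_all(std::unordered_map<Key, Value>& table,
               NewKey new_key, Merge merge)
{
    std::unordered_map<Key, Value> next;
    next.reserve(table.size());
    for (const auto& [k, v] : table) {
        Key nk = new_key(k);                          // replacement key
        auto [it, inserted] = next.try_emplace(nk, v);
        if (!inserted)                                // duplicate key:
            it->second = merge(it->second, v);        // combine the values
    }
    table = std::move(next);
}
```

Rebuilding into a fresh table like this is the naive strategy; the point of the question is to find a layout where the same pass can run with less memory traffic and without leaving deletion debris behind.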

I would also like it to consume as little memory as possible.

Other standard operations must be available, though they are used less frequently, e.g.

  • Insert a new key/value pair
  • Given a key, look up the corresponding value
  • Change the value associated with an existing key

Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.

Issues with common implementations:

  • Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
  • Most open addressing schemes have a major disadvantage for my application: keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently. (The sketch below shows why those markers are unavoidable.)
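
To make the deletion-marker problem concrete, here is a minimal linear-probing erase (a generic sketch, not from the post; a power-of-two table size is assumed for the index mask). The freed slot cannot simply be marked empty, because that would cut the probe chain for any key inserted past it, so every delete in a bulk rekey pass leaves a tombstone that still counts against the load factor:

```cpp
#include <cstdint>
#include <vector>

// Minimal linear-probing slot states; names are illustrative only.
enum class SlotState : std::uint8_t { Empty, Occupied, Tombstone };

struct Slot {
    SlotState state = SlotState::Empty;
    std::uint64_t key = 0;
    std::uint64_t value = 0;
};

// Erase under linear probing: the slot must become a Tombstone, not
// Empty, or later lookups would stop probing too early and miss keys
// that were inserted past this slot.
bool erase(std::vector<Slot>& slots, std::uint64_t key, std::uint64_t hash)
{
    for (std::size_t i = 0; i < slots.size(); ++i) {
        Slot& s = slots[(hash + i) & (slots.size() - 1)];
        if (s.state == SlotState::Empty) return false;  // probe chain ends
        if (s.state == SlotState::Occupied && s.key == key) {
            s.state = SlotState::Tombstone;             // marker stays behind
            return true;
        }
    }
    return false;
}
```

A delete-then-reinsert loop over thousands of keys therefore leaves thousands of these markers, which is exactly the rebuild pressure described above.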

Schemes that work, but are less than ideal:

  • Separate chaining with an array (instead of a linked list) per bucket:
    Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
  • Linear probing/quadratic probing/double hashing (with or without Brent's Variation):
    Table quickly fills up with deletion markers
  • Cuckoo hashing
    Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.

Is there a specialized hashing scheme that would work well for this case?


Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
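
For reference, this is the probe sequence double hashing would derive from two such hash values (a generic sketch; the post's actual hash function is not shown). Over a power-of-two table, forcing the stride odd makes it coprime with the table size, so the sequence visits every slot:

```cpp
#include <cstddef>
#include <cstdint>

// Generic double-hashing probe: h1 picks the start slot, h2 (forced
// odd, hence coprime with a power-of-two table size) picks the stride.
inline std::size_t probe(std::uint64_t h1, std::uint64_t h2,
                         std::size_t attempt, std::size_t table_size)
{
    std::size_t stride = static_cast<std::size_t>(h2) | 1;  // odd stride
    return (static_cast<std::size_t>(h1) + attempt * stride)
           & (table_size - 1);
}
```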

Comments (3)

北城挽邺 2024-11-13 15:08:23

Would Extendible Hashing help? Iterating through the keys by walking the 'directory' should be fast (see the sketch below). Not sure whether the "modify key for value" operation is any better with this scheme or not.
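
For concreteness, a sketch of the directory walk this suggests, assuming a textbook extendible-hashing layout indexed by the low-order bits of the hash (all type and field names here are hypothetical). A bucket with local depth d is pointed to by 2^(global_depth - d) directory entries, so iteration must visit each bucket only from its canonical (lowest) index:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical textbook extendible-hashing layout.
struct Bucket {
    unsigned local_depth = 0;
    std::vector<std::pair<std::uint64_t, std::uint64_t>> entries;
};

struct ExtendibleHash {
    unsigned global_depth = 0;
    std::vector<std::shared_ptr<Bucket>> directory;  // size == 1 << global_depth
};

// Visit every key/value pair exactly once. The lowest directory index
// pointing at a bucket is its index masked to local_depth bits, so any
// higher duplicate pointer is skipped.
template <class F>
void for_each(const ExtendibleHash& h, F visit)
{
    for (std::size_t i = 0; i < h.directory.size(); ++i) {
        const Bucket& b = *h.directory[i];
        std::size_t first = i & ((std::size_t{1} << b.local_depth) - 1);
        if (i != first) continue;                    // duplicate pointer
        for (const auto& [k, v] : b.entries) visit(k, v);
    }
}
```

The directory itself is a flat array of pointers, which is why the walk is fast; whether bulk rekeying is any cheaper here is exactly the open question this comment raises.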

聽兲甴掵 2024-11-13 15:08:23

Based on how you're accessing the data, does it really make sense to use a hash table at all?

Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure (see the sketch below).

It doesn't seem like you really need the constant-time random access a hash table is built for.
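
As a point of comparison, a minimal sketch of the sorted-flat-array approach (illustrative names; a B-tree would trade some scan speed for cheaper updates). Iteration becomes a linear scan over contiguous memory, and point lookups fall back to O(log n) binary search:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sorted flat array of pairs: best-possible iteration locality,
// O(log n) lookups via binary search.
using Pair = std::pair<std::uint64_t, std::uint64_t>;

const std::uint64_t* find(const std::vector<Pair>& v, std::uint64_t key)
{
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Pair& p, std::uint64_t k) { return p.first < k; });
    if (it != v.end() && it->first == key) return &it->second;
    return nullptr;
}

// Iteration is just a contiguous scan:
//   for (const auto& [k, v] : table) { ... }
```

The catch for this workload is the rekey pass: each changed key must move to its new sorted position, which is O(n) per key unless the whole batch is rewritten in one merge-style pass.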

醉城メ夜风 2024-11-13 15:08:23

You can do much better than a 50% load factor with cuckoo hashing.

Two hash functions with four items per bucket will get you over 90% with little effort. See this paper:

http://www.ru.is/faculty/ulfar/CuckooHash.pdf

I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
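
A compact sketch of the bucketized scheme this describes: two hash functions, each naming one candidate bucket of four slots, with cuckoo-style displacement when both are full (names, hash constants, and the displacement budget are illustrative; a real implementation would also handle resizing and duplicate keys):

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Bucketized cuckoo hashing: 2 hash functions, 4 slots per bucket.
// Table size must be a power of two for the index masks below.
constexpr std::size_t kSlots = 4;
constexpr int kMaxKicks = 500;   // displacement budget before giving up

struct Entry { std::uint64_t key = 0, value = 0; bool used = false; };
struct Bucket { Entry slot[kSlots]; };

struct CuckooTable {
    std::vector<Bucket> buckets;
    explicit CuckooTable(std::size_t pow2_size) : buckets(pow2_size) {}

    // Two independent multiplicative hashes (constants are arbitrary).
    std::size_t h1(std::uint64_t k) const {
        return (k * 0x9E3779B97F4A7C15ULL) & (buckets.size() - 1);
    }
    std::size_t h2(std::uint64_t k) const {
        return ((k * 0xC2B2AE3D27D4EB4FULL) >> 17) & (buckets.size() - 1);
    }

    std::optional<std::uint64_t> find(std::uint64_t k) const {
        for (std::size_t b : { h1(k), h2(k) })
            for (const Entry& e : buckets[b].slot)
                if (e.used && e.key == k) return e.value;
        return std::nullopt;
    }

    // Returns false when the displacement budget runs out; the caller
    // should then grow the table and rehash.
    bool insert(std::uint64_t k, std::uint64_t v) {
        // Free slot in either candidate bucket?
        for (std::size_t b : { h1(k), h2(k) })
            for (Entry& e : buckets[b].slot)
                if (!e.used) { e = {k, v, true}; return true; }
        // Both full: evict a resident entry and push it to its
        // alternate bucket, repeating until a hole is found.
        Entry cur{k, v, true};
        std::size_t b = h1(k);
        for (int kick = 0; kick < kMaxKicks; ++kick) {
            std::swap(cur, buckets[b].slot[kick % kSlots]);
            b = (h1(cur.key) == b) ? h2(cur.key) : h1(cur.key);
            for (Entry& e : buckets[b].slot)
                if (!e.used) { e = cur; return true; }
        }
        return false;
    }
};
```

The per-bucket slots are what lift the achievable load factor: with one item per bucket, classic cuckoo hashing stalls near 50%, while wider buckets let insertions find room without long eviction chains, which is how the paper's >90% (and this commenter's 99% with seven slots) figures arise.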
