Hash table optimized for full iteration + key replacement

Posted 2024-11-06 15:08:23

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:

  • Iterate through all key/value pairs. (The speed of this operation is critical.)
  • Modify keys (i.e. remove a key/value pair & add another with the same value but a different key; detect duplicate keys & combine values if necessary). This is done in a loop, affecting many thousands of keys, but with no other operations intervening. (See the sketch below.)
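
To pin down those semantics, here is a minimal sketch of the bulk rekey pass, written against `std::unordered_map` purely as a stand-in. The names `Key`, `Value`, `rekey_all`, and the two callbacks are hypothetical, not from the post:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

using Key   = std::uint64_t;   // stand-in for the 13-byte key
using Value = std::uint64_t;

// Rekey every entry: derive a replacement key for each pair, and when
// two pairs collide on the new key, combine their values.
template <class NewKey, class Merge>
void rekey_all(std::unordered_map<Key, Value>& table,
               NewKey new_key, Merge merge)
{
    std::unordered_map<Key, Value> next;
    next.reserve(table.size());
    for (const auto& [k, v] : table) {
        Key nk = new_key(k);                          // replacement key
        auto [it, inserted] = next.try_emplace(nk, v);
        if (!inserted)                                // duplicate key:
            it->second = merge(it->second, v);        // combine the values
    }
    table = std::move(next);
}
```

Rebuilding into a fresh table like this is the naive strategy; the point of the question is to find a layout where the same pass can run with less memory traffic and without leaving deletion debris behind.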

I would also like it to consume as little memory as possible.

Other standard operations must be available, though they are used less frequently, e.g.

  • Insert a new key/value pair
  • Given a key, look up the corresponding value
  • Change the value associated with an existing key

Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.

Issues with common implementations:

  • Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
  • Most open addressing schemes have a major disadvantage for my application: keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently. (The sketch below shows why those markers are unavoidable.)
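
To make the deletion-marker problem concrete, here is a minimal linear-probing erase (a generic sketch, not from the post; a power-of-two table size is assumed for the index mask). The freed slot cannot simply be marked empty, because that would cut the probe chain for any key inserted past it, so every delete in a bulk rekey pass leaves a tombstone that still counts against the load factor:

```cpp
#include <cstdint>
#include <vector>

// Minimal linear-probing slot states; names are illustrative only.
enum class SlotState : std::uint8_t { Empty, Occupied, Tombstone };

struct Slot {
    SlotState state = SlotState::Empty;
    std::uint64_t key = 0;
    std::uint64_t value = 0;
};

// Erase under linear probing: the slot must become a Tombstone, not
// Empty, or later lookups would stop probing too early and miss keys
// that were inserted past this slot.
bool erase(std::vector<Slot>& slots, std::uint64_t key, std::uint64_t hash)
{
    for (std::size_t i = 0; i < slots.size(); ++i) {
        Slot& s = slots[(hash + i) & (slots.size() - 1)];
        if (s.state == SlotState::Empty) return false;  // probe chain ends
        if (s.state == SlotState::Occupied && s.key == key) {
            s.state = SlotState::Tombstone;             // marker stays behind
            return true;
        }
    }
    return false;
}
```

A delete-then-reinsert loop over thousands of keys therefore leaves thousands of these markers, which is exactly the rebuild pressure described above.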

Schemes that work, but are less than ideal:

  • Separate chaining with an array (instead of a linked list) per bucket:
    Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
  • Linear probing/quadratic probing/double hashing (with or without Brent's Variation):
    Table quickly fills up with deletion markers
  • Cuckoo hashing
    Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.

Is there a specialized hashing scheme that would work well for this case?


Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
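
For reference, this is the probe sequence double hashing would derive from two such hash values (a generic sketch; the post's actual hash function is not shown). Over a power-of-two table, forcing the stride odd makes it coprime with the table size, so the sequence visits every slot:

```cpp
#include <cstddef>
#include <cstdint>

// Generic double-hashing probe: h1 picks the start slot, h2 (forced
// odd, hence coprime with a power-of-two table size) picks the stride.
inline std::size_t probe(std::uint64_t h1, std::uint64_t h2,
                         std::size_t attempt, std::size_t table_size)
{
    std::size_t stride = static_cast<std::size_t>(h2) | 1;  // odd stride
    return (static_cast<std::size_t>(h1) + attempt * stride)
           & (table_size - 1);
}
```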

Comments (3)

北城挽邺 2024-11-13 15:08:23

Would Extendible Hashing help? Iterating through the keys by walking the 'directory' should be fast (see the sketch below). Not sure whether the "modify key for value" operation is any better with this scheme or not.
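
For concreteness, a sketch of the directory walk this suggests, assuming a textbook extendible-hashing layout indexed by the low-order bits of the hash (all type and field names here are hypothetical). A bucket with local depth d is pointed to by 2^(global_depth - d) directory entries, so iteration must visit each bucket only from its canonical (lowest) index:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical textbook extendible-hashing layout.
struct Bucket {
    unsigned local_depth = 0;
    std::vector<std::pair<std::uint64_t, std::uint64_t>> entries;
};

struct ExtendibleHash {
    unsigned global_depth = 0;
    std::vector<std::shared_ptr<Bucket>> directory;  // size == 1 << global_depth
};

// Visit every key/value pair exactly once. The lowest directory index
// pointing at a bucket is its index masked to local_depth bits, so any
// higher duplicate pointer is skipped.
template <class F>
void for_each(const ExtendibleHash& h, F visit)
{
    for (std::size_t i = 0; i < h.directory.size(); ++i) {
        const Bucket& b = *h.directory[i];
        std::size_t first = i & ((std::size_t{1} << b.local_depth) - 1);
        if (i != first) continue;                    // duplicate pointer
        for (const auto& [k, v] : b.entries) visit(k, v);
    }
}
```

The directory itself is a flat array of pointers, which is why the walk is fast; whether bulk rekeying is any cheaper here is exactly the open question this comment raises.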

聽兲甴掵 2024-11-13 15:08:23

Based on how you're accessing the data, does it really make sense to use a hash table at all?

Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure (see the sketch below).

It doesn't seem like you really need the constant-time random access a hash table is built for.
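
As a point of comparison, a minimal sketch of the sorted-flat-array approach (illustrative names; a B-tree would trade some scan speed for cheaper updates). Iteration becomes a linear scan over contiguous memory, and point lookups fall back to O(log n) binary search:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sorted flat array of pairs: best-possible iteration locality,
// O(log n) lookups via binary search.
using Pair = std::pair<std::uint64_t, std::uint64_t>;

const std::uint64_t* find(const std::vector<Pair>& v, std::uint64_t key)
{
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Pair& p, std::uint64_t k) { return p.first < k; });
    if (it != v.end() && it->first == key) return &it->second;
    return nullptr;
}

// Iteration is just a contiguous scan:
//   for (const auto& [k, v] : table) { ... }
```

The catch for this workload is the rekey pass: each changed key must move to its new sorted position, which is O(n) per key unless the whole batch is rewritten in one merge-style pass.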

醉城メ夜风 2024-11-13 15:08:23

You can do much better than a 50% load factor with cuckoo hashing.

Two hash functions with four items per bucket will get you over 90% with little effort. See this paper:

http://www.ru.is/faculty/ulfar/CuckooHash.pdf

I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
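
A compact sketch of the bucketized scheme this describes: two hash functions, each naming one candidate bucket of four slots, with cuckoo-style displacement when both are full (names, hash constants, and the displacement budget are illustrative; a real implementation would also handle resizing and duplicate keys):

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Bucketized cuckoo hashing: 2 hash functions, 4 slots per bucket.
// Table size must be a power of two for the index masks below.
constexpr std::size_t kSlots = 4;
constexpr int kMaxKicks = 500;   // displacement budget before giving up

struct Entry { std::uint64_t key = 0, value = 0; bool used = false; };
struct Bucket { Entry slot[kSlots]; };

struct CuckooTable {
    std::vector<Bucket> buckets;
    explicit CuckooTable(std::size_t pow2_size) : buckets(pow2_size) {}

    // Two independent multiplicative hashes (constants are arbitrary).
    std::size_t h1(std::uint64_t k) const {
        return (k * 0x9E3779B97F4A7C15ULL) & (buckets.size() - 1);
    }
    std::size_t h2(std::uint64_t k) const {
        return ((k * 0xC2B2AE3D27D4EB4FULL) >> 17) & (buckets.size() - 1);
    }

    std::optional<std::uint64_t> find(std::uint64_t k) const {
        for (std::size_t b : { h1(k), h2(k) })
            for (const Entry& e : buckets[b].slot)
                if (e.used && e.key == k) return e.value;
        return std::nullopt;
    }

    // Returns false when the displacement budget runs out; the caller
    // should then grow the table and rehash.
    bool insert(std::uint64_t k, std::uint64_t v) {
        // Free slot in either candidate bucket?
        for (std::size_t b : { h1(k), h2(k) })
            for (Entry& e : buckets[b].slot)
                if (!e.used) { e = {k, v, true}; return true; }
        // Both full: evict a resident entry and push it to its
        // alternate bucket, repeating until a hole is found.
        Entry cur{k, v, true};
        std::size_t b = h1(k);
        for (int kick = 0; kick < kMaxKicks; ++kick) {
            std::swap(cur, buckets[b].slot[kick % kSlots]);
            b = (h1(cur.key) == b) ? h2(cur.key) : h1(cur.key);
            for (Entry& e : buckets[b].slot)
                if (!e.used) { e = cur; return true; }
        }
        return false;
    }
};
```

The per-bucket slots are what lift the achievable load factor: with one item per bucket, classic cuckoo hashing stalls near 50%, while wider buckets let insertions find room without long eviction chains, which is how the paper's >90% (and this commenter's 99% with seven slots) figures arise.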
