Open addressing vs. separate chaining
Which hashmap collision handling scheme is better when the load factor is close to 1 to ensure minimum memory wastage?
I personally think the answer is open addressing with linear probing, because it doesn't need any additional storage space in case of collisions. Is this correct?
Comments (3)
Answering the question: Which hashmap collision handling scheme is better when the load factor is close to 1 to ensure minimum memory wastage?
Open addressing/probing that allows a high fill factor. As you said yourself, there is no extra space required for collisions (just, well, possibly time -- and that also assumes the hash function isn't perfect).
If you had not specified "load factor close to 1", or had included "cost" metrics in the question, the answer would be entirely different.
Happy coding.
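A minimal sketch of that point, assuming a fixed-size table of non-zero int keys (all names here are hypothetical): collisions are resolved by probing the next slot, so the only memory cost is the slot array itself -- no chain nodes, no pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical minimal open-addressing table with linear probing.
   Keys are non-zero ints; 0 marks an empty slot, so collisions cost
   probes (time), not extra storage (space). */
#define TABLE_SIZE 16  /* power of two so we can mask instead of mod */

typedef struct {
    int slots[TABLE_SIZE];
} Table;

static uint32_t hash_int(int k) {
    /* cheap multiplicative mix; any decent hash works here */
    return (uint32_t)k * 2654435761u;
}

static int table_insert(Table *t, int key) {
    uint32_t i = hash_int(key) & (TABLE_SIZE - 1);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[i] == 0 || t->slots[i] == key) {
            t->slots[i] = key;
            return 1;                        /* inserted or already present */
        }
        i = (i + 1) & (TABLE_SIZE - 1);      /* linear probe to next slot */
    }
    return 0;                                /* table full */
}

static int table_contains(const Table *t, int key) {
    uint32_t i = hash_int(key) & (TABLE_SIZE - 1);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[i] == 0) return 0;      /* empty slot: key is absent */
        if (t->slots[i] == key) return 1;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```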
A hashmap that full will degrade into a linear search, so you will want to keep it under 90% full.
You are right about open addressing using less memory; chaining needs a pointer or offset field in each node.
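For illustration, compare hypothetical entry layouts for the two schemes (not taken from this answer's code); on a typical 64-bit platform the chain node roughly doubles the per-entry footprint, though exact sizes depend on padding and pointer width.

```c
#include <assert.h>

/* Illustrative layouts: chaining pays for a 'next' pointer in every
   node, while an open-addressed slot holds only the payload. */
struct ChainNode { int key; int value; struct ChainNode *next; };
struct OpenSlot  { int key; int value; };
```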
I have created a hasharray data structure for when I need very lightweight hashtables that will not have a lot of inserts. To keep memory usage low, all data is embedded in the same block of memory: the HashArray structure at the start, then two arrays for hashes & values. A hasharray can only be used when the lookup key is stored in the value.
The macros hasharray_get_hashs & hasharray_get_values are used to get the 'hashs' & 'values' arrays.
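The original definitions aren't included in this excerpt, so here is a plausible sketch of the single-block layout and those macros; all types and field names beyond the ones mentioned above are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t HashSize;   /* 16-bit hashes/values, per the text */

typedef struct {
    HashSize capacity;       /* power-of-2 slot count */
    HashSize used;           /* occupied slots */
    /* hashes[capacity] then values[capacity] follow in the same block */
} HashArray;

/* Both arrays live immediately after the header, so there is one
   allocation total and no per-node pointers. */
#define hasharray_get_hashs(ha)  ((HashSize *)((ha) + 1))
#define hasharray_get_values(ha) (hasharray_get_hashs(ha) + (ha)->capacity)

static HashArray *hasharray_new(HashSize capacity) {
    /* one calloc: header + hash array + value array */
    HashArray *ha = calloc(1, sizeof(HashArray)
                              + 2u * capacity * sizeof(HashSize));
    if (ha) ha->capacity = capacity;
    return ha;
}
```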
I have used this to add fast lookup of complex objects that are already stored in an array. The objects have a string 'name' field which is used for the lookup. The names are hashed and inserted into the hasharray along with the object's index. The values stored in the hasharray can be indexes/pointers/whole objects (I only use small 16-bit index values).
If you want to pack the hasharray until it is almost full, then you will want to use full 32-bit hashes instead of the 16-bit ones defined above. Larger 32-bit hashes will help keep searches fast when the hasharray is more than 90% full.
The hasharray as defined above can only hold a maximum of 65535 values, which is fine since I never use it on anything that would have more than a few hundred. Anything that needs more than that should just use a normal hashtable. But if memory is really an issue, the HashSize type could be changed to 32 bits. Also, I use power-of-2 lengths to keep the hash lookup fast. Some people prefer prime bucket lengths, but that is only needed if the hash function has bad distribution.
Hash collisions will happen, so the code that calls hasharray_search() needs to compare the search key with the one stored in the value object. If they don't match, hasharray_search() is called again. Non-unique keys can also exist, since searching can continue until 'NULL' is returned to find all values that match one key. The search function uses linear probing to be cache-friendly.
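A sketch of that caller-side verification loop, assuming parallel hash/value arrays and a Field object array as described; the hash function and the inlined search are illustrative stand-ins, not the answer's actual hasharray_search() code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAP 8  /* power of two */

typedef struct { const char *name; int data; } Field;

static uint16_t hashes[CAP];
static uint16_t values[CAP];   /* Field index + 1, so 0 marks an empty slot */

static uint16_t hash_str(const char *s) {
    uint32_t h = 2166136261u;                  /* FNV-1a, folded to 16 bits */
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return (uint16_t)((h >> 16) ^ h);
}

static void ha_insert(const Field *fields, uint16_t idx) {
    uint16_t h = hash_str(fields[idx].name);
    uint16_t i = h & (CAP - 1);
    while (values[i] != 0) i = (i + 1) & (CAP - 1);  /* linear probe */
    hashes[i] = h;
    values[i] = (uint16_t)(idx + 1);
}

/* Returns the index of the Field whose name matches, or -1. The stored
   16-bit hash is compared first; the full string compare runs only on
   a hash match, and probing continues on any mismatch. */
static int ha_search(const Field *fields, const char *name) {
    uint16_t h = hash_str(name);
    uint16_t i = h & (CAP - 1);
    while (values[i] != 0) {
        if (hashes[i] == h && strcmp(fields[values[i] - 1].name, name) == 0)
            return values[i] - 1;
        i = (i + 1) & (CAP - 1);   /* collision: keep probing */
    }
    return -1;                     /* empty slot reached: not present */
}
```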
Worst-case searching will degrade to a linear search of the whole array if it is 99% full and the key doesn't exist. If the keys are integers, a linear search shouldn't be too bad; also, only keys with the same hash value need to be compared. I try to size my hasharrays so they are only about 70-80% full; the space wasted on empty slots isn't much if the values are only 16-bit values. With this design you waste only 4 bytes per empty slot when using 16-bit hashes & 16-bit index values. The array of objects (Field structs in the above example) has no empty spots.
Also, most hashtable implementations I have seen don't store the computed hashes and require full key compares to resolve bucket collisions. Comparing the hashes helps a lot, since only a small part of the hash value is used to look up the bucket.
As the others said, in linear probing, when the load factor nears 1, the time complexity approaches that of linear search (when the table is completely full, it is unbounded). There is a memory-efficiency trade-off here, whereas separate chaining theoretically always gives us (expected) constant time.
Normally, under linear probing, it's recommended to keep the load factor between 1/8 and 1/2: when the array is 1/2 full, we resize it to double the original size, and on deletes, once it drops to 1/8 full, we resize it to half (reference: Algorithms, by Robert Sedgewick and Kevin Wayne). If you are really interested, that book is a good place to start.
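That resize policy can be sketched as two threshold checks (hypothetical helper names), keeping the load factor within [1/8, 1/2]:

```c
#include <assert.h>

/* Grow when half full; shrink when occupancy falls to one-eighth.
   n = live entries, cap = current capacity. */
static int capacity_after_insert(int n, int cap) {
    return (n >= cap / 2) ? cap * 2 : cap;           /* double at 1/2 full */
}

static int capacity_after_delete(int n, int cap) {
    return (n > 0 && n <= cap / 8) ? cap / 2 : cap;  /* halve at 1/8 full */
}
```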
In practice, 0.72 is said to be an empirical value that is commonly used.