A hash-table-like structure without the STL
Is there a way to write a simple hash table with strings as the keys and a frequency as the value, so that there are NO collisions? Nothing will ever be removed from the hash table, and if a key already exists in the table, its frequency is just updated (the two counts are added together).
I was thinking there might be an algorithm that computes a unique number from the string, which would then be used as the index.
Yes, I am avoiding all STL constructs, including unordered_map.
Comments (3)
You can use any perfect hash generator, such as gperf.
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
PS. You might still want to use a map instead of a flat array/vector in case the range of hash values gets too big or too sparse.
It really depends on what you mean by 'simple'.
std::map is a fairly simple class to use. Under the hood it is typically a red-black tree, with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as the key and any type as the value. Most ordered map classes are implemented in a similar way and do not hash at all.
Hashing without collisions is by no means a trivial matter. Perhaps the simplest method is Pearson hashing.
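For reference, here is a minimal sketch of Pearson hashing that uses no STL containers. The lookup table only has to be a permutation of 0..255; the fixed seed and the small generator used to shuffle it are my own arbitrary choices, and the names `PearsonTable` and `pearson_hash` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Lookup table for Pearson hashing: a permutation of 0..255.
// Assumption: shuffled with Fisher-Yates driven by a small fixed-seed LCG;
// any other permutation would work as well.
struct PearsonTable {
    std::uint8_t t[256];
    PearsonTable() {
        for (int i = 0; i < 256; ++i) {
            t[i] = static_cast<std::uint8_t>(i);
        }
        std::uint32_t state = 12345u;  // arbitrary fixed seed
        for (int i = 255; i > 0; --i) {
            state = state * 1664525u + 1013904223u;  // LCG step
            int j = static_cast<int>((state >> 16) % static_cast<std::uint32_t>(i + 1));
            std::uint8_t tmp = t[i];
            t[i] = t[j];
            t[j] = tmp;
        }
    }
};

// Returns the 8-bit Pearson hash of a NUL-terminated string.
std::uint8_t pearson_hash(const char* s) {
    assert(s != nullptr);
    static const PearsonTable table;
    std::uint8_t h = 0;
    for (; *s != '\0'; ++s) {
        // XOR the running hash with the next byte, then look the result up.
        h = table.t[h ^ static_cast<std::uint8_t>(*s)];
    }
    return h;
}
```

Note that the result is a single byte, so this can be collision-free for at most 256 distinct strings, and only if the table happens to suit that particular set.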
It seems like you have 3 choices:
1. Implement your own perfect hashing class. This would be a pretty good-sized class with a lot of functionality and some decently complex algorithms. I don't think this is simple.
2. Download and use a perfect hashing library that is already out there. Of course, you then have to worry about deployability.
3. Use the STL's map class. It's built in, well documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
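To show how little code the std::map option needs, here is a sketch of the frequency update described in the question. The helper names `add_frequency` and `frequency_of` are made up for illustration.

```cpp
#include <map>
#include <string>

// Adds `count` to the frequency stored for `word`.
// operator[] creates a missing key with the value 0 first.
void add_frequency(std::map<std::string, unsigned long>& freq,
                   const std::string& word, unsigned long count) {
    freq[word] += count;
}

// Returns the stored frequency, or 0 if the word was never added.
unsigned long frequency_of(const std::map<std::string, unsigned long>& freq,
                           const std::string& word) {
    std::map<std::string, unsigned long>::const_iterator it = freq.find(word);
    return it == freq.end() ? 0 : it->second;
}
```

There are no collisions to think about here because the tree compares the keys themselves; the price is O(log n) string comparisons per lookup.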
If I may ask, why are you avoiding the STL?
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. Otherwise, what you ask is impossible: there are more possible strings than table slots, so some two strings must share a slot.
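For a small known key set you can even get away without a generator. The sketch below brute-forces a seed until every key lands in its own slot. It is not how gperf works; the hash is an FNV-1a variant with an extra mixing step of my own, and `seeded_hash`, `slot_for`, `find_perfect_seed`, and `kMaxSlots` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kMaxSlots = 64;  // hypothetical upper bound on the table size

// FNV-1a with the seed mixed into the offset basis, plus a final mixing step
// so that the low bits depend on the whole seed (assumption, not a standard).
std::uint32_t seeded_hash(const char* s, std::uint32_t seed) {
    std::uint32_t h = 2166136261u ^ seed;
    for (; *s != '\0'; ++s) {
        h ^= static_cast<unsigned char>(*s);
        h *= 16777619u;  // 32-bit FNV prime
    }
    h ^= h >> 15;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

// Maps a key to a table slot for a given seed.
std::size_t slot_for(const char* s, std::uint32_t seed, std::size_t slots) {
    return seeded_hash(s, seed) % slots;
}

// Tries seeds 0..max_seed-1; returns true and stores the seed as soon as
// all n keys fall into distinct slots.
bool find_perfect_seed(const char* const* keys, std::size_t n, std::size_t slots,
                       std::uint32_t max_seed, std::uint32_t* seed_out) {
    assert(n <= slots && slots <= kMaxSlots);
    for (std::uint32_t seed = 0; seed < max_seed; ++seed) {
        bool used[kMaxSlots] = {false};
        bool ok = true;
        for (std::size_t i = 0; i < n && ok; ++i) {
            std::size_t slot = slot_for(keys[i], seed, slots);
            ok = !used[slot];  // fail on the first slot that is already taken
            used[slot] = true;
        }
        if (ok) {
            *seed_out = seed;
            return true;
        }
    }
    return false;
}
```

Once a seed is found, `slot_for(key, seed, slots)` can index a plain array of counters directly. This only holds for the keys that were in the set during the search; any new string can collide.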
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a table big enough that the chance of a birthday-paradox collision is low enough to suit you. Then you just use n bits of the output of SHA-1, and 2^n will be your table size.
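To get a feel for how big "big enough" is, the usual birthday approximation can be evaluated directly. This sketch assumes the hash output is uniformly distributed and uses the estimate 1 - exp(-k(k-1)/(2N)) for k distinct keys and N = 2^n slots; `collision_probability` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `keys` distinct strings share
// a slot in a table of 2^bits slots, assuming a uniformly distributed hash.
double collision_probability(double keys, int bits) {
    assert(keys >= 0.0 && bits >= 0);
    double slots = std::ldexp(1.0, bits);  // 2^bits
    // 1 - exp(x) computed as -expm1(x) to stay accurate for tiny probabilities.
    return -std::expm1(-keys * (keys - 1.0) / (2.0 * slots));
}
```

With these assumptions, 10,000 distinct strings hashed to 32 bits already give roughly a 1.2% chance of at least one collision, and a 2^32-entry table is far too large to allocate for that many keys anyway, which is why this approach only lowers the odds rather than removing them.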
I'm also wondering if maybe you could use a Bloom filter with an actual counter in each slot instead of a bit. Keep a list of all the words you've stuffed into the Bloom filter and which entries each of them incremented (these will be the same every time), and you have yourself a gigantic linear system that you might be able to solve to get all the individual counts back out again.
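A rough sketch of the counter-per-slot idea, again without STL containers. It does not solve the linear system; instead it reads back the smallest of a word's counters, which is the count-min style of estimate and can only overestimate. The sizes, the seeded FNV-1a hash, and the class name `CountingBloom` are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kSlots = 1024;  // hypothetical number of counters
const int kHashes = 3;            // hypothetical number of hash functions

// Counting Bloom filter: each slot holds a counter instead of a single bit.
class CountingBloom {
public:
    CountingBloom() {
        for (std::size_t i = 0; i < kSlots; ++i) {
            counters_[i] = 0;
        }
    }

    // Adds `count` occurrences of `word` to each of its kHashes counters.
    void add(const char* word, std::uint32_t count) {
        assert(word != nullptr);
        for (int i = 0; i < kHashes; ++i) {
            counters_[slot(word, i)] += count;
        }
    }

    // Upper bound on the frequency of `word`: the smallest of its counters.
    std::uint32_t estimate(const char* word) const {
        std::uint32_t best = counters_[slot(word, 0)];
        for (int i = 1; i < kHashes; ++i) {
            std::uint32_t c = counters_[slot(word, i)];
            if (c < best) {
                best = c;
            }
        }
        return best;
    }

private:
    // FNV-1a with a per-function constant mixed into the offset basis.
    static std::size_t slot(const char* s, int which) {
        std::uint32_t h = 2166136261u ^ (static_cast<std::uint32_t>(which) * 0x9E3779B9u);
        for (; *s != '\0'; ++s) {
            h ^= static_cast<unsigned char>(*s);
            h *= 16777619u;
        }
        h ^= h >> 16;
        return h % kSlots;
    }

    std::uint32_t counters_[kSlots];
};
```

The estimate is exact whenever at least one of a word's counters is shared with no other word, and too high otherwise, so the result is approximate frequencies rather than the collision-free table the question asks for.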