Hashing byte strings
I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hash function for these byte strings. Everything I think of either has a very uneven distribution or takes too long. Here is a list of the constraints I'm working with:
- All byte strings will be at least two bytes in length.
- The hash table will have a maximum size of 3839, and it is very likely it will fill.
- Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
- Otherwise, bytes in the string can be any value from 0 - 255 (I'm working with raw byte-data of any format).
- I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs (i.e., unistd.h is fine).
- Security is of NO concern.
- Speed is of a HIGH concern.
- The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
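For reference, here is a minimal sketch of the kind of hash these constraints describe. It assumes 32-bit FNV-1a, a common non-cryptographic byte hash, reduced modulo the table size; this is not something the post itself proposes. The names `fnv1a32`, `bucket_for`, and `TABLE_SIZE` are hypothetical. The length is passed explicitly because raw byte strings may contain zero bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 3839u  /* maximum table size given in the list above */

/* 32-bit FNV-1a over an arbitrary byte string (embedded zeros are fine). */
uint32_t fnv1a32(const unsigned char *s, size_t len)
{
    uint32_t h = 2166136261u;        /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= s[i];                   /* mix in one byte */
        h *= 16777619u;              /* multiply by the FNV 32-bit prime */
    }
    return h;
}

/* Reduce the hash to a slot index in [0, TABLE_SIZE). */
uint32_t bucket_for(const unsigned char *s, size_t len)
{
    return fnv1a32(s, len) % TABLE_SIZE;
}
```

Whether this distributes well enough for data whose high bit is rarely set would have to be measured on real input. A table that is expected to fill completely also needs a collision strategy (chaining, for instance), which this sketch does not cover.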
Comments (1)
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
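To make that concrete, here is a minimal sketch of a byte-wise trie in C. It is only one possible layout, not a definitive implementation. It assumes each node keeps its children in a linked list (first child / next sibling) so that nodes stay small. All names (`TrieNode`, `trie_insert`, `trie_lookup`, and so on) are hypothetical, and `symbol` stands in for whatever dictionary code you assign, with -1 meaning "no string ends here".

```c
#include <stddef.h>
#include <stdlib.h>

/* One node per byte of a stored string. */
typedef struct TrieNode {
    struct TrieNode *first_child;   /* head of this node's child list */
    struct TrieNode *next_sibling;  /* next child of the same parent */
    int symbol;                     /* dictionary code, or -1 if none */
    unsigned char byte;             /* byte on the edge from the parent */
} TrieNode;

/* Allocate a node; use trie_new_node(0) to create the root. */
TrieNode *trie_new_node(unsigned char byte)
{
    TrieNode *n = calloc(1, sizeof *n);
    if (n) {
        n->byte = byte;
        n->symbol = -1;
    }
    return n;
}

/* Linear scan of the child list for a matching byte. */
static TrieNode *trie_find_child(const TrieNode *parent, unsigned char byte)
{
    for (TrieNode *c = parent->first_child; c; c = c->next_sibling)
        if (c->byte == byte)
            return c;
    return NULL;
}

/* Store s[0..len) with the given symbol. Returns 0, or -1 if out of memory. */
int trie_insert(TrieNode *root, const unsigned char *s, size_t len, int symbol)
{
    TrieNode *node = root;
    for (size_t i = 0; i < len; i++) {
        TrieNode *child = trie_find_child(node, s[i]);
        if (!child) {
            child = trie_new_node(s[i]);
            if (!child)
                return -1;
            child->next_sibling = node->first_child;  /* push front */
            node->first_child = child;
        }
        node = child;
    }
    node->symbol = symbol;
    return 0;
}

/* Return the symbol stored for s[0..len), or -1 if it is not present. */
int trie_lookup(const TrieNode *root, const unsigned char *s, size_t len)
{
    const TrieNode *node = root;
    for (size_t i = 0; i < len; i++) {
        node = trie_find_child(node, s[i]);
        if (!node)
            return -1;   /* reject as soon as a byte has no edge */
    }
    return node->symbol;
}

/* Free a node, its siblings, and everything below them. */
void trie_free(TrieNode *node)
{
    while (node) {
        TrieNode *next = node->next_sibling;
        trie_free(node->first_child);
        free(node);
        node = next;
    }
}
```

The sibling list trades lookup speed for memory: each step scans up to 256 siblings in the worst case. A 256-entry child array per node would make each step constant-time at a much higher memory cost, so which layout is better depends on your data.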
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can get rid of a sequence and use the previous one for both of them, with some internal notation to hold the differences. That will help you compress files better because: