Which data structure should I use to store hash values?
I have a hash table that I want to store to disk. The list looks like this:
<16-byte key> <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17 bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.
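For reference, a minimal Python sketch of that flat 17-bytes-per-entry layout (the function and file names are illustrative, not from the question):

```python
def write_flat(pairs, path):
    """Pack (hex_key, value) pairs into 17-byte records: 16-byte key + 1-byte value."""
    with open(path, "wb") as f:
        for hex_key, value in pairs:
            f.write(bytes.fromhex(hex_key) + bytes([value]))

# e.g. write_flat([("a7b4903def8764941bac7485d97e4f76", 0x04),
#                  ("b859de04f2f2ff76496879bda875aecf", 0x03)], "table.bin")
```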
6 Answers
You could just sort the entries by key and do a binary search.
Fixed-size keys and data entries mean you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on metadata.
I don't think you'll do any better on disk space, and lookup time is O(log n). Insertion times are crazy long, but you said that didn't matter.
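For concreteness, here is a minimal Python sketch of that lookup, assuming the 17-byte records (16-byte key, 1-byte value) are stored sorted by key in a single flat file; the function and file names are illustrative:

```python
import os

RECORD_SIZE = 17        # 16-byte key + 1-byte value, as in the question
KEY_SIZE = 16

def lookup(path, key):
    """Binary-search a flat file of fixed-size records sorted by key.

    Returns the 1-byte value as an int, or None if the key is absent.
    """
    n = os.path.getsize(path) // RECORD_SIZE
    with open(path, "rb") as f:
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)
            record = f.read(RECORD_SIZE)
            mid_key = record[:KEY_SIZE]
            if mid_key == key:
                return record[KEY_SIZE]
            elif mid_key < key:
                lo = mid + 1
            else:
                hi = mid - 1
    return None

# e.g. lookup("table.bin", bytes.fromhex("a7b4903def8764941bac7485d97e4f76"))
```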
If you're really willing to tolerate longer access times, do sort the table, but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section at the start of the file. Using this scheme, you can find the block containing the key you need in linear time and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.
Using an off-the-shelf compression scheme (like gzip) you can tune the compression ratio as needed; a larger file (lighter compression) will presumably have quicker lookup times.
I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to look up the offset of the block to decompress and use.
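A Python sketch of one possible layout along those lines: the sorted records are split into fixed-size blocks, each block is gzip-compressed, and a small header records each block's first key, file offset, and compressed length. This sketch binary-searches the header by first key instead of storing start/end keys and scanning linearly; the block size, struct formats, and names are assumptions, not from the answer.

```python
import bisect, gzip, struct

RECORD_SIZE, KEY_SIZE = 17, 16
BLOCK_RECORDS = 4096                     # records per block; tune for memory vs. ratio
INDEX_ENTRY = struct.Struct("<16sQI")    # first key, file offset, compressed length

def build(sorted_records, path):
    """sorted_records: iterable of (16-byte key, value int), sorted by key."""
    records = list(sorted_records)
    blocks = []
    for i in range(0, len(records), BLOCK_RECORDS):
        chunk = records[i:i + BLOCK_RECORDS]
        raw = b"".join(k + bytes([v]) for k, v in chunk)
        blocks.append((chunk[0][0], gzip.compress(raw)))
    offset = 4 + len(blocks) * INDEX_ENTRY.size      # compressed data follows the header
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(blocks)))
        for first_key, comp in blocks:
            f.write(INDEX_ENTRY.pack(first_key, offset, len(comp)))
            offset += len(comp)
        for _, comp in blocks:
            f.write(comp)

def lookup(path, key):
    with open(path, "rb") as f:
        (nblocks,) = struct.unpack("<I", f.read(4))
        index = [INDEX_ENTRY.unpack(f.read(INDEX_ENTRY.size)) for _ in range(nblocks)]
        i = bisect.bisect_right([e[0] for e in index], key) - 1
        if i < 0:
            return None                  # smaller than every stored key
        _, off, length = index[i]
        f.seek(off)
        raw = gzip.decompress(f.read(length))
    lo, hi = 0, len(raw) // RECORD_SIZE - 1          # binary search inside the block
    while lo <= hi:
        mid = (lo + hi) // 2
        k = raw[mid * RECORD_SIZE:mid * RECORD_SIZE + KEY_SIZE]
        if k == key:
            return raw[mid * RECORD_SIZE + KEY_SIZE]
        elif k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```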
5 million records is about 81 MB, which is small enough to work with as an array in memory.
As you describe the problem, these look more like unique keys than hash values.
Try using a hash table to access the values (look at this link).
If I've misunderstood and these really are hashes, try building a second hash level on top of them.
A hash table can also be organized successfully on disk (e.g. as a separate file).
Addition
A solution with good search performance and little overhead is to group the records in the file by a hash of the key and keep a table of the offset at which each group starts. To look up a key:
1. compute its hash with the function;
2. look up the corresponding offset in the file;
3. read records from that position until the key is found, the offset of the next group is reached, or end-of-file.
One additional thing to point out: you can find some examples of hash functions here.
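A minimal Python sketch of that scheme, assuming the "hash" is simply the first two bytes of the key (reasonable if the keys really are uniform hashes) and the per-bucket offsets live in a fixed-size header at the front of the file; the bucket count and all names are assumptions:

```python
import struct

RECORD_SIZE, KEY_SIZE = 17, 16
NUM_BUCKETS = 1 << 16                 # 65536 buckets; tune as needed

def bucket_of(key):
    # "hash" = first two bytes of the key; fine if keys are uniform hashes
    return int.from_bytes(key[:2], "big")

def build(records, path):
    """Group records by bucket; the header holds NUM_BUCKETS+1 8-byte offsets."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records:
        buckets[bucket_of(key)].append(key + bytes([value]))
    header_size = (NUM_BUCKETS + 1) * 8
    with open(path, "wb") as f:
        pos = header_size
        for b in buckets:                 # offset where each bucket starts
            f.write(struct.pack("<Q", pos))
            pos += len(b) * RECORD_SIZE
        f.write(struct.pack("<Q", pos))   # sentinel: end of the last bucket
        for b in buckets:
            f.write(b"".join(sorted(b)))

def lookup(path, key):
    b = bucket_of(key)
    with open(path, "rb") as f:
        f.seek(b * 8)
        start, end = struct.unpack("<2Q", f.read(16))
        f.seek(start)
        data = f.read(end - start)        # typically only a handful of records
    for i in range(0, len(data), RECORD_SIZE):
        if data[i:i + KEY_SIZE] == key:
            return data[i + KEY_SIZE]
    return None
```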
Would the simple approach of just storing them in a SQLite database work? I don't suppose it'll get any smaller, but you should get very good lookup performance, and it's very easy to implement.
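For example, a sketch using Python's built-in sqlite3 module (the table and file names are illustrative; WITHOUT ROWID clusters the rows on the key itself, so no separate index is needed):

```python
import sqlite3

def build(records, path="hashes.db"):           # file name is illustrative
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS t (k BLOB PRIMARY KEY, v INTEGER)"
                " WITHOUT ROWID")
    with con:                                   # one transaction for the bulk insert
        con.executemany("INSERT OR REPLACE INTO t VALUES (?, ?)", records)
    con.close()

def lookup(path, key):
    con = sqlite3.connect(path)
    row = con.execute("SELECT v FROM t WHERE k = ?", (key,)).fetchone()
    con.close()
    return row[0] if row else None

# e.g. build([(bytes.fromhex("a7b4903def8764941bac7485d97e4f76"), 0x04)])
#      lookup("hashes.db", bytes.fromhex("a7b4903def8764941bac7485d97e4f76"))
```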
First of all, multiple files are not OK if you want to optimize for disk space, because of cluster size: even a file of ~100 bytes consumes at least one whole cluster on disk, 2 kB for example.
Secondly, in your case I would store the whole table in a single binary file, ordered simply by the byte values of the keys, ascending. That gives you a file whose length is exactly entriesNumber*17, which is minimal if you don't want to use compression, and it lets you search very quickly, in ~log2(entriesNumber) time: split the file into two parts and compare the key at their boundary with the key you need. If the boundary key is bigger, take the first part of the file; otherwise take the second part. Then divide the chosen part into two parts again, and so on.
So you will need about log2(entriesNumber) read operations to find a single key (roughly 23 reads for 5 million entries).
Your key is 128 bits, but if you have at most 10^7 entries, it only takes 24 bits to index them.
You could make a hash table, or
use a Bentley-style unrolled binary search (at most 24 comparisons), as in
the unrolled loop (with 32-bit ints).
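A Python sketch of that technique (a stand-in, not the answer's original 32-bit listing): the probe step sizes are fixed powers of two, so in C the loop below could be fully unrolled into 24 straight-line comparisons.

```python
def unrolled_search(keys, target):
    """Binary search over a sorted list of fewer than 2**24 keys using fixed
    power-of-two probe steps (at most 24 comparisons)."""
    n = len(keys)
    pos = -1                 # invariant: keys[pos] < target (pos == -1 means "before start")
    step = 1 << 23           # 2**23; enough for n < 2**24 entries
    while step:
        if pos + step < n and keys[pos + step] < target:
            pos += step
        step >>= 1
    pos += 1
    return pos if pos < n and keys[pos] == target else -1
```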
As always with file design, the more you know (and tell us) about the distribution of the data, the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys, which should be true if you are storing a hash table, I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values is as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 4 hex digits (2 bytes) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries whose keys begin with 0x0000; pad to a total of 100 entries (1700 bytes) with zeros;
at offset 1700 start writing all the entries whose keys begin with 0x0001, and pad again;
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file, followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough, use 16^5 subsets instead, allowing about 6 entries per subset (6x16^5 = 6291456 slots for 5x10^6 entries). I guess that this will be faster than binary search, but it is only a guess.
Insertion is a bit of a problem; it's up to you, with your knowledge of your data, to decide whether new entries (a) necessitate re-sorting a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 2 bytes from your entries, since they are implied by the calculation of the offset into the file.
What I'm describing, not terribly well, is a hash table.
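A minimal Python sketch of that layout, assuming 2-byte prefixes (65536 subsets), 100 fixed slots per subset, and zero-padding as the "empty slot" marker (which assumes no real key is all zero bytes); the constants and names are illustrative:

```python
RECORD_SIZE, KEY_SIZE = 17, 16
PREFIX_BYTES = 2                                # leading 2 bytes -> 16**4 = 65536 subsets
SLOTS_PER_SUBSET = 100                          # room for ~76 expected entries per subset
SUBSET_BYTES = SLOTS_PER_SUBSET * RECORD_SIZE   # 1700 bytes

def build(records, path):
    """Write a fixed-layout file: subset i occupies bytes [i*1700, (i+1)*1700)."""
    subsets = [bytearray() for _ in range(1 << (8 * PREFIX_BYTES))]
    for key, value in records:
        subsets[int.from_bytes(key[:PREFIX_BYTES], "big")] += key + bytes([value])
    with open(path, "wb") as f:
        for s in subsets:
            if len(s) > SUBSET_BYTES:
                raise ValueError("subset overflow; raise SLOTS_PER_SUBSET")
            f.write(s.ljust(SUBSET_BYTES, b"\x00"))    # zero-pad unused slots

def lookup(path, key):
    base = int.from_bytes(key[:PREFIX_BYTES], "big") * SUBSET_BYTES
    with open(path, "rb") as f:
        f.seek(base)
        data = f.read(SUBSET_BYTES)                    # scan at most 100 entries
    for i in range(0, SUBSET_BYTES, RECORD_SIZE):
        k = data[i:i + KEY_SIZE]
        if k == key:
            return data[i + KEY_SIZE]
        if k == b"\x00" * KEY_SIZE:                    # hit the zero padding: not present
            return None
    return None
```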