大数据集哈希和 C 实现
我有大量范围从 0 - 5463458053 的值。对于每个值,我希望映射一个包含字符串的集合,以便操作查找,即查找该集合中是否存在字符串最短的时间。请注意,这组值可能不包含 (0 - 5463458053) 中的所有值,但确实包含大量值。
我当前的解决方案是对这些值(0 - 5463458053 之间)进行哈希处理,并为每个值提供与该值对应的字符串链接列表。每次,我想检查给定集合中的字符串,我都会对值(0 - 5463458053 之间)进行哈希处理,获取链表,然后遍历它以找出它是否包含上述字符串。
虽然这看起来更容易,但有点耗时。你能想出更快的解决方案吗?此外,碰撞将会是可怕的。它们会导致错误的结果。
另一部分是关于在 C 中实现这一点。我将如何去做呢?
注意:有人建议使用数据库。我想知道这是否有用。
我有点担心 RAM 自然会耗尽。 :-)
I have a large number of values ranging from 0 - 5463458053. To each value, I wish to map a set containing strings so that the operation lookup, i. e. finding whether a string is present in that set takes the least amount of time. Note that this set of values may not contain all values from (0 - 5463458053), but yes, a large number of them.
My current solution is to hash those values (between 0 - 5463458053) and for each value, have a linked list of strings corresponding to that value. Every time, I want to check for a string in a given set, I hash the value(between 0 - 5463458053), get the linked list, and traverse it to find out whether it contains the aforementioned string or not.
While this might seem easier, it's a little time consuming. Can you think of a faster solution? Also, collisions will be dreadful. They'll lead to wrong results.
The other part is about implementing this in C. How would I go about doing this?
NOTE: Someone suggested using a database instead. I wonder if that'll be useful.
I'm a little worried about running out of RAM naturally. :-)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您可以有一个哈希集的哈希表。第一个哈希表包含整数的键。它里面的值是哈希集,即键是字符串的哈希表。
您还可以拥有一个散列集,其中键是整数和字符串对。
有许多库实现了此类数据结构(在 C++ 中,标准库正在实现它们,如
std::map
和std::set
)。对于 C,我想到了 GTK 的 Glib 。使用散列技术,内存使用与所考虑的集合(或关系)的大小成正比。例如,您可以接受 30% 的空置率。
You could have an hash-table of hash-sets. The first hash-table has keys your integers. The values inside it are hash-sets, i.e. hash-tables whose keys are strings.
You could also have an hashed set, with the keys being pairs of integers and strings.
There are many libraries implementing such data structures (and in C++, the standard library is implementing them, as
std::map
&std::set
). For C, I was thinking of Glib from GTK.With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept 30% emptiness rate.
Judy Array 以及实现它的 C 库可能正是您所需要的基础。这是描述它的引用:
A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:
大量字符串+快速查找+有限内存---->你想要一个前缀特里树、临界位树或该家族的任何东西(非常相似的东西有许多不同的名称,例如PATRICIA...Judy也是这样的东西)。例如,请参阅此。
这些数据结构允许前缀压缩,因此它们能够非常有效地存储大量字符串(以某种方式必然具有公共前缀)。而且,查找速度非常快。由于常见的 big-O 表示法没有考虑到缓存和分页效应,它们可以与哈希一样快甚至更快,但只占用一小部分内存(即使根据 big-O,除了可能是一个数组之外什么都没有)可以击败哈希)。
Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.
These data structores allow for prefix-compression, so they are able to store a lot of strings (which somehow necessarily will have common prefixes) very efficiently. Also, lookup is very fast. Due to caching and paging effects that the common big-O notation does not account for, they can be as fast or even faster than a hash, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash).
您可以使用单个二叉搜索树(AVL/红黑/...)来包含所有集合中的所有字符串,方法是按字典顺序将它们键入为 (set_number, string)。您不需要在任何地方显式存储集合。例如,定义树的节点顺序的比较器可能如下所示:
使用这样的结构,可以进行一些常见操作(但可能并不简单)。
要查找字符串
s
是否存在于集合set_number
中,只需在树中查找 (set_number
,s
) ,进行精确匹配。查找集合
set_number
中的所有字符串:上面需要
O((k+1)*log(n))
时间,其中<code>k是set_number
中的字符串数量,n
是所有字符串的数量。要查找至少与一个字符串关联的所有集合数字:
上述需要
O((k+1)*log(n))
时间,其中k
是集合至少有一个字符串,n
是所有字符串的数量。请注意,上面的代码假设树在“do_something”调用中没有被修改;如果删除节点,它可能会崩溃。
另外,这里有一些真实的 C 代码演示了这一点,使用 my自己的通用 AVL 树实现。要编译它,只需从 BadVPN 源代码中复制
misc/
和struction/
文件夹并在其中添加包含路径即可。请注意我的 AVL 树在其节点中不包含任何“数据”,以及它如何不执行任何自己的内存分配。当您有大量数据需要处理时,这会很方便。明确一点:下面的程序只执行一个
malloc()
,即分配节点数组的函数。You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:
With such a structure, some common operations are possible (but maybe not straightforward).
To find whether a string
s
exists in the setset_number
, simply lookup (set_number
,s
) in the tree, for an exact match.To find all strings in the set
set_number
:The above requires
O((k+1)*log(n))
time, wherek
is the number of strings inset_number
, andn
is the number of all strings.To find all set numbers with at least one string associated:
The above requires
O((k+1)*log(n))
time, wherek
is the number of sets with at least one string, andn
is the number of all strings.Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.
Addidionally, here's some real C code which demonstrates this, using my own generic AVL tree implemetation. To compile it, it's enough to copy the
misc/
andstructure/
folders from BadVPN source somewhere and add an include path there.Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes handy when you have a lot of data to work with. To make it clear: the program below does only a single
malloc()
, which is the one that allocates the nodes array.如果条目从 0 到 N 并且连续:使用数组。 (索引对您来说足够快吗?)
编辑:数字似乎不连续。有大量的{key,value}对,其中键是一个很大的数字(>32位但<64位),而值是一堆字符串。
如果内存可用,哈希表很容易,如果字符串不是太大,您可以按顺序检查它们。如果相同的字符串出现(多次)多次,您可以枚举这些字符串(将指向它们的指针放入 char * array[] 中,并使用该数组的索引。查找给定字符串的索引可能涉及另一个哈希表)
对于“主”哈希表,条目可能是:
如果您有足够的内存来存储头数组,那么您当然应该这样做:
,否则您可以将头数组与条目数组组合起来。 已知的整数键集所做的那样)
(就像我在此处查找 碰撞很容易:当您遍历溢出链时,只需将您的密钥与条目中的密钥进行比较即可。如果它们不相等:继续前进。如果它们相等:找到;现在去抚弦吧。
If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)
EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.
If memory is available, a hash table is easy, if the bunch of strings is not too large you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings (put pointers to them in a char * array[] and use the index into that array instead. finding the index given a string probably involves another hash table)
For the "master" hashtable an entry would probably be:
if you have enough memory to store a heads-array, you chould certainly do that:
, otherwise you can combine the heads array with the array of entries. (like I did Lookups on known set of integer keys here)
Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.