Choosing a data structure for very large data
I have x (millions of) positive integers, whose values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?
So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I'm not too sure whether I can implement such large keys, and in such large quantity, with a hash table (wouldn't that create a >0.8 load factor, in addition to being prone to collisions?).
Could I get some advice on which data structure might be suitable for my situation?
5 Answers
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
AVL trees are nice for data sets that are read-intensive but not read-only AND that require ordered enumeration, find-nearest, and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
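A minimal sketch of the bucketed layout this answer describes (the class and its names are my own invention, not a standard library type): collisions land in a per-slot bucket, and each bucket keeps keys and values in separate parallel arrays so a lookup scans only a contiguous, cache-friendly key array.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical bucketed hash table: keys and values live in separate
// arrays inside each bucket, so a lookup scans a tight key array and
// touches the value array only on a hit.
class BucketedHash {
public:
    explicit BucketedHash(std::size_t slots) : buckets_(slots) {}

    void insert(std::uint32_t key, std::string value) {
        Bucket& b = buckets_[key % buckets_.size()];
        b.keys.push_back(key);
        b.values.push_back(std::move(value));
    }

    // Linear search within the bucket; for short buckets this can beat
    // binary search because the key array fits in a few cache lines.
    std::optional<std::string> find(std::uint32_t key) const {
        const Bucket& b = buckets_[key % buckets_.size()];
        for (std::size_t i = 0; i < b.keys.size(); ++i)
            if (b.keys[i] == key) return b.values[i];
        return std::nullopt;
    }

private:
    struct Bucket {
        std::vector<std::uint32_t> keys;  // scanned during lookup
        std::vector<std::string> values;  // read only on a hit
    };
    std::vector<Bucket> buckets_;
};
```

Because the table proper holds only bucket headers, it can be sized well below one slot per key without the slowdown an open-addressed table would see at the same density.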
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n), so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
A bit vector, with the index set if the number is present. You can tweak it to hold the number of occurrences of each number instead. There is a nice column about bit vectors in Bentley's Programming Pearls.
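A sketch of that idea (class name is mine): one bit per possible value, so membership is a word load plus a mask. Covering the full positive 32-bit range (2^31 bits) costs a fixed 256 MB no matter how many numbers are actually stored, which is why this pays off only when the set is dense or the range is bounded.

```cpp
#include <cstdint>
#include <vector>

// Bit vector over [0, max]: bit n is set iff n is present.
class BitVector {
public:
    explicit BitVector(std::uint32_t max) : words_(max / 64 + 1, 0) {}

    void set(std::uint32_t n) { words_[n / 64] |= 1ull << (n % 64); }

    bool contains(std::uint32_t n) const {
        return (words_[n / 64] >> (n % 64)) & 1;
    }

private:
    std::vector<std::uint64_t> words_;  // 64 membership bits per word
};
```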
If memory isn't an issue, a map is probably your best bet. Maps are O(1), meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays the same.
A map where the key is the int, and the value is the name.
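In C++ terms that map would be a std::unordered_map (a hash map with average-case O(1) lookup; the helper below is just an illustration):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Integer key -> mapped name, average O(1) find.
using NameMap = std::unordered_map<std::uint32_t, std::string>;

// Returns the stored name, or an empty string when the key is absent.
std::string lookup(const NameMap& m, std::uint32_t key) {
    auto it = m.find(key);
    return it == m.end() ? std::string{} : it->second;
}
```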
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
Edit: I just saw in your question that you talk about some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If yes, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.
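The sorted-array alternative is a few lines in C++, using std::lower_bound rather than C's bsearch() (a sketch; the function name is mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Membership test over a sorted array: O(log n), roughly twentysomething
// probes for millions of records, with zero per-element overhead.
bool contains(const std::vector<std::uint32_t>& sorted, std::uint32_t key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;
}
```

The array must be kept sorted, so this fits workloads where the set is built once (or rarely updated) and then queried heavily.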