Data structure for storing 2D points clustered near the origin?
I need to use a spatial 2D map for my application. The map usually contains a small number of values in the rectangle (-200, -200) - (200, 200), most of them around (0, 0).
I thought of using a hash map, but then I need a hash function. I thought of x * 200 + y, but then adding (0, 0) and (1, 0) will require 800 bytes for the hash table alone, and memory is a problem in my application.
The map is immutable after initial setup, so insertion time doesn't matter, but there are a lot of lookups (about 600 per second) and the target CPU isn't really fast.
What are the general memory/access-time trade-offs between a hash map and an ordinary map (an RB-tree in the STL, I believe) for small areas? What is a good hash function for small areas?
Comments (2)
I think that there are a few things that I need to explain in a bit more detail to answer your question.
For starters, there is a strong distinction between a hash function as it's typically used in a program and the number of buckets used in a hash table. In most implementations, the hash function is some mapping from objects to integers. The hash table is then free to pick any number of buckets it wants, and maps from those integers to its buckets. Commonly, this is done by taking the hash code modulo the number of buckets. This means that if you want to store points in a hash table, you don't need to worry about how large the values produced by your hash function are. For example, if the hash table has only three buckets and you produce objects with hash codes 0 and 1,000,000,000, then the first object would hash to bucket 0 and the second object would hash to bucket 1,000,000,000 % 3 = 1. You wouldn't need 1,000,000,000 buckets. Consequently, you shouldn't worry about picking a hash function like x * 200 + y: unless your hash table is implemented very oddly, the size of the hash codes has no effect on space usage.
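To make the distinction concrete, here is a minimal sketch, not a tuned implementation. The names Point, PointHash, bucket_for and PointSet are made up for illustration, and the hash assumes coordinates stay within [-200, 200]. It uses 401 as the multiplier instead of 200 so that every point in that range gets a distinct code, which is a nicety rather than a requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical point type, introduced only for this example.
struct Point {
    int x;
    int y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

// The hash function only maps a point to an integer.
// Assumption: x and y lie in [-200, 200], so each sum below is in [0, 400].
struct PointHash {
    std::size_t operator()(const Point& p) const {
        return static_cast<std::size_t>((p.x + 200) * 401 + (p.y + 200));
    }
};

// What a typical table does with a hash code: reduce it modulo the
// number of buckets it happens to have.
std::size_t bucket_for(std::size_t hash_code, std::size_t bucket_count) {
    assert(bucket_count > 0);
    return hash_code % bucket_count;
}

// The standard container chooses its own bucket count; the hash codes
// can be arbitrarily large without the table growing to match them.
using PointSet = std::unordered_set<Point, PointHash>;
```

With this, a PointSet holding only (0, 0) and (1, 0) stores hash codes in the tens of thousands, yet its bucket_count() is whatever small number the library picks for two elements.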
If you are creating a hash table in a way where you will be inserting only once and then spending a lot of time doing accesses, you may want to look into perfect hash functions and perfect hash tables. These are data structures that work by trying to find a hash function for the set of points that you're storing such that no collisions ever occur. They take (expected) O(n) time to create, and can do lookups in worst-case O(1) time. Barring the overhead from computing the hash function, this is the fastest way to look up points in space.
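As a rough illustration of the idea, the sketch below brute-forces a collision-free table for a fixed set of points. It is not one of the published perfect-hashing constructions and does not have their expected O(n) build time; it only shows what "no collisions, one probe per lookup" looks like. The names pack, PerfectPointSet and kEmpty, the candidate multipliers, and the [-200, 200] coordinate range are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <utility>
#include <vector>

constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;  // marker for an unused slot, never a valid key

// Pack a point from the assumed range [-200, 200]^2 into a unique key in [0, 160800].
std::uint32_t pack(int x, int y) {
    assert(x >= -200 && x <= 200 && y >= -200 && y <= 200);
    return static_cast<std::uint32_t>((x + 200) * 401 + (y + 200));
}

class PerfectPointSet {
public:
    explicit PerfectPointSet(const std::vector<std::pair<int, int>>& points) {
        std::vector<std::uint32_t> keys;
        for (const auto& p : points) keys.push_back(pack(p.first, p.second));
        // Try growing table sizes and a few multipliers until every key lands
        // in its own slot. Multiplier 1 with a table larger than the largest
        // key is always collision-free, so this search terminates.
        for (std::size_t size = keys.empty() ? 1 : keys.size();; ++size) {
            for (std::uint32_t mult : {1u, 3u, 7u, 31u, 127u, 8191u}) {
                if (try_build(keys, size, mult)) return;
            }
        }
    }

    // One multiplication, one modulo, one comparison: no probing, no chains.
    bool contains(int x, int y) const {
        if (x < -200 || x > 200 || y < -200 || y > 200) return false;
        const std::uint32_t key = pack(x, y);
        return slots_[slot_of(key, slots_.size(), mult_)] == key;
    }

    std::size_t table_size() const { return slots_.size(); }

private:
    static std::size_t slot_of(std::uint32_t key, std::size_t size, std::uint32_t mult) {
        return static_cast<std::size_t>((static_cast<std::uint64_t>(key) * mult) % size);
    }

    bool try_build(const std::vector<std::uint32_t>& keys, std::size_t size, std::uint32_t mult) {
        std::vector<std::uint32_t> slots(size, kEmpty);
        for (std::uint32_t key : keys) {
            std::uint32_t& slot = slots[slot_of(key, size, mult)];
            if (slot != kEmpty && slot != key) return false;  // collision: reject candidate
            slot = key;  // a repeated input point simply rewrites its own slot
        }
        slots_ = std::move(slots);
        mult_ = mult;
        return true;
    }

    std::vector<std::uint32_t> slots_;
    std::uint32_t mult_ = 1;
};
```

Since the map is immutable after setup, the search cost is paid once. The table costs four bytes per slot here; a value array of the same length could sit alongside it if each point carries a payload.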
If you were just to dump everything in a tree-based map like most implementations of std::map, though, you should be perfectly fine. With at most 401 x 401 = 160,801 points (coordinates from -200 to 200 inclusive), looking up a point takes about lg 160,801 ≈ 17.3, so roughly 18 comparisons. This is unlikely to be a bottleneck in any application, though if you really need all the performance you can get, the aforementioned perfect hash table is likely to be the best option.
However, both of these solutions only work if the queries you are interested in are of the form "does point p exist in the set or not?" If you want to do more complex geometric queries like nearest-neighbor lookups or finding all the points in a bounding box, you may want to look into more complex data structures like the k-d tree, which supports extremely fast (O(log n)) lookups and fast nearest-neighbor and range searches.
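For the tree-based option, a minimal sketch could look like the following. Coord, SpatialMap, build and find_at are names invented for this example, and int stands in for whatever value type the application actually stores.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;        // (x, y); std::pair already orders lexicographically
using SpatialMap = std::map<Coord, int>;  // int is a placeholder value type

// Build the map once from a list of (coordinate, value) entries.
SpatialMap build(const std::vector<std::pair<Coord, int>>& entries) {
    SpatialMap m;
    for (const auto& e : entries) {
        const bool inserted = m.emplace(e.first, e.second).second;
        assert(inserted && "duplicate coordinate in input");
        (void)inserted;  // silence unused-variable warnings when NDEBUG is set
    }
    return m;
}

// Returns a pointer to the value stored at (x, y), or nullptr if there is none.
const int* find_at(const SpatialMap& m, int x, int y) {
    const auto it = m.find(Coord(x, y));
    return it == m.end() ? nullptr : &it->second;
}
```

Each lookup walks one root-to-leaf path, so the cost is a handful of pair comparisons. The memory cost is one heap node per point, which in typical implementations carries three pointers and some bookkeeping on top of the key and value.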
Hope this helps!
Slightly confused by your terminology.
The "map" objects in the standard library are implementations of associative arrays (either via hash tables or binary search trees).
If you're doing 2D spatial processing and are looking to implement a search structure, there are many dedicated data structures - e.g. quadtrees and k-d trees.
Edit: For a few ideas on implementations, perhaps check: https://stackoverflow.com/questions/1402014/kdtree-implementation-c.
Honestly - the data structures aren't that complex - I've always rolled my own.
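To give a feel for how small a hand-rolled structure can be, here is a sketch of a static 2D k-d tree with a nearest-neighbor query. It is one possible layout, not the implementation from the linked question: the tree is stored implicitly in a vector arranged by median splits, which suits a map that never changes after setup. Pt and KdTree2D are names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Pt {
    int x;
    int y;
};

class KdTree2D {
public:
    explicit KdTree2D(std::vector<Pt> pts) : pts_(std::move(pts)) {
        build(0, pts_.size(), 0);
    }

    // Returns the stored point closest to q. The tree must not be empty.
    Pt nearest(Pt q) const {
        assert(!pts_.empty());
        Pt best = pts_[pts_.size() / 2];
        long long best_d2 = dist2(best, q);
        search(0, pts_.size(), 0, q, best, best_d2);
        return best;
    }

private:
    static long long dist2(Pt a, Pt b) {
        const long long dx = a.x - b.x;
        const long long dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    static int coord(Pt p, int axis) { return axis == 0 ? p.x : p.y; }

    // Put the median of pts_[lo, hi) along `axis` in the middle, then recurse
    // on both halves with the other axis. The node of a range is its midpoint.
    void build(std::size_t lo, std::size_t hi, int axis) {
        if (hi - lo <= 1) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(pts_.begin() + lo, pts_.begin() + mid, pts_.begin() + hi,
                         [axis](Pt a, Pt b) { return coord(a, axis) < coord(b, axis); });
        build(lo, mid, 1 - axis);
        build(mid + 1, hi, 1 - axis);
    }

    void search(std::size_t lo, std::size_t hi, int axis, Pt q,
                Pt& best, long long& best_d2) const {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        const Pt node = pts_[mid];
        const long long d2 = dist2(node, q);
        if (d2 < best_d2) {
            best_d2 = d2;
            best = node;
        }
        // Search the side containing q first; visit the other side only if the
        // splitting line is closer than the best match found so far.
        const long long delta = coord(q, axis) - coord(node, axis);
        if (delta < 0) {
            search(lo, mid, 1 - axis, q, best, best_d2);
            if (delta * delta < best_d2) search(mid + 1, hi, 1 - axis, q, best, best_d2);
        } else {
            search(mid + 1, hi, 1 - axis, q, best, best_d2);
            if (delta * delta < best_d2) search(lo, mid, 1 - axis, q, best, best_d2);
        }
    }

    std::vector<Pt> pts_;
};
```

Because the nodes live in one contiguous vector with no child pointers, the memory overhead beyond the points themselves is essentially zero, which may matter given the memory constraint in the question.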