Using hash maps to represent a very large data source
I have a very large possible data set that I am trying to visualize at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.
I have received a second data source that gives more real-time information for each segment, but the id's do not correspond to the id's I have.
I have a 1:1 mapping of the data id's (9-character strings) to the current id's (long integers). The problem is that there are a lot of id's, and the data that is coming in is in no specific order.
The solution I came up with is to have a hash-map that maps the strings to the road id's. The problem is that I don't know if the hash-map will be efficient enough to have all 166k data entries.
Does anyone have any suggestions and/or hashing algorithms that I can use for this?
Answers (5)
Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."
If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.

Even if you have 500,000 9-character strings and an equal number of longs, that's still only 16-ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time. Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.
Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.

According to Wikipedia, your keys would take 2.5 bits/key, or about 50KB. That's negligible compared to the 664KB for the values.
Since the comments on the question indicate the primary concern may be memory usage:
What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.
You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.
Although 166k data entries is rather small IMO, you can have a look at google-sparsehash, which is designed for very low per-entry memory overhead.