Using hash maps to represent a very large data source


I have a potentially very large data set that I am trying to visualize all at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.

I have received a second data source that gives more real-time information for each segment, but its ids do not correspond to the ids I have.

I have a 1:1 mapping of the data ids (9-character strings) to the current ids (long integers). The problem is that there are a lot of ids, and the data that is coming in is in no particular order.

The solution I came up with is to use a hash-map that maps the strings to the road ids. The problem is that I don't know whether the hash-map will be efficient enough to hold all 166k data entries.

Does anyone have any suggestions and/or hashing algorithms that I can use for this?
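A minimal sketch of the hash-map approach being considered, assuming C++11 and std::unordered_map; the ids and variable names below are placeholders rather than real data:

```cpp
// Minimal sketch of the hash-map approach, assuming C++11 or later.
// The keys/ids and names here are placeholders, not real data.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Map the 9-character data ids to the long road ids.
    std::unordered_map<std::string, std::int64_t> idMap;
    idMap.reserve(166000);  // pre-size so loading 166k entries never rehashes

    // In practice these entries come from the 1:1 mapping file.
    idMap["ABC123XYZ"] = 1001;
    idMap["DEF456UVW"] = 1002;

    // When a real-time record arrives, translate its id in O(1) on average.
    auto it = idMap.find("ABC123XYZ");
    if (it != idMap.end()) {
        std::cout << "road id: " << it->second << '\n';
    }
    return 0;
}
```

The default std::hash<std::string> should be more than adequate for 166k keys, so there is usually no need to hand-pick a hashing algorithm unless profiling says otherwise.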


Comments (5)

话少心凉 2024-07-26 20:58:00


Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."
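For comparison, a rough sketch of what a JudySL (string-keyed Judy array) lookup table might look like, assuming libjudy is installed and linked with -lJudy; the JSLI/JSLG/JSLFA macro names are taken from the JudySL man page and should be verified against your installed version, and the key/value literals are placeholders:

```cpp
// Rough sketch of a JudySL (string-keyed Judy array) used as the id table.
// Assumes libjudy is installed (link with -lJudy); verify the macro names
// against the JudySL man page for your version.
#include <Judy.h>
#include <cstdint>
#include <iostream>

int main() {
    Pvoid_t judy = nullptr;   // a Judy array is just an opaque pointer
    PWord_t slot = nullptr;   // pointer to the word-sized value slot

    char keyChars[] = "ABC123XYZ";  // placeholder 9-character data id
    uint8_t* key = reinterpret_cast<uint8_t*>(keyChars);

    // Insert: JSLI returns a pointer to the value slot for this key.
    JSLI(slot, judy, key);
    *slot = 1001;             // store the long road id (word-sized on 64-bit)

    // Lookup: JSLG leaves slot null if the key is absent.
    JSLG(slot, judy, key);
    if (slot != nullptr) {
        std::cout << "road id: " << *slot << '\n';
    }

    Word_t bytesFreed = 0;
    JSLFA(bytesFreed, judy);  // free the entire array
    return 0;
}
```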

苏别ゝ 2024-07-26 20:58:00


If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.

Even if you have 500,000 9-character strings and an equal number of longs, that's still only 16-ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time.

Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.
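As a hedged illustration of "try it and profile", the sketch below (plain C++11, with synthetic 9-character keys standing in for the real 1:1 mapping) loads 166k entries into a std::unordered_map and times the lookups:

```cpp
// Hedged sketch: load 166k synthetic 9-character keys into std::unordered_map
// and time the lookups (C++11). Real keys would come from the 1:1 mapping.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const std::size_t n = 166000;
    std::unordered_map<std::string, std::int64_t> idMap;
    idMap.reserve(n);

    std::vector<std::string> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        char buf[16];
        std::snprintf(buf, sizeof(buf), "K%08zu", i);  // e.g. "K00000042"
        keys.emplace_back(buf);
        idMap[keys.back()] = static_cast<std::int64_t>(i);
    }

    auto start = std::chrono::steady_clock::now();
    std::int64_t checksum = 0;
    for (const auto& k : keys) {
        checksum += idMap.at(k);  // one lookup per incoming real-time record
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << n << " lookups took " << ms.count() << " ms"
              << " (checksum " << checksum << ")\n";
    return 0;
}
```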

牵你手 2024-07-26 20:58:00


Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.

According to Wikipedia, your keys would take 2.5 bits/key, or about 50KB. That's negligible compared to the 664KB for the values.
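A sketch of what the cmph route could look like, assuming libcmph is installed (link with -lcmph); the calls follow the vector-adapter example in the cmph documentation, and the keys and ids below are placeholders:

```cpp
// Sketch of a minimal perfect hash built with cmph (link with -lcmph).
// The calls follow the cmph documentation's vector-adapter example; the
// keys and road ids are placeholders.
#include <cmph.h>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // The full set of 9-character data ids, known up front.
    std::vector<const char*> keys = {"ABC123XYZ", "DEF456UVW", "GHI789RST"};
    const cmph_uint32 nkeys = static_cast<cmph_uint32>(keys.size());

    cmph_io_adapter_t* source =
        cmph_io_vector_adapter(const_cast<char**>(keys.data()), nkeys);
    cmph_config_t* config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_CHD);  // compact and fast to build
    cmph_t* mphf = cmph_new(config);
    cmph_config_destroy(config);

    // The perfect hash maps each key to a distinct slot in [0, nkeys),
    // so the long road ids can live in a plain array of that size.
    std::vector<std::int64_t> roadIds(nkeys);
    roadIds[cmph_search(mphf, "ABC123XYZ", 9)] = 1001;
    roadIds[cmph_search(mphf, "DEF456UVW", 9)] = 1002;
    roadIds[cmph_search(mphf, "GHI789RST", 9)] = 1003;

    // Lookup for an incoming real-time record:
    const char* incoming = "DEF456UVW";
    unsigned int slot = cmph_search(
        mphf, incoming, static_cast<cmph_uint32>(std::strlen(incoming)));
    std::cout << "road id: " << roadIds[slot] << '\n';

    cmph_destroy(mphf);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
```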

终陌 2024-07-26 20:58:00


Since the comments on the question indicate the primary concern may be memory usage:

  • Use a pooling or other small-object-optimized allocator; assuming you have access to boost you can probably find a drop-in replacement in Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
  • If you know your strings are fixed-width, you may want to make sure you're allocating only enough space to store them. For example, a struct wrapped around a fixed-length char[] with a custom comparison operator may work better than a std::string. std::string comes with an additional dynamic allocation (and uses space for the corresponding pointer) and some extra size and capacity tracking overhead. (Generally, try to reduce the number of allocations that stick around; it reduces overhead.)
  • (Assuming STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); an RBtree-based std::map may be close enough to the lookup performance characteristics of your "hashmap" and may (or may not) be more memory efficient depending on your standard library implementation.

What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.

You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.
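As a concrete sketch of the fixed-width key idea from the second bullet above (standard C++11; the FNV-1a constants and all names are illustrative, not a recommendation of a specific hash):

```cpp
// Sketch of a fixed-width key wrapped in a struct (standard C++11). Storing
// the 9 characters inline avoids std::string's per-key heap allocation; the
// FNV-1a hash below is a simple placeholder, not a specific recommendation.
#include <array>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <unordered_map>

struct FixedId {
    std::array<char, 9> chars;

    // Assumes s points at a buffer of at least 9 characters.
    explicit FixedId(const char* s) { std::memcpy(chars.data(), s, 9); }
    bool operator==(const FixedId& other) const { return chars == other.chars; }
};

struct FixedIdHash {
    std::size_t operator()(const FixedId& id) const {
        std::size_t h = 14695981039346656037ull;  // FNV-1a offset basis
        for (char c : id.chars) {
            h ^= static_cast<unsigned char>(c);
            h *= 1099511628211ull;                 // FNV-1a prime
        }
        return h;
    }
};

int main() {
    std::unordered_map<FixedId, std::int64_t, FixedIdHash> idMap;
    idMap.reserve(166000);
    idMap.emplace(FixedId("ABC123XYZ"), 1001);  // placeholder entry

    auto it = idMap.find(FixedId("ABC123XYZ"));
    if (it != idMap.end()) {
        std::cout << "road id: " << it->second << '\n';
    }
    return 0;
}
```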

云之铃。 2024-07-26 20:58:00


Although 166k data entries is rather small IMO, you can have a look at google-sparsehash.
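A small sketch of what that might look like, assuming the sparsehash headers are installed; the header path and the set_deleted_key requirement follow the sparsehash documentation and should be checked against your version:

```cpp
// Sketch using google-sparsehash's sparse_hash_map, which trades some lookup
// speed for very low per-entry memory overhead. Header path and the
// set_deleted_key rule follow the sparsehash docs; verify against your version.
#include <cstdint>
#include <iostream>
#include <string>
#include <sparsehash/sparse_hash_map>

int main() {
    google::sparse_hash_map<std::string, std::int64_t> idMap;

    // Only needed if you ever erase(); must be a value no real key can take.
    idMap.set_deleted_key("");

    idMap["ABC123XYZ"] = 1001;  // placeholder 9-character data id -> road id

    auto it = idMap.find("ABC123XYZ");
    if (it != idMap.end()) {
        std::cout << "road id: " << it->second << '\n';
    }
    return 0;
}
```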
