C++ map insertion and lookup performance and storage overhead


I would like to store a mapping of an integer key to a float value in-memory.

I have roughly 130 million keys (and, accordingly, 130 million values).

My focus is on lookup performance -- I have to do many, many millions of lookups.

The C++ STL library has a map class for associative arrays of this sort. I have several questions about map.

What is the storage overhead of map for a dataset of the size mentioned above? How does storage overhead scale, in general, with map?

It looks like the underlying data structure for map is a red-black, balanced binary tree. It sounds like the real-world performance for this is O(log n) for insertion and retrieval.

It mentions O(1) for a hinted insertion. My input is pre-sorted, so I believe I should be able to provide a hint for insertion events. How would I provide this hint, using the methods listed here?
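
For pre-sorted ascending input, the usual idiom is to pass end() as the hint: under GCC's libstdc++, inserting each successively larger key just before end() takes amortized constant time. A minimal sketch (the load function and its array parameters are hypothetical, not from the post):

#include <cstddef>
#include <map>
#include <utility>

// Hypothetical bulk loader: assumes keys[] is sorted ascending, so
// end() is always the correct position hint and each insertion runs
// in amortized constant time rather than O(log n).
void load( std::map<int, float>& m,
           const int* keys, const float* vals, std::size_t n ) {
    for ( std::size_t i = 0; i != n; ++i )
        m.insert( m.end(), std::make_pair( keys[i], vals[i] ) );
}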

Is there an STL container that provides better lookup performance?

Are there other publicly-available, open-source frameworks with an associative array class that uses an underlying data structure that would perform better than STL map?

If writing my own container class would provide better lookup performance, what data structures might I research?

I am using GCC 4 for this task, running under either Linux or Mac OS X.

I apologize in advance if these are dumb questions. Thank you for your advice.

Comments (8)

小苏打饼 2024-08-20 04:34:08

Given what you've said, I'd think very hard about using an std::vector<pair<int, float> >, and using std::lower_bound, std::upper_bound, and/or std::equal_range to look up values.
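
A minimal sketch of that approach, assuming the vector is sorted by key; the KeyLess comparator and the lookup helper are illustrative names, not part of any library:

#include <algorithm>
#include <utility>
#include <vector>

// Comparator so std::lower_bound can compare a stored pair against a
// bare key (C++03 has no heterogeneous lookup).
struct KeyLess {
    bool operator()( const std::pair<int, float>& a, int key ) const {
        return a.first < key;
    }
};

// Binary search in a vector sorted by key; returns true and sets out
// if the key is present.
bool lookup( const std::vector<std::pair<int, float> >& table,
             int key, float& out ) {
    std::vector<std::pair<int, float> >::const_iterator it =
        std::lower_bound( table.begin(), table.end(), key, KeyLess() );
    if ( it != table.end() && it->first == key ) {
        out = it->second;
        return true;
    }
    return false;
}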

While the exact overhead of std::map can (and does) vary, there's little or no room for question that it will normally consume extra memory and look up values more slowly than a binary search in a vector. As you've noted, it's normally (and almost unavoidably) implemented as some sort of balanced tree, which imposes overhead for the pointers and the balancing information, and typically means each node is allocated separately as well. Since your nodes are pretty small (typically 8 bytes) that extra data is likely to be at least as much as what you're actually storing (i.e. at least 100% overhead). Separate allocations often mean poor locality of reference, which leads to poor cache usage.
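
As a rough, hedged illustration of that overhead (the per-node figures assume a typical 64-bit libstdc++ layout, where each tree node carries a color field and three pointers plus per-allocation heap bookkeeping; exact numbers vary by platform):

#include <cstdio>
#include <map>
#include <utility>

int main() {
    const double nodes = 130e6;                                  // 130 million entries
    const double payload = sizeof(std::pair<const int, float>);  // ~8 bytes
    const double tree_overhead = 32;   // color + parent/left/right pointers, padded
    const double heap_overhead = 16;   // typical per-allocation bookkeeping (assumed)
    std::printf( "map:    ~%.1f GB\n",
                 nodes * (payload + tree_overhead + heap_overhead) / 1e9 );
    std::printf( "vector: ~%.1f GB\n", nodes * payload / 1e9 );
    return 0;
}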

Most implementations of std::map use a red-black tree. If you were going to use an std::map, an implementation that uses an AVL tree would probably suit your purposes better -- an AVL tree has slightly tighter constraints on balancing. This gives slightly faster lookup at the expense of slightly slower insertion and deletion (since it has to re-balance more often to maintain its stricter interpretation of "balanced"). As long as your data remains constant during use, however, an std::vector is still almost certainly better.

One other possibility worth noting: if your keys are at least fairly evenly distributed, you might want to try looking up using interpolation instead of bisection. I.e., instead of always starting at the middle of the vector, you do a linear interpolation to guess at the most likely starting point for the lookup. Of course, if your keys follow some known non-linear distribution, you can use a matching interpolation instead.

Assuming the keys are reasonably evenly distributed (or at least follow some predictable pattern that's amenable to interpolation), the interpolation search has a complexity of O(log log N). For 130 million keys, that works out to around 4 probes to find an item. To do significantly better than that with (normal/non-perfect) hashing, you need a good algorithm, and you need to keep the load factor in the table quite low (typically around 75% or so -- i.e. you need to allow for something like 32 million extra (empty) spots in your table to improve the expected complexity from four probes to three). I may just be old fashioned, but that strikes me as a lot of extra storage to use for such a small speed improvement.
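
A hedged sketch of interpolation search over a sorted array of distinct keys, assuming the roughly uniform distribution that yields the O(log log N) bound (the function name and signature are illustrative):

#include <cstddef>

// Interpolation search: probe the position proportional to where key
// falls between keys[lo] and keys[hi], instead of the midpoint.
// Returns the index of key, or n if it is absent.
std::size_t interp_search( const int* keys, std::size_t n, int key ) {
    if ( n == 0 ) return 0;
    std::size_t lo = 0, hi = n - 1;
    while ( keys[lo] <= key && key <= keys[hi] ) {
        // Fraction of the current value range lying below key, computed
        // in double to avoid integer overflow on wide key ranges.
        double frac = ( double(key) - double(keys[lo]) )
                    / ( double(keys[hi]) - double(keys[lo]) + 1.0 );
        std::size_t mid = lo + std::size_t( frac * double(hi - lo) );
        if ( keys[mid] == key ) return mid;
        if ( keys[mid] < key ) lo = mid + 1;   // sortedness guarantees mid < hi here
        else hi = mid - 1;                     // and mid > lo here, so no underflow
    }
    return n;   // not found
}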

OTOH, it's true that this is nearly the ideal situation for perfect hashing -- the set is known ahead of time, and the key is quite small (important, since hashing is normally linear on the key size). Even so, unless the keys are distributed pretty unevenly, I wouldn't expect any huge improvement -- a perfect hash function is often (usually?) fairly complex.

只想待在家 2024-08-20 04:34:08

A vector is absolutely going to kill a map here assuming you don't need to do insertions in the middle of the vector. I wrote a custom allocator to track memory usage, and here are the results in Visual Studio 2005:

std::map<int, float>:

1.3 million insertions
Total memory allocated: 29,859 KB
Total blocks allocated: 1,274,001
Total time: 17.5 seconds

std::vector<std::pair<int, float> >:

1.3 million insertions
Total memory allocated: 12,303 KB
Total blocks allocated: 1
Total time: 0.88 seconds

std::map is using over twice the storage and taking 20 times longer to insert all the items.

策马西风 2024-08-20 04:34:08

If your input is sorted, you should try just a vector and a binary search (i.e., lower_bound()). This might prove adequate (it is also O(log n)). Depending on the distribution of your keys and the hash function used, a hash_map might work as well. I think this is tr1::unordered_map in gcc.

单身狗的梦 2024-08-20 04:34:08

Most compilers ship with a non-standard (but working) hash_map (or unordered_map) that might be faster for you. It's coming in C++0x (is in tr1) and it is also (as always) in boost already.

GCC shipped one too, though I haven't done C++ there in some 12 years; it should still be in there somewhere.
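
A minimal sketch, assuming GCC 4's TR1 headers; the bucket count passed to the constructor is an illustrative guess whose point is to avoid rehashing during the bulk load:

#include <cstddef>
#include <tr1/unordered_map>
#include <utility>

typedef std::tr1::unordered_map<int, float> Table;

// Hypothetical bulk loader; constructing the table with a bucket count
// near the final element count avoids repeated rehashing while loading.
void load( Table& t, const int* keys, const float* vals, std::size_t n ) {
    for ( std::size_t i = 0; i != n; ++i )
        t.insert( std::make_pair( keys[i], vals[i] ) );
}

// usage:  Table t( 130000000 );  // bucket-count constructor
//         load( t, keys, vals, n );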

一袭白衣梦中忆 2024-08-20 04:34:08

You may have a look at std::tr1::unordered_map.

But if you have 32 bit unsigned integer keys (4294967296 possible values) and 130 million different keys, then you should write your own container optimized for this task. Especially if the 130 million key case is the usual case (and not only a rare maximum).

4294967296 / 130000000 = 33, so roughly every 33rd number in the whole key space occurs in your data.

You could, for example, divide your key range into fixed-size partitions. If the keys are rather evenly distributed, you should partition the key space into e.g. 256-sized buckets, or even 32-sized buckets, depending on how much storage you want to waste when only a few values are stored.

Example, to give you an idea:

#include <stdint.h>
#include <stddef.h>

#define BUCKET_SIZE  256
#define BUCKET_SIZE_SHIFT  8

struct Bucket {
  uint32_t key;
  float value;
  Bucket* pNext;   // overflow chain for further keys in this slice
};

// One head node per 256-key slice of the 32-bit key space:
// 4294967296 / 256 = 16777216 entries, zero-initialized
// (~256 MB on a 64-bit build).
Bucket data[ 4294967296ULL / BUCKET_SIZE ];

Bucket* find( uint32_t key ) {
  uint32_t bucket_index = key / BUCKET_SIZE;
  // or faster: uint32_t bucket_index = key >> BUCKET_SIZE_SHIFT;
  Bucket* pBucket = &data[ bucket_index ];
  while( pBucket ) {
    if( pBucket->key == key ) return pBucket;
    pBucket = pBucket->pNext;
  }
  return NULL;
}
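
For completeness, a hedged sketch of a matching insert. It assumes the static data[] array is zero-initialized and that key 0 is never stored, so an empty head node can be recognized by key == 0; a real implementation would carry an explicit occupancy flag and handle duplicate keys:

void insert( uint32_t key, float value ) {
  Bucket* pHead = &data[ key >> BUCKET_SIZE_SHIFT ];
  if( pHead->key == 0 ) {        // empty head slot (key 0 reserved)
    pHead->key = key;
    pHead->value = value;
    return;
  }
  Bucket* pNode = new Bucket;    // overflow node, chained behind the head
  pNode->key = key;
  pNode->value = value;
  pNode->pNext = pHead->pNext;
  pHead->pNext = pNode;
}
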
草莓酥 2024-08-20 04:34:08

If your keys are unchanging, you might consider a perfect hash function as an alternative to a standard container.

I don't know what obstacles you'll run into with a dataset of that size, but it might be worth spending a few minutes experimenting.

我的鱼塘能养鲲 2024-08-20 04:34:08

Considering the vast amount of memory involved, you must also consider that nearly any memory access during a search will result in a cache miss.

In this regard, a mixed solution with a small hash map as the first layer and sorted vectors for the buckets is probably best.

The idea is to keep the hash-table index in cache and to search within the smaller sorted containers, reducing the number of cache misses.
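
A hedged sketch of that layout with illustrative sizes: the key's top bits select one of a small number of first-layer slots, and each slot is a sorted vector searched with std::lower_bound (the slot count, shift, and names are assumptions, not from the answer):

#include <algorithm>
#include <utility>
#include <vector>

typedef std::vector<std::pair<unsigned, float> > Slice;

static const unsigned kSliceShift = 20;            // 32-bit key -> 4096 slices
static Slice slices[ 1u << (32 - kSliceShift) ];

struct KeyLess {
    bool operator()( const std::pair<unsigned, float>& a, unsigned key ) const {
        return a.first < key;
    }
};

// First layer: direct index by the key's top bits (the 4096 vector
// headers stay cache-resident). Second layer: binary search within a
// much smaller sorted slice, touching far fewer cache lines.
bool lookup( unsigned key, float& out ) {
    const Slice& s = slices[ key >> kSliceShift ];
    Slice::const_iterator it =
        std::lower_bound( s.begin(), s.end(), key, KeyLess() );
    if ( it != s.end() && it->first == key ) { out = it->second; return true; }
    return false;
}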

生生漫 2024-08-20 04:34:08

As a partial answer to your question concerning lookup performance, you have to consider your insertion pattern. You noted that std::map uses a Red-Black Tree as a hedge against linearizing a carefully sorted insertion into a linked list. Hence, such a tree provides O(log n) lookup time despite an aberrant insertion order. You pay a cost for this, however, in insertion, removal, and traversal performance, as well as losing locality of reference for repeated reading of "nearby" data.

A hash table may offer faster lookup if you can tune a hash function for your key type (an integer, you say) that will preclude collisions. If your data set was fixed, such that you could load it once and only read it afterward, you could use parallel arrays of integers and floats, and use std::lower_bound to find your match via binary search. Sorting the parallel arrays properly would be a chore, if your keys are divorced from their corresponding values, but you'd enjoy tighter storage and locality of reference than storing an array of, say, std::pair.
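
A minimal sketch of the parallel-array idea, assuming both vectors have been sorted together by key: the binary search touches only the dense key array, and the value is fetched by index afterward:

#include <algorithm>
#include <vector>

// keys and vals are parallel arrays: vals[i] belongs to keys[i], and
// both are ordered by key.
bool lookup( const std::vector<int>& keys, const std::vector<float>& vals,
             int key, float& out ) {
    std::vector<int>::const_iterator it =
        std::lower_bound( keys.begin(), keys.end(), key );
    if ( it == keys.end() || *it != key ) return false;
    out = vals[ it - keys.begin() ];
    return true;
}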
