整数哈希问题

发布于 2024-10-31 22:41:09 字数 653 浏览 1 评论 0原文

我有一个 (C++) std::map,其中包含数百万个 MyObject* 类型的对象。我可以拥有的最大对象数量约为 1 亿个。关键是对象的 id。在某个过程中,必须尽快以某种方式标记这些对象(使用 01)。标记不能发生在对象本身上(因此我不能引入成员变量并将其用于标记过程)。由于我知道最小和最大 id(1 到 100_000_000),我首先想到的是使用 std::bit_set<100000000> 并在那里执行标记。这解决了我的问题,并且还使标记并行运行的进程变得更容易,因为它们使用自己的 bit_set 来标记事物,但我想知道如果我必须使用其他东西而不是 0 ,解决方案可能是什么-1 标记,例如,如果我必须用整数标记所有对象,我可以使用什么?

是否有某种形式的数据结构可以以紧凑(内存方面)的方式处理此类问题,并且速度很快?感兴趣的主要查询是对象是否被标记以及标记的内容。

谢谢。

注意:std::map 无法更改。无论我使用什么数据结构,都不能处理地图本身。

I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0-1 marking, e.g what could I use if I had to mark all objects with an integer number ?

Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was marked with.

Thank you.

Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

你爱我像她 2024-11-07 22:41:09

如何将地图的 value_type 设为 std::pair 而不是 MyObject*

How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?

铃予 2024-11-07 22:41:09

如果您不关心内存,那么 std::vector (或任何适合您需要的东西来代替 int)应该可以工作。

如果您不喜欢这样,并且无法修改地图,那么为什么不为标记创建并行地图呢?

std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;


如果您无法直接修改对象,您是否考虑过在将对象放入地图之前将其包装起来?例如:

struct
{
    int marker;
    T *p_x;
} T_wrapper;


std::map<int,T_wrapper> my_map;

如果您无论如何都需要进行查找,那么这不会慢。

编辑:正如@tenfour在他/她的回答中所建议的,std::pair在这里可能是一个更干净的解决方案,因为它保存了struct< /代码> 定义。就我个人而言,我不是 std::pair 的忠实粉丝,因为您必须将所有内容都称为 firstsecond,而不是而不是有意义的名字。但这只是我...

If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.

If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?

std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;


If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:

struct
{
    int marker;
    T *p_x;
} T_wrapper;


std::map<int,T_wrapper> my_map;

If you're going to need to do lookups anyway, then this will be no slower.

EDIT: As @tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...

ヤ经典坏疍 2024-11-07 22:41:09

要问自己的最重要的问题是“这 100,000,000 个对象中有多少可能被标记(或保持未标记)?”如果答案大约小于 100,000,000/(2*sizeof(int)),则只需使用另一个 std::setstd::tr1:: unordered_sethash_set 位于 tr1 之前)来跟踪哪些已标记(或未标记)。

2*sizeof(int) 从哪里来?它是对在将标记的项目列表的双端队列中维护堆结构所需的内存开销的估计。

如果它更大,则使用您即将使用的 std::bitset 。对于您所需的数量规模,其管理费用实际上为 0%。您将需要大约 13 MB 的连续 RAM 来保存位集。

如果您需要存储标记和状态,请使用 std::tr1::unordered_map(使用 Object* 键和 marker_type 值)代码>.同样,如果标记节点的百分比高于上述比较,那么您将需要使用某种 bitset 来保存所需的位数,并适当调整大小,即 12.5每比特兆字节。

考虑到需求的澄清,持有bitset的专门构建的对象可能是您的最佳选择。


编辑:这假设您已经对您可接受的解决方案进行了适当的时间复杂度计算,因为不再允许更改基本 std::map 结构。

The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set previous to tr1) to track which ones are so marked (or remained unmarked).

Where does 2*sizeof(int) come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.

If it is larger, then use std::bitset as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.

If you need to store a marking as well as presence, then use std::tr1::unordered_map using the key of Object* and value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.

A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.


Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.

三五鸿雁 2024-11-07 22:41:09

如果您不介意使用 hack,请查看 内存优化。它可以在存储指针的 LSB 中存储一位。

If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文