整数哈希问题
我有一个 (C++) std::map
,其中包含数百万个 MyObject*
类型的对象。我可以拥有的最大对象数量约为 1 亿个。关键是对象的 id。在某个过程中,必须尽快以某种方式标记这些对象(使用 0
或 1
)。标记不能发生在对象本身上(因此我不能引入成员变量并将其用于标记过程)。由于我知道最小和最大 id(1 到 100_000_000),我首先想到的是使用 std::bit_set<100000000>
并在那里执行标记。这解决了我的问题,并且还使标记并行运行的进程变得更容易,因为它们使用自己的 bit_set 来标记事物,但我想知道如果我必须使用其他东西而不是 0 ,解决方案可能是什么
-1
标记,例如,如果我必须用整数标记所有对象,我可以使用什么?
是否有某种形式的数据结构可以以紧凑(内存方面)的方式处理此类问题,并且速度很快?感兴趣的主要查询是对象是否被标记以及标记的内容。
谢谢。
注意:std::map
无法更改。无论我使用什么数据结构,都不能处理地图本身。
I have a (C++) std::map<int, MyObject*>
that contains a couple of millions of objects of type MyObject*
. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0
or 1
) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000>
and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0
-1
marking, e.g what could I use if I had to mark all objects with an integer number ?
Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was marked with.
Thank you.
Note: std::map<int, MyObject*>
cannot be changed. Whatever data structure I use, must not deal with the map itself.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
如何将地图的
value_type
设为std::pair
而不是MyObject*
?How about making the
value_type
of your map astd::pair<bool, MyObject*>
instead ofMyObject*
?如果您不关心内存,那么
std::vector
(或任何适合您需要的东西来代替int
)应该可以工作。如果您不喜欢这样,并且无法修改地图,那么为什么不为标记创建并行地图呢?
如果您无法直接修改对象,您是否考虑过在将对象放入地图之前将其包装起来?例如:
如果您无论如何都需要进行查找,那么这不会慢。
编辑:正如@tenfour在他/她的回答中所建议的,std::pair
在这里可能是一个更干净的解决方案,因为它保存了struct< /代码> 定义。就我个人而言,我不是
std::pair
的忠实粉丝,因为您必须将所有内容都称为first
和second
,而不是而不是有意义的名字。但这只是我...If you're not concerned with memory, then a
std::vector<int>
(or whatever suits your need in place of anint
) should work.If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?
If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:
If you're going to need to do lookups anyway, then this will be no slower.
EDIT: As @tenfour suggests in his/her answer, a
std::pair
may be a cleaner solution here, as it saves thestruct
definition. Personally, I'm not a big fan ofstd::pair
s, because you have to refer to everything asfirst
andsecond
, rather than by meaningful names. But that's just me...要问自己的最重要的问题是“这 100,000,000 个对象中有多少可能被标记(或保持未标记)?”如果答案大约小于
100,000,000/(2*sizeof(int))
,则只需使用另一个std::set
或std::tr1:: unordered_set
(hash_set
位于tr1
之前)来跟踪哪些已标记(或未标记)。2*sizeof(int)
从哪里来?它是对在将标记的项目列表的双端队列中维护堆结构所需的内存开销的估计。如果它更大,则使用您即将使用的
std::bitset
。对于您所需的数量规模,其管理费用实际上为 0%。您将需要大约 13 MB 的连续 RAM 来保存位集。如果您需要存储标记和状态,请使用
std::tr1::unordered_map
(使用Object*
键和marker_type
值)代码>.同样,如果标记节点的百分比高于上述比较,那么您将需要使用某种bitset
来保存所需的位数,并适当调整大小,即 12.5每比特兆字节。考虑到需求的澄清,持有
bitset
的专门构建的对象可能是您的最佳选择。编辑:这假设您已经对您可接受的解决方案进行了适当的时间复杂度计算,因为不再允许更改基本
std::map
结构。The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly
100,000,000/(2*sizeof(int))
, then just use anotherstd::set
orstd::tr1::unordered_set
(hash_set
previous totr1
) to track which ones are so marked (or remained unmarked).Where does
2*sizeof(int)
come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.If it is larger, then use
std::bitset
as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.If you need to store a marking as well as presence, then use
std::tr1::unordered_map
using the key ofObject*
and value ofmarker_type
. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort ofbitset
to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.A purpose-built object holding the
bitset
might be your best choice, given the clarification of the requirements.Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base
std::map
structure is no longer permitted.如果您不介意使用 hack,请查看 内存优化。它可以在存储指针的 LSB 中存储一位。
If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.