两个向量之间的差异甲和乙
我有两个名为 A
和 B
的 vector
对象。 MyType 类有一个字段 ID
,我想获取位于 A
中但不在 B
中的 MyType*
>。我正在开发图像分析应用程序,我希望找到一个快速/优化的解决方案。
I've got two vector<MyType*>
objects called A
and B
. The MyType class has a field ID
and I want to get the MyType*
which are in A
but not in B
. I'm working on a image analysis application and I was hoping to find a fast/optimized solution.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
无序方法通常具有二次复杂度,除非数据事先排序(通过 ID 字段),在这种情况下它将是线性的,并且不需要通过 B 重复搜索。
另一种解决方案是使用像 std::set 这样的有序容器将 CompareId 用于 StrictWeakOrdering 模板参数。我认为如果您需要应用大量集合运算,这会更好。这有其自身的开销(作为一棵树),但如果您确实发现这是一个效率问题,您可以实现一个快速内存分配器来超快地插入和删除元素(注意:只有在您分析并确定这是瓶颈)。
警告:进入有些复杂的领域。
您可以考虑另一种解决方案,如果适用,它可能会非常快,而且您永远不必担心数据排序。基本上,使任何共享相同 ID 的 MyType 对象组存储共享计数器(例如:指向 unsigned int 的指针)。
这将需要创建一个 ID 到计数器的映射,并且需要在每次基于其 ID 创建 MyType 对象时从映射中获取计数器。由于您的 MyType 对象具有重复的 ID,因此您不必像创建 MyType 对象那样频繁地插入到映射中(大多数对象可能只是获取现有计数器)。
除此之外,还有一个全局“遍历”计数器,每当获取时该计数器就会递增。
现在让我们回到存储 MyType* 的 A 和 B 向量的位置。为了获取 A 中不在 B 中的元素,我们首先调用 traversal_counter()。假设这是我们第一次调用它,则遍历值为 1。
现在迭代 B 中的每个 MyType* 对象,并将每个对象的共享计数器设置为从 0 到遍历值 1。
现在迭代每个 MyType * A 中的对象。计数器值与当前遍历值(1)不匹配的元素是 A 中不包含在 B 中的元素。
当遍历计数器溢出时会发生什么情况?在这种情况下,我们迭代存储在 ID 映射中的所有计数器,并将它们与遍历计数器本身一起设置回零。如果它是 32 位无符号整数,则只需在大约 40 亿次遍历中发生一次。
这是您可以应用于给定问题的最快解决方案。它可以对未排序的数据以线性复杂度执行任何集合操作(并且总是,不仅仅是在像哈希表这样的最佳情况下),但它确实引入了一些复杂性,因此只有在您确实需要时才考虑它。
The unordered approach will typically have quadratic complexity unless the data is sorted beforehand (by your ID field), in which case it would be linear and would not require repeated searches through B.
Another solution is to use an ordered container like std::set with CompareId used for the StrictWeakOrdering template argument. I think this would be better if you need to apply a lot of set operations. That has its own overhead (being a tree) but if you really find that to be an efficiency problem, you could implement a fast memory allocator to insert and remove elements super fast (note: only do this if you profile and determine this to be a bottleneck).
Warning: getting into somewhat complicated territory.
There is another solution you can consider which could be very fast if applicable and you never have to worry about sorting data. Basically, make any group of MyType objects which share the same ID store a shared counter (ex: pointer to unsigned int).
This will require creating a map of IDs to counters and require fetching the counter from the map each time a MyType object is created based on its ID. Since you have MyType objects with duplicate IDs, you shouldn't have to insert to the map as often as you create MyType objects (most can probably just fetch an existing counter).
In addition to this, have a global 'traversal' counter which gets incremented whenever it's fetched.
Now let's go back to where you have A and B vectors storing MyType*. To fetch the elements in A that are not in B, we first call traversal_counter(). Assuming it's the first time we call it, that will give us a traversal value of 1.
Now iterate through every MyType* object in B and set the shared counter for each object from 0 to the traversal value, 1.
Now iterate through every MyType* object in A. The ones that have a counter value which doesn't match the current traversal value(1) are the elements in A that are not contained in B.
What happens when you overflow the traversal counter? In this case, we iterate through all the counters stored in the ID map and set them back to zero along with the traversal counter itself. This will only need to occur once in about 4 billion traversals if it's a 32-bit unsigned int.
This is about the fastest solution you can apply to your given problem. It can do any set operation in linear complexity on unsorted data (and always, not just in best-case scenarios like a hash table), but it does introduce some complexity so only consider it if you really need it.
对两个向量进行排序 (
std::sort
)到 ID,然后使用std::set_difference
。例如,您需要定义一个自定义比较器来传递给这两种算法Sort both vectors (
std::sort
) according to ID and then usestd::set_difference
. You will need to define a custom comparator to pass to both of these algorithms, for example首先看问题。你想要“A中的一切而不是B中的一切”。这意味着您将必须访问“A 中的所有内容”。您还必须访问 B 中的所有内容才能了解 B 中存在和不存在的内容。因此,这表明应该有一个
O(n) + O(m)
解决方案,或者随意消除 n 和 m 之间的差异,O(2n)
。让我们考虑一下
std::set_difference
方法。每次排序的时间复杂度为O(n log n)
,set_difference 的时间复杂度为O(n)
。因此,sort-sort-set_difference 方法的复杂度是O(n + 2n log n)
。我们称之为O(4n)
。另一种方法是首先将 B 的元素放入集合(或映射)中。遍历 B 来创建集合的时间为
O(n)
加上每个元素的插入O(log n)
,然后遍历 AO(n),并查找A (log n) 的每个元素的总和:O(2n log n)
。我们称之为O(3n)
,这稍微好一些。最后,使用 unordered_set (或 unordered_map),并假设我们得到
O(1)
插入和O(1)
查找的平均情况,我们有一种方法 <代码>O(2n)。啊哈!这里真正的胜利是,unordered_set(或映射)可能是表示数据的最自然的选择,即,正确的设计会产生优化的实现。这种情况并不总是发生,但当它发生时那就太好了!
First look at the problem. You want "everything in A not in B". That means you're going to have to visit "everything in A". You'll also have to visit everything in B to have knowledge of what is and is not in B. So that suggests there should be an
O(n) + O(m)
solution, or taking liberty to elide the difference between n and m,O(2n)
.Let's consider the
std::set_difference
approach. Each sort isO(n log n)
, and set_difference isO(n)
. So the sort-sort-set_difference approach isO(n + 2n log n)
. Let's call thatO(4n)
.Another approach would be to first place the elements of B in a set (or map). Iteration across B to create the set is
O(n)
plus insertionO(log n)
of each element, followed by iteration across A O(n), with a lookup for each element of A (log n), gives a total:O(2n log n)
. Let's call thatO(3n)
, which is slightly better.Finally, using an unordered_set (or unordered_map), and assuming we get average case of
O(1)
insertion andO(1)
lookup, we have an approach that isO(2n)
. A-ha!The real win here is that unordered_set (or map) is probably the most natural choice to represent your data in the first place, i.e., the proper design yields the optimized implementation. That doesn't always happen, but it's nice when it does!
如果 B 先于 A 存在,那么在填充 A 时,您可以在 C 向量中进行记账。
If B preexists to A, then while populating A, you can bookkeep in a C vector.