What is the most efficient way to erase duplicates and sort a vector?
I need to take a C++ vector that potentially has a lot of elements, erase duplicates, and sort it.
I currently have the following code, but it doesn't work:
vec.erase(
std::unique(vec.begin(), vec.end()),
vec.end());
std::sort(vec.begin(), vec.end());
How do I do this correctly?
Additionally, is it faster to erase the duplicates first (similar to the code above) or to perform the sort first? If I do the sort first, is it guaranteed to remain sorted after std::unique is executed?
Or is there another (perhaps more efficient) way to do all this?
Comments (28)
I agree with R. Pate and Todd Gardner; a std::set might be a good idea here. Even if you're stuck using vectors, if you have enough duplicates, you might be better off creating a set to do the dirty work.

Let's compare three approaches:

Just using vector, sort + unique
Convert to set (manually)
Convert to set (using a constructor)

Here's how these perform as the number of duplicates changes:

Summary: when the number of duplicates is large enough, it's actually faster to convert to a set and then dump the data back into a vector.

And for some reason, doing the set conversion manually seems to be faster than using the set constructor -- at least on the toy random data that I used.
I redid Nate Kohl's profiling and got different results. For my test case, directly sorting the vector is always more efficient than using a set. I added a new, more efficient method, using an unordered_set.

Keep in mind that the unordered_set method only works if you have a good hash function for the type you need uniqued and sorted. For ints, this is easy! (The standard library provides a default hash which is simply the identity function.) Also, don't forget to sort at the end, since unordered_set is, well, unordered :)

I did some digging inside the set and unordered_set implementations and discovered that the constructor actually constructs a new node for every element, before checking its value to determine if it should actually be inserted (in the Visual Studio implementation, at least).

Here are the 5 methods:

f1: Just using vector, sort + unique
f2: Convert to set (using a constructor)
f3: Convert to set (manually)
f4: Convert to unordered_set (using a constructor)
f5: Convert to unordered_set (manually)

I did the test with a vector of 100,000,000 ints chosen randomly in ranges [1,10], [1,1000], and [1,100000].
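The benchmark code isn't reproduced above, but from the list the five variants presumably look something like this (a sketch; the exact bodies are my assumption):

#include <algorithm>
#include <set>
#include <unordered_set>
#include <vector>

// f1: sort the vector, then erase consecutive duplicates
void f1(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}

// f2: set via the range constructor; a set is already sorted and unique
void f2(std::vector<int>& v) {
    std::set<int> s(v.begin(), v.end());
    v.assign(s.begin(), s.end());
}

// f3: set filled by manual insertion
void f3(std::vector<int>& v) {
    std::set<int> s;
    for (int x : v) s.insert(x);
    v.assign(s.begin(), s.end());
}

// f4: unordered_set via the range constructor; must sort afterwards
void f4(std::vector<int>& v) {
    std::unordered_set<int> s(v.begin(), v.end());
    v.assign(s.begin(), s.end());
    std::sort(v.begin(), v.end());
}

// f5: unordered_set filled by manual insertion; must sort afterwards
void f5(std::vector<int>& v) {
    std::unordered_set<int> s;
    for (int x : v) s.insert(x);
    v.assign(s.begin(), s.end());
    std::sort(v.begin(), v.end());
}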
The results (in seconds, smaller is better):
std::unique only removes duplicate elements if they're neighbours: you have to sort the vector first before it will work as you intend.

std::unique is defined to be stable, so the vector will still be sorted after running unique on it.
I'm not sure what you are using this for, so I can't say this with 100% certainty, but normally when I think "sorted, unique" container, I think of a std::set. It might be a better fit for your use case:
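The snippet from this answer isn't shown above; presumably it was a construction along these lines, with Foo standing in for your element type:

std::set<Foo> foos(vec.begin(), vec.end()); // both sorted & unique already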
Otherwise, sorting prior to calling unique (as the other answers pointed out) is the way to go.
std::unique only works on consecutive runs of duplicate elements, so you'd better sort first. However, it is stable, so your vector will remain sorted.
Here's a template to do it for you:
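The template itself is missing above; a minimal sketch of what it plausibly looked like (removeDuplicates is a stand-in name):

#include <algorithm>

// Sorts the container, then erases the now-consecutive duplicates.
template <typename T>
void removeDuplicates(T& container) {
    std::sort(container.begin(), container.end());
    container.erase(std::unique(container.begin(), container.end()),
                    container.end());
}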
call it like:
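For instance, assuming the hypothetical removeDuplicates above:

std::vector<int> vec{3, 1, 2, 2, 3};
removeDuplicates(vec); // vec now holds {1, 2, 3}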
If you do not want to change the order of elements, then you can try this solution:
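The solution's code is missing above; a sketch of one standard order-preserving approach (dedupKeepOrder is a stand-in name; it uses O(n) extra memory for the seen set):

#include <algorithm>
#include <unordered_set>
#include <vector>

void dedupKeepOrder(std::vector<int>& v) {
    std::unordered_set<int> seen;
    // remove_if visits elements front to back; insert().second is false for
    // values already kept, so every later duplicate gets removed.
    v.erase(std::remove_if(v.begin(), v.end(),
                           [&seen](int x) { return !seen.insert(x).second; }),
            v.end());
}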
Efficiency is a complicated concept. There are time vs. space considerations, as well as general measurements (where you only get vague answers such as O(n)) vs. specific ones (e.g. bubble sort can be much faster than quicksort, depending on input characteristics).

If you have relatively few duplicates, then sort followed by unique and erase seems the way to go. If you have relatively many duplicates, creating a set from the vector and letting it do the heavy lifting could easily beat it.

Don't just concentrate on time efficiency either. Sort+unique+erase operates in O(1) space, while the set construction operates in O(n) space. And neither directly lends itself to a map-reduce parallelization (for really huge datasets).
Assuming that a is a vector, remove the contiguous duplicates using

a.erase(unique(a.begin(), a.end()), a.end());

It runs in O(n) time.
You need to sort it before you call unique, because unique only removes duplicates that are next to each other.

edit: 38 seconds...
unique only removes consecutive duplicate elements (which is necessary for it to run in linear time), so you should perform the sort first. It will remain sorted after the call to unique.
You can do this as follows:
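The snippet is missing above; the canonical form is presumably:

#include <algorithm>
#include <vector>

void sortAndUnique(std::vector<int>& vec) {
    std::sort(vec.begin(), vec.end());                          // group duplicates together
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());  // then drop them
}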
With the Ranges v3 library, you can simply use
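The snippet itself is missing above; with range-v3 actions it presumably looked something like this (the actions namespace is spelled ranges::action in older releases of the library):

#include <range/v3/action/sort.hpp>
#include <range/v3/action/unique.hpp>
#include <vector>

int main() {
    std::vector<int> vec{5, 2, 5, 3, 2};
    vec |= ranges::actions::sort | ranges::actions::unique; // mutates vec in place
    // vec is now {2, 3, 5}
}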
Note that it actually removes the duplicate elements, not just moves them.

Unfortunately, actions weren't standardized in C++20 as other parts of the ranges library were, so you still have to use the original library even in C++20.
As already stated, unique requires a sorted container. Additionally, unique doesn't actually remove elements from the container. Instead, unique returns an iterator pointing to the new logical end; the leftover elements past it are in an unspecified state, and you are expected to call erase to actually remove them.
The following simple method of tracking iterator positions, with O(n log n) time complexity and O(1) space complexity, seems not to have been mentioned earlier:
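The method's code is missing above; one plausible reading is the classic two-iterator in-place compaction after sorting (a sketch, names mine):

#include <algorithm>
#include <iterator>
#include <vector>

void sortAndDedupInPlace(std::vector<int>& v) {
    if (v.empty()) return;
    std::sort(v.begin(), v.end());               // O(n log n)
    auto write = v.begin();                      // position of the last kept element
    for (auto read = std::next(v.begin()); read != v.end(); ++read)
        if (*read != *write)
            *++write = *read;                    // compact in place: O(1) extra space
    v.erase(std::next(write), v.end());
}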
The standard approach suggested by Nate Kohl, just using vector, sort + unique:

doesn't work for a vector of pointers.

Look carefully at this example on cplusplus.com.

In their example, the "so called duplicates" moved to the end are actually shown as ? (undefined values), because those "so called duplicates" are SOMETIMES "extra elements" and SOMETIMES there are "missing elements" that were in the original vector.

A problem occurs when using std::unique() on a vector of pointers to objects (memory leaks, bad reads of data from the HEAP, duplicate frees, which cause segmentation faults, etc).

Here's my solution to the problem: replace std::unique() with ptgi::unique().

See the file ptgi_unique.hpp below:

And here is the UNIT Test program that I used to test it:
If you are looking for performance and using std::vector, I recommend the one that this documentation link provides.
More understandable code from: https://en.cppreference.com/w/cpp/algorithm/unique
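The code itself didn't survive here; a close adaptation of the cppreference-style example (the exact values are illustrative, not necessarily the original's):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 1, 1, 3, 3, 3, 4, 5, 4};

    std::sort(v.begin(), v.end());                // 1 1 1 2 3 3 3 4 4 5
    auto last = std::unique(v.begin(), v.end());  // 1 2 3 4 5 . . . . .
    v.erase(last, v.end());                       // 1 2 3 4 5

    for (int i : v)
        std::cout << i << ' ';
    std::cout << '\n';
}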
Output:
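With the illustrative values above, this prints:

1 2 3 4 5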
That's the problem with naming in the std... std::unique should rather be called std::trim_consecutive_duplicates, imho. That would make it clear that you need to sort the vector first, so that elements with the same value are adjacent to each other. In this case I doubt that anything related to a set is faster when arriving from a vector, but if you have the opportunity to put everything into a set from the start, you should definitely do that.

Here's a modern C++20 example (Demo):
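The example is missing above; a sketch of what a C++20 version presumably looks like, using std::ranges::sort and std::ranges::unique (the values are illustrative):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 4, 1, 5, 9, 2, 6, 5, 3};

    std::ranges::sort(v);                               // bring equal values together
    const auto [first, last] = std::ranges::unique(v);  // returns the leftover subrange
    v.erase(first, last);                               // drop the leftover tail

    for (int i : v)
        std::cout << i << ' ';
    std::cout << '\n';
}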
Output:
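With the illustrative values above, this prints:

1 2 3 4 5 6 9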
About alexK7's benchmarks: I tried them and got similar results, but when the range of values is 1 million, the cases using std::sort (f1) and using std::unordered_set (f5) produce similar times. When the range of values is 10 million, f1 is faster than f5.
If the range of values is limited and the values are unsigned int, it is possible to use std::vector, the size of which corresponds to the given range. Here is the code:
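The code is missing above; a hedged reconstruction of the idea, using a std::vector<bool> presence table sized to the range (function and parameter names are mine):

#include <vector>

std::vector<unsigned int> dedupLimitedRange(const std::vector<unsigned int>& v,
                                            unsigned int range) {
    std::vector<bool> seen(range, false);   // one flag per possible value
    for (unsigned int x : v)
        seen[x] = true;
    std::vector<unsigned int> result;
    for (unsigned int x = 0; x < range; ++x)
        if (seen[x])
            result.push_back(x);            // emitted in increasing order: already sorted
    return result;
}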
If your class is easily convertible to an int, and you have some memory to spare, unique can be done without sorting first, and it's much faster:
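The code is missing above; a hedged reconstruction of the idea (names are mine; it assumes the int keys fall in [0, maxKey)):

#include <algorithm>
#include <vector>

std::vector<int> uniqueThenSort(const std::vector<int>& v, int maxKey) {
    std::vector<char> seen(maxKey, 0);  // lookup table: the extra memory mentioned above
    std::vector<int> out;
    for (int x : v)
        if (!seen[x]) {                 // keep only the first occurrence: O(n)
            seen[x] = 1;
            out.push_back(x);
        }
    std::sort(out.begin(), out.end()); // sort the much smaller, duplicate-free result
    return out;
}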
Typical results:
duration sort + unique == 38985
duration unique + sort == 2500
and the results are the same
Most of the answers seem to be using O(n log n), but with the use of an unordered_set we can decrease it to O(n). I saw some of the solutions using sets, but I found this one, and it seems more elegant to use a set and iterators.
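The referenced snippet isn't shown above; a plausible sketch of the unordered_set approach (names are mine; note the final sort runs over the k unique survivors, not all n elements):

#include <algorithm>
#include <unordered_set>
#include <vector>

void dedupViaUnorderedSet(std::vector<int>& v) {
    std::unordered_set<int> s(v.begin(), v.end()); // average O(n) dedup
    v.assign(s.begin(), s.end());
    std::sort(v.begin(), v.end());                 // O(k log k) over unique values
}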
Depends on the use-case. If you expect fewer than about 100 unique positive integer values and you have a CPU capable of the AVX-512F instruction set, then you can insert elements at a rate of ~15 clock cycles per element, or 300-500 million inserts per second, by using simple comparisons against a small look-up table.

The following implementation uses CPU registers to do value lookups for ~50 unique values, and the L1 cache for ~1000 unique values. For the L1 cache version, it takes about 160 clock cycles per insertion, which is equivalent to about ~25M insertions per second and is slower than using std::set. For only 4 unique values, it inserts at a rate of 5.8 cycles per element, which is more than 500M/s.
According to my benchmark, a vector index sort based on this answer is the fastest way to implement deduplication.

It is faster than the hash sets (std::unordered_set, phmap::flat_hash_set).
Its memory overhead is one size_t per element (due to the stable_sort); the hash sets likewise cost on the order of one size_t per element.

When are hash sets faster?
A minimal implementation from that answer is below.
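The implementation itself is missing above; a hedged sketch of what a "vector index sort" dedup might look like (names and details are mine, not the benchmarked code):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> dedupIndexSort(const std::vector<int>& v) {
    std::vector<std::size_t> idx(v.size());        // the one size_t per element
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) { return v[a] < v[b]; });

    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < idx.size(); ++i)
        if (i == 0 || v[idx[i]] != v[idx[i - 1]])
            keep.push_back(idx[i]);                // first occurrence of each value

    std::sort(keep.begin(), keep.end());           // restore the original order
    std::vector<int> out;
    out.reserve(keep.size());
    for (std::size_t i : keep)
        out.push_back(v[i]);
    return out;
}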
I converted it to a more general, commented implementation here: nh2/cpp-dedup-benchmark
Benchmark results from https://github.com/nh2/cpp-dedup-benchmark:
Memory usage
Measured:
Here's an example of the duplicate-delete problem that occurs with std::unique(). On a LINUX machine, the program crashes. Read the comments for details.
This is a function that I created that you can use to delete repeats. The header files needed are just <iostream> and <vector>.