How to make the elements of a vector unique? (remove non-adjacent duplicates)
I have a vector containing a few non-adjacent duplicates.
As a simple example, consider:
2 1 6 1 4 6 2 1 1
I am trying to make this vector unique by removing the non-adjacent duplicates while maintaining the order of elements.
The result would be:
2 1 6 4
The solutions I tried are:
- Inserting into a std::set, but the problem with this approach is that it disturbs the order of elements.
- Using a combination of std::sort and std::unique. But again, the same order problem.
- Manual duplicate elimination:
Define a temporary vector TempVector. for (each element in the vector) { if (the element does not exist in TempVector) { add to TempVector; } } Swap the original vector with TempVector.
My question is:
Is there any STL algorithm which can remove the non-adjacent duplicates from the vector? What is its complexity?
I think you would do it like this:
I would use two iterators on the vector:
The first one reads the data and inserts it into a temporary set.
When the read data was not yet in the set, you copy it from the first iterator to the second and increment the second.
At the end you keep only the data up to the second iterator.
The complexity is O(n log n), as the lookup for duplicated elements uses the set, not the vector.
Without using a temporary set it's possible to do this with (possibly) some loss of performance:
For smaller data sets, the implementation simplicity and the lack of extra allocation required may offset the theoretically higher complexity of using an additional set. Measurement with a representative input is the only way to be sure, though.
As the question was "is there any STL algorithm...? what is its complexity?" it makes sense to implement the function like std::unique:
So this is how std::unique is implemented, plus an extra set. The unordered_set should be faster than a regular set. All elements are removed that compare equal to an element preceding them (the first occurrence is kept, because we cannot unify to nothing). The iterator returned points to the new end within the range [first, last).
EDIT: The last sentence means that the container itself is NOT modified by unique. This can be confusing. The following example actually reduces the container to the unified set. Line 1 creates a vector { 5, 5, 5 }. In line 2, calling unique returns an iterator to the 2nd element, which is the first element that is not unique. Hence distance returns 1 and resize prunes the vector.
You can remove some of the loops in fa's answer using remove_copy_if:
This has no effect on the complexity of the algorithm (i.e., as written it's also O(n log n)). You can improve upon this using unordered_set, or, if the range of your values is small enough, you could simply use an array or a bit array.
There's no STL algorithm doing what you want while preserving the sequence's original order.
You could create a std::set of iterators or indexes into the vector, with a comparison predicate that uses the referenced data rather than the iterators/indexes to sort stuff. Then you delete everything from the vector that isn't referenced in the set. (Of course, you could just as well use another std::vector of iterators/indexes, std::sort and std::unique that, and use it as a reference as to what to keep.)
Based on the answer of @fa. It can also be rewritten using the STL algorithm std::stable_partition:
This way it is more compact and we don't need to care about the iterators.
It seems not to introduce too much of a performance penalty. I use it in my project, which often needs to handle quite large vectors of complex types, and it makes no real difference.
Another nice feature is that it is possible to adjust the notion of uniqueness by using std::set<int, myCmp_> tmpSet;. For instance, in my project, to ignore certain rounding errors.
There is a nice article by John Torjo which deals with this very question in a systematic way. The result he comes up with seems more generic and more efficient than any of the solutions suggested here so far:
http://www.builderau.com.au/program/java/soa/C-Removing-duplicates-from-a-range/0,339024620,320271583,00.htm
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1052159.html
Unfortunately, the complete code of John's solution seems to be no longer available, and John did not respond to my email. Therefore, I wrote my own code which is based on similar grounds to his, but intentionally differs in some details. Feel free to contact me (vschoech think-cell com) and discuss the details if you wish.
To make the code compile for you, I added some of my own library stuff which I use regularly. Also, instead of going with plain STL, I use Boost a lot to create more generic, more efficient, and more readable code.
Have fun!
The STL options are the ones you mentioned: put the items in a std::set, or call std::sort, std::unique and then erase() on the container. Neither of these fulfills your requirement of "removing the non-adjacent duplicates and maintaining the order of elements."
So why doesn't the STL offer some other option? No standard library will offer everything for every user's needs. The STL's design considerations include "be fast enough for nearly all users," "be useful for nearly all users," and "provide exception safety as much as possible" (and "be small enough for the Standards Committee", as the library Stepanov originally wrote was much larger, and Stroustrup axed out something like 2/3 of it).
The simplest solution I can think of would look like this:
This solution should be O(M * N) where M is the number of unique elements and N is the total number of elements.
As far as I know there is none in the STL. Look up the reference.
Based on @Corden's answer, but uses a lambda expression and removes the duplicates in place instead of writing them to an output vector:
Given that your input is in vector<int> foo, you can use remove to do the legwork for you here; then, if you want to shrink the vector, just use erase. Otherwise, just use last as your one-past-the-end iterator when you want the vector with duplicates removed but the order preserved:
Live Example
As far as time complexity, this will be O(nm), where n is the number of elements and m is the number of unique elements. As far as space complexity, this will use O(n) because it does the removal in place.