如何有效地只保留重复项?
给定一个 STL 向量,仅按排序顺序输出重复项,例如,
INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }
该算法很简单,但目标是使其与 std::unique() 一样高效。我的幼稚实现就地修改了容器:
我的幼稚实现:
void not_unique(vector<int>* pv)
{
if (!pv)
return;
// Sort (in-place) so we can find duplicates in linear time
sort(pv->begin(), pv->end());
vector<int>::iterator it_start = pv->begin();
while (it_start != pv->end())
{
size_t nKeep = 0;
// Find the next different element
vector<int>::iterator it_stop = it_start + 1;
while (it_stop != pv->end() && *it_start == *it_stop)
{
nKeep = 1; // This gets set redundantly
++it_stop;
}
// If the element is a duplicate, keep only the first one (nKeep=1).
// Otherwise, the element is not duplicated so erase it (nKeep=0).
it_start = pv->erase(it_start + nKeep, it_stop);
}
}
如果您可以使其更加高效、优雅或通用,请告诉我。例如,自定义排序算法,或复制第二个循环中的元素以消除擦除()调用。
Given an STL vector, output only the duplicates in sorted order, e.g.,
INPUT : { 4, 4, 1, 2, 3, 2, 3 }
OUTPUT: { 2, 3, 4 }
The algorithm is trivial, but the goal is to make it as efficient as std::unique(). My naive implementation modifies the container in-place:
My naive implementation:
void not_unique(vector<int>* pv)
{
if (!pv)
return;
// Sort (in-place) so we can find duplicates in linear time
sort(pv->begin(), pv->end());
vector<int>::iterator it_start = pv->begin();
while (it_start != pv->end())
{
size_t nKeep = 0;
// Find the next different element
vector<int>::iterator it_stop = it_start + 1;
while (it_stop != pv->end() && *it_start == *it_stop)
{
nKeep = 1; // This gets set redundantly
++it_stop;
}
// If the element is a duplicate, keep only the first one (nKeep=1).
// Otherwise, the element is not duplicated so erase it (nKeep=0).
it_start = pv->erase(it_start + nKeep, it_stop);
}
}
If you can make this more efficient, elegant, or general, please let me know. For example, a custom sorting algorithm, or copy elements in the 2nd loop to eliminate the erase() call.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
我的第一次尝试惨遭失败,假设 std::unique 将所有重复项移动到范围的末尾(事实并非如此)。哎呀。这是另一种尝试:
这是
not_unique
的实现。它会删除排序范围内仅出现一次的所有元素以及出现多次的任何元素的重复项。因此,结果范围是出现多次的元素的唯一列表。该函数具有线性复杂度,并且在范围内执行一次传递(std::unique 具有线性复杂度)。
It
必须满足前向迭代器的要求。返回结果范围的末尾。I miserably failed with my first attempt, assuming that
std::unique
moves all the duplicates to the end of the range (it doesn't). Oops. Here's another attempt:Here is an implementation of
not_unique
. It removes any elements that appear only once in the sorted range and duplicates of any elements that appear more than once. The resulting range is therefore the unique list of elements that appear more than once.The function has linear complexity and does a single pass over the range (
std::unique
has linear complexity).It
must meet the requirements of a forward iterator. The end of the resulting range is returned.比以前的条目更短、更接近 STL。假设输入已排序。
Shorter and more STL-ish than previous entries. Assumes sorted input.
您甚至可以使用不匹配来获得额外积分!
顺便说一句:很好的锻炼。
You can even use mismatch, for extra points!
Btw: nice exercise.
我的建议是修改插入排序,以便您可以排序和排序。同时过滤欺骗。
My suggestion would be a modified insertion sort, so that you can sort & filter dupes at the same time.
这是标准库的风格。算法的功劳归功于 James! (如果你+1我,你最好+1他,否则)。我所做的只是使其成为标准库风格:
This is in the style of the standard library. Credit for algorithm goes to James! (If you +1 me, you better +1 him, or else). All I did was make it standard library style:
调用“erase(it_start + keep, it_stop);”在 while 循环中将导致一遍又一遍地复制剩余元素。
我建议将所有独特的元素交换到向量的前面,然后一次性删除剩余的元素。
Calling "erase(it_start + keep, it_stop);" from within the while loop is going to result in copying the remaining elements over and over again.
I'd suggest swapping all unique elements to the front of the vector, then erasing the remaining elements all at once.
我认为,从大 O 的角度来看,你已经尽可能地实现了它。最重要的成本是排序,其时间复杂度为 O(N log N)。不过,一种可能性是使用重复条目构建一个新向量,而不是使用现有向量并通过删除操作删除非重复项。然而,只有当重复的不同数量相对于条目总数较小时,这才会更好。
考虑一下极端的例子。如果原始数组由 1,000 个条目组成,只有一个重复项,那么输出将是一个只有一个值的向量。使用一个条目创建新向量可能比从原始向量中删除其他 999 个条目更有效。然而,我怀疑在现实世界的测试中,这种改变所节省的成本可能很难衡量。
编辑我只是在考虑“面试”问题。换句话说,这不是一个非常有用的答案。但可以在 O(N)(线性时间)而不是 O(N Log N) 内解决这个问题。使用存储空间代替CPU。创建两个“位”数组,最初将它们清除。循环遍历整数值向量。查找第一位数组中的每个值。如果未设置,则设置该位(将其设置为 1)。如果已设置,则设置第二个数组中的相应位(表示重复)。处理完所有向量条目后,扫描第二个数组并输出重复的整数(由第二个位数组中设置的位表示)。使用位数组的原因只是为了空间效率。如果处理 4 字节整数,则所需的原始空间为
(2 * 2^32 / 8 )
。但实际上可以通过使其成为稀疏数组来将其转变为可用的算法。非常伪的伪代码将是这样的:I think that from a big O standpoint, you have it implemented as good as it gets. The overriding cost is the sort, which is O(N log N). One possibility, though, would be to build up a new vector with the duplicate entries rather than use the existing vector with the delete operation removing the non-duplicates. However, this would only be better if the distinct number of duplicates is small relative to the total number of entries.
Consider the extreme example. If the original array consisted of 1,000 entries with only one duplicate, then the output would be a vector with just one value. It might be a bit more efficient to create the new vector with one entry rather than deleting the other 999 entries from the original vector. I suspect, however, that in real world testing, the savings of that change could be difficult to measure.
Edit I was just thinking about this in terms of "interview" question. In other words, this is not a terribly useful answer. But it would be possible to solve this in O(N) (linear time) as opposed to O(N Log N). Use storage space instead of CPU. Create two "bit" arrays with them cleared initially. Loop through your vector of integer values. Look up each value in the first bit array. If it is not set, then set the bit (set it to 1). If it is set, then set the corresponding bit in the second array (indicating a duplicate). After all vector entries are processed, scan through the second array and output the integers that are duplicates (indicated by the bits set in the second bit array). The reason for using bit arrays is just for space efficiency. If dealing with 4-byte integers, then the raw space required is
(2 * 2^32 / 8 )
. But this could actually be turned into a usable algorithm by making it a sparse array. The very pseudo pseudo-code would be something like this:另一张:
Another one:
这修复了 James McNellis 原始版本中的错误。我还提供就地和异地版本。
This fixes the bugs in James McNellis's original version. I also provide in-place and out-of-place versions.
“与 std::unique 一样高效”是什么意思?在运行时、开发时间、内存使用等方面高效吗?
正如其他人指出的那样, std::unique 需要排序输入,而您尚未提供,因此这不是一个公平的测试。
就我个人而言,我只会让 std::map 为我完成所有工作。它有很多属性,我们可以使用它来实现最大程度的优雅/简洁。它保持其元素已排序,如果键尚不存在,operator[] 将插入零值。通过利用这些属性,我们可以用两到三行代码完成此任务,并且仍然实现合理的运行时复杂性。
基本上,我的算法是这样的:对于向量中的每个元素,将由该元素的值键入的映射条目加一。然后,简单地遍历地图,输出任何值大于 1 的键。再简单不过了。
因此,我们访问向量中的每个元素一次,然后访问我的地图中的每个元素一次,其中我的地图中的元素数量最坏不大于你的向量。我的解决方案的缺点是比涉及就地重新排列矢量的解决方案需要更多的存储空间。然而,优点是显而易见的。它非常短且简单,显然是正确的,无需进行大量测试或代码审查,并且具有合理的性能属性。
将我的函数作为模板,并使其在 STL 样式范围上运行,而不仅仅是整数向量,这将作为练习。
What is meant by "as efficient as std::unique"? Efficient in terms of runtime, development time, memory usage, or what?
As others pointed out, std::unique requires sorted input, which you haven't provided, so it's not a fair test to begin with.
Personally I would just have a std::map do all of my work for me. It has a lot of properties we can use for maximal elegance/brevity. It keeps its elements sorted already, and operator[] will insert a zero value if the key doesn't already exist. By leveraging those properties, we can get this done in two or three lines of code, and still achieve reasonable runtime complexity.
Basically, my algorithm is this: For each element in the vector, increment by one the map entry keyed by the value of that element. Afterwards, simply walk the map, outputting any key whose value is more than 1. Couldn't be simpler.
So, we visit each element in your vector once, and then each element in my map once, where the number of elements in my map is at worst no bigger than your vector. The drawbacks to my solution are a lot more storage space than the solutions that involve rearranging your vector in-place. The advantages, however, are clear. It's incredibly short and simple, it's obviously correct without the need for much testing or code review, and it has reasonable performance properties.
Making my function a template, and making it operate on STL-style ranges instead of just vectors of ints, is left as an exercise.