当前位置：文江博客话题详情

在集合中查找重复元素并对它们进行分组的快速算法是什么？

发布于 2024-08-03 03:33:58 字数 14031 浏览 6 评论 0 原文

假设你有一个元素集合，如何以最少的比较找出那些重复的元素并将它们放入每个组中？最好是C++，但算法比语言更重要。例如给定 {E1,E2,E3,E4,E4,E2,E6,E4,E3}，我希望提取出 {E2,E2}，{E3,E3}，{E4,E4,E4}。你会选择什么数据结构和算法？还请包括设置数据结构的成本，例如，如果它是像 std::multimap 之类的预排序结构，则

更新

以使事情按照建议更清晰。有一个限制：元素必须自行比较以确保它们是重复的。

所以散列不适用，因为实际上它们将比较从重元素（例如数据块）转移到轻元素（整数），并减少了一些比较，但并没有消除它们，最后，我们回到了我们最初的问题是，何时位于一个碰撞桶内。

假设你有一堆潜在的重复文件，每个文件都有 GB 大小，它们通过人类已知的每个哈希算法具有相同的哈希值。现在您将发现真正的重复项。

不，这不可能是现实生活中的问题（即使 MD5 也足以为现实生活中的文件生成唯一的哈希值）。但假装这样我们就可以专注于寻找涉及最少比较量的数据结构+算法。

我正在做的是

表示为 STL std::list 数据结构（其中 1）它的元素删除比向量更便宜 2）它的插入更便宜，不需要排序。）
弹出一个元素并比较如果发现重复项，则将其从列表中删除。一旦到达列表末尾，就会发现一组重复项（如果有）。
重复上述2个步骤，直到列表为空。

最好的情况需要N-1，但是(N-1)！在最坏的情况下。

有哪些更好的选择？

我的代码使用上面解释的方法：

// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 


};

感谢下面的答案，但是他们似乎被我的例子误导了，因为它是关于整数的。事实上，元素类型不可知（不一定是整数、字符串或任何其他 POD），并且相等谓词是自定义的，即比较可能非常重 。

所以也许我的问题应该是：使用哪种数据结构+算法涉及较少的比较。

根据我的测试，使用像 multiset、multimap 这样的预排序容器并没有更好，因为

插入时的排序已经完成了比较，
下面的相邻发现再次进行比较，
这些数据结构更喜欢小于操作而不是等于操作，它们执行 2 less-than(a

我不明白它如何保存比较。

下面的一些答案忽略了另一件事，我需要将重复的组彼此区分开来，而不仅仅是将它们保留在容器中。

结论

经过所有讨论，我最初的天真的方法似乎有 3 种方法，

如上所述，
从像 std::vector 这样的线性容器开始，对其进行排序，然后找到
从关联容器开始的相等范围就像 std::map> 一样，按照 Charles Bailey 的建议在关联容器的设置过程中挑选出重复项。

我编写了一个示例来测试下面发布的所有方法。

重复项的数量以及它们的分发时间可能会影响最佳选择。

当他们在前面严重摔倒时，方法 1 是最好的，而在最后时，方法 1 是最差的。排序不会改变分布，而是改变字节序。
方法3的性能最平均
方法2永远不是最好的选择

感谢所有参与讨论的人。

以下代码的一个输出包含 20 个示例项目。

使用 [ 20 10 6 5 4 3 2 2 2 2 1 进行测试 1 1 1 1 1 1 1 1 1]

和 [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] 分别

使用 std::vector ->排序（） - > 相邻的查找（）：

比较：[ '<' = 139, '==' = 23 ]

比较：[ '<' = 38, '==' = 23]

使用 std::list ->排序（） - >收缩列表：

比较：[ '<' = 50，'==' = 43]

比较：[ '<' = 52, '==' = 43]

使用 std::list ->缩小列表：

比较：[ '<' = 0, '==' = 121]

比较：[ '<' = 0, '==' = 43]

使用 std::vector -> std::map>:

比较：[ '<' = 79, '==' = 0]

比较：[ '<' = 53, '==' = 0]

使用 std::vector -> std::multiset -> 相邻的查找（）：

比较：[ '<' = 79, '==' = 7]

比较：[ '<' = 53, '==' = 7]

代码

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}

原文

Say you have a collection of elements, how can you pick out those with duplicates and put them into each group with least amount of comparison? preferably in C++, but algorithm is more important than the language.
For Example
given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract out {E2,E2}, {E3,E3}, {E4,E4,E4}.
what data structure and algorithm you will choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap

Updates

To make things clearer as suggested. there's one constraint: the elements must be compared by themselves to be certain they are duplicates.

So hashes do not apply, because virtually they shift the comparison to from heavy elements(e.g. chunks of data) to light elements(integers), and reduce some comparison, but not do away with them, and in the end, we are back to our original problem, when are inside one collision bucket.

Pretend you have a bunch of potentials duplicate files of GBs each, they bear the same hash value by every hash-algorithm human beings know. Now you are going to spot the real duplicates.

No, it can't be a real-life problem(even MD5 is enough to generate unique hash for real-life files). But just pretend so that we can focus on finding the data structure + algorithm that involves least amount of comparison.

What I am doing is to

represent into a STL std::list data structure(in that 1) its element-deletion is cheaper than, say, a vector 2) its insertion is cheaper, not requiring sort.)
pop out one element and compare it with the rest, if a duplicate is found, it's pulled out of the list. once the end of the list is reached, one group of duplication is found, if any.
repeat the above 2 steps until the list is empty.

It needs N-1 in the best case, but (N-1)! in the worse case.

what are the better alternatives?

My code using method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 


};

Thanks for the answer below, however they seem to be misled by my example that it's about integers. In fact the elements are type agnostic(not necessarily integers, strings or any other PODs), and the equal predicates are self-defined, that is the comparison can be very heavy.

So maybe my question should be: using which data structure + algorithm involves fewer comparisons.

Using a pre-sorted container like multiset, multimap is not better according to my test, since

the sorting while inserting already does the comparisons,
the following adjacent finding does comparison again,
these data structure prefer less-than operations to equal operations, they perform 2 less-than(a

I do not see how it can save comparisons.

one more thing that's ignored by some answers below, I need to differentiate the duplicate groups from one another, not just keep them in the container.

Conclusion

After all the discussion, there seem to be 3 ways

my original naive method as explained above
Start with a linear container like std::vector , sort it and then locate the equal ranges
start with an associated container like std::map<Type, vector<duplicates>>, pick out the duplicates during the setup of associated container as suggested by Charles Bailey.

I've coded a sample to test all the methods as posted below.

the number of duplicates and when they are distributed may influence the best choice.

Method 1 is best when they fall heavily at the front, and is worst when at the end. Sort will not change the distribution, but the endian.
Method 3 has the most average performance
Method 2 is never the best choice

Thanks for all who participating in the discussion.

one output with 20 sample items from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1
1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
4 5 6 10 20 ] respectively

using std::vector -> sort() ->
adjacent_find():

comparisons: [ '<' = 139, '==' = 23
]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink
list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector ->
std::multiset ->
adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

冷情妓 2024-08-10 03:33:58

是的，你可以做得更好。

对它们进行排序（简单整数为 O(n)，一般为 O(n*log n)），然后保证重复项是相邻的，从而快速找到它们 O(n)
使用哈希表，也是 O(n ）。对于每个项目，(a) 检查它是否已经在哈希表中；如果是这样，则它是重复的；如果没有，则将其放入哈希表中。

编辑

您使用的方法似乎进行 O(N^2) 比较：

for i = 0; i < length; ++i           // will do length times
    for j = i+1; j < length; ++j     // will do length-i times
        compare

因此，对于长度 5，您需要进行 4+3+2+1=10 比较； 6 则执行 15，等等。准确地说，是 (N^2)/2 - N/2。对于任何相当高的 N 值，N*log(N) 较小。

在您的情况下，N 有多大？

至于减少哈希冲突，最好的方法是获得更好的哈希函数:-D。假设这是不可能的，如果您可以进行变体（例如，不同的模数），您也许可以进行嵌套哈希。

Yes, you can do much better.

Sort them (O(n) for simple integers, O(n*log n) in general), then duplicates are guaranteed to be adjacent, making finding them quick O(n)
Use a hash table, also O(n). For each item, (a) check if it's already in the hash table; if so, its a duplicate; if not, put it in the hash table.

edit

The method you're using seems to do O(N^2) comparisons:

for i = 0; i < length; ++i           // will do length times
    for j = i+1; j < length; ++j     // will do length-i times
        compare

So for length 5 you do 4+3+2+1=10 compares; for 6 you do 15, etc. (N^2)/2 - N/2 to be exact. N*log(N) is smaller, for any reasonably high value of N.

How big is N in your case?

As far as reducing hash collisions, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., different modulous), you may be able to do a nested hash.

回复收藏 0 原文

咽泪装欢 2024-08-10 03:33:58

1.最坏情况下对数组进行排序 O(n log n) - 合并排序/堆排序/二叉树排序等

2。比较邻居并找出匹配项 O(n)

回复收藏 0 原文

往事随风而去 2024-08-10 03:33:58

保持从值到计数的基于哈希表的结构；如果您的 C++ 实现不提供 std::hash_map （到目前为止还不是 C++ 标准的一部分！-），请使用 Boost 或从网络上获取版本。通过一次集合（即 O(N)），您可以进行值->计数映射；再次遍历哈希表（显然 <= O(N)）以识别计数 > 的值。 1 并适当地发出它们。总体 O(N)，您的建议并非如此。

回复收藏 0 原文

你げ笑在眉眼 2024-08-10 03:33:58

您可以使用从代表性元素到其他元素的列表/向量/双端队列的映射。这在插入容器时需要相对较少的比较，并且意味着您可以迭代结果组而无需执行任何比较。

此示例始终将第一个代表元素插入到映射的双端队列存储中，因为这使得通过组的后续迭代在逻辑上变得简单，但如果这种重复证明存在问题，那么很容易只执行 push_back if (!ins_pair.second)。

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

然后，迭代各组（相对）简单且便宜：

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

我进行了一些实验来进行比较和对象计数。在以随机顺序排列 100000 个对象形成 50000 个组（即每组平均 2 个对象）的测试中，上述方法花费了以下数量的比较和副本：（

1630674 comparisons, 443290 copies

我尝试降低副本数量，但只真正做到了比较的费用在您的场景中似乎是成本较高的操作。）

使用多重映射并在最终迭代中保留前一个元素来检测组转换的成本如下：

1756208 comparisons, 100000 copies

使用单个列表并弹出前面的元素并执行对其他组成员的线性搜索成本：

1885879088 comparisons, 100000 copies

是的，与我的最佳方法的 ~1.6m 比较相比，这大约是 1.9b 比较。为了让列表方法执行接近最佳数量的比较，必须对它进行排序，这将花费与构建固有有序容器相似的比较次数。

编辑

我采用了您发布的代码并在我之前使用的相同测试数据集上运行了隐含算法（我必须对代码做出一些假设，因为有一些假设的定义），并且我计算了：

1885879088 comparisons, 420939 copies

即与我的哑列表算法完全相同的比较次数，但副本更多。我认为这意味着我们在这种情况下使用本质上相似的算法。我看不到任何替代排序顺序的证据，但看起来您想要一个包含多个等效元素的组列表。这可以通过添加 if (i->size > 1) 子句在我的 Iterate 函数中简单地实现。

我仍然看不到任何证据表明构建一个排序容器（例如这个双端队列映射）不是一个好的（即使不是最佳的）策略。

You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively fewer comparisons in insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first representative element into the mapped deque storage as it makes the subsequent iteration through the group logically simple, but if this duplication proves a problem then it would be easy to only perform the push_back only if (!ins_pair.second).

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. and average of 2 objects per group) the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code as there as some assumed definitions) over the same test data set as I used before and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that that means we using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent elements. This can be simply achieved in my Iterate function by adding in if (i->size > 1) clause.

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

回复收藏 0 原文

十六岁半 2024-08-10 03:33:58

你尝试过排序吗？例如使用快速排序等算法？如果性能对您来说足够好，那么这将是一个简单的方法。

回复收藏 0 原文

伪心 2024-08-10 03:33:58

最简单的可能是对列表进行排序，然后迭代它以查找重复项。

如果您对数据有所了解，则可以使用更有效的算法。

例如，如果您知道列表很大，并且只包含 1..n 中的整数，其中 n 相当小，则可以使用一对布尔数组（或位图），并执行如下操作：

bool[] once = new bool[MAXIMUM_VALUE];
bool[] many = new bool[MAXIMUM_VALUE];
for (int i = 0; i < list.Length; i++)
    if (once[ value[i] ])
        many[ value[i] ] = true;
    else
        once[ value[i] ] = true;

现在，many[ ] 包含一个数组，其中的值被多次看到。

The simplest is probably to just sort the list, and then to iterate over it looking for dups.

If you know something about the data, more efficient algorithms are possible.

For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:

bool[] once = new bool[MAXIMUM_VALUE];
bool[] many = new bool[MAXIMUM_VALUE];
for (int i = 0; i < list.Length; i++)
    if (once[ value[i] ])
        many[ value[i] ] = true;
    else
        once[ value[i] ] = true;

Now, many[] contains an array of which values were seen more than once.

回复收藏 0 原文

九八野马 2024-08-10 03:33:58

如果已知它是一个整数列表，并且如果已知它们都在A和B之间（例如A=0，B=9），则创建BA元素的数组，并创建BA容器。

在非常具体的情况下（普通整数列表），我建议您只对它们进行计数，因为无论如何您无法区分不同的整数：

for(int i = 0; i < countOfNumbers; i++)
    counter[number[i]]++;

如果它们可以区分，则创建一个列表数组，然后将它们相加到相应的列表。

如果它们不是数字，请使用 std::map 或 std::hash_map，将键映射到值列表。

If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A elements, and create B-A containers.

In the very specific case (list of plain integers), I propose that you merely count them, since you cannot distinguish different integers, anyway:

for(int i = 0; i < countOfNumbers; i++)
    counter[number[i]]++;

If they are distinguishable, create an array of lists, and add them to the respective list.

If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.

回复收藏 0 原文

且行且努力 2024-08-10 03:33:58

从 C++11 开始，哈希表由 STL 提供 std::unordered_map 。
因此，O(N) 解决方案是将您的值放入 unordered_map< T，<向量 >。

回复收藏 0 原文

画尸师 2024-08-10 03:33:58

大多数人提到哈希/无序映射解决方案时都假设插入和查询时间为 O(1)，但最坏情况可能是 O(N)。此外，您还可以避免对象散列的成本。

就我个人而言，我会将对象插入二叉树（每个对象的插入时间为 O(logn)），并在每个节点处保持计数器。这将产生 O(nlogn) 构造时间和 O(n) 遍历来识别所有重复项。

回复收藏 0 原文

苏别ゝ 2024-08-10 03:33:58

如果我正确理解了这个问题，那么这是我能想到的最简单的解决方案：

std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = equal_range(cur, input.end(), *cur); // find all elements equal to the current one O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

总运行时间：O(n log n)。
您有一个排序过程 O(n lg n)，然后是第二个过程，其中为每个组执行 O(lg n) 比较（所以这最多执行n次，也产生O(n lg n)

请注意，这取决于输入是否为向量。只有随机访问迭代器在第二遍中具有对数复杂度，双向迭代器将是线性的，

这不依赖于散列（根据要求），并且它保留所有原始元素（而不是只为每个组返回一个元素）。发生频率的计数）

当然，对输出向量进行一些较小的常量优化是一个好主意，以避免。 input.end() 的获取频率也超出了必要的范围，并且可以轻松缓存，具体

取决于假设的组有多大，equal_range 可能不会。我认为它会进行二分搜索以获得对数复杂度，但如果每个组只有几个元素，那么简单的线性扫描会更快。无论如何，初始分类在成本中占主导地位。

If I've understood the question correctly, then this is the simplest solution I can think of:

std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = equal_range(cur, input.end(), *cur); // find all elements equal to the current one O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

Total running time: O(n log n).
You have one sorting pass O(n lg n) and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times, also yielding O(n lg n).

Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.

This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).

Of course, a number of smaller constant optimizations are possible. output.reserve(input.size()) on the output vector would be a good idea, to avoid reallocation. input.end() is taken much more often than necessary too, and could be easily cached.

Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would have been faster. In any case, the initial sorting dominates the cost though.

回复收藏 0 原文

别闹i 2024-08-10 03:33:58

只是为了注册我在我正在使用的三重存储标准化过程中遇到了同样的问题。我使用 Allegro Common Lisp 的哈希表功能在 Common Lisp 中实现了 Charles Bailey 总结的方法 3。

函数“等于代理？”用于测试 TS 中的两个代理是否相同。函数“merge-nodes”合并每个集群上的节点。在下面的代码中，“...”用于删除不那么重要的部分。

(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))

Just to register that I had the same problem during a normalization of a triple store that I am working with. I implemented the method 3, summarized by Charles Bailey, in Common Lisp using the hash-table feature from Allegro Common Lisp.

The function "agent-equal?" is used to test when two agents in the TS are the same. The function "merge-nodes" merges the nodes on each cluster. In the code below, the "..." was used to remove the not so important parts.

(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))

回复收藏 0 原文

~没有更多了~

关于作者

茶色山野

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

在集合中查找重复元素并对它们进行分组的快速算法是什么？

更新

结论

感谢所有参与讨论的人。

代码

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（11）

编辑

edit

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

在集合中查找重复元素并对它们进行分组的快速算法是什么？

更新

结论

感谢所有参与讨论的人。

代码

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（11）

编辑

edit

关于作者

相关话题

热门标签

推荐作者

留蓝

18790681156

zach7772

Wini

ayeshaaroy

初雪

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。