C++: How to merge sorted vectors into one sorted vector / pop the least element from all of them?
I have a collection of about a hundred or so sorted vector<int>'s. Although most vectors have a small number of integers in them, some contain a large number (>10K) of them, so the vectors don't necessarily have the same size.
What I'd like to do is essentially iterate from the smallest to the largest integer contained across all these sorted vectors.
One way to do it would be to merge all these sorted vectors into one sorted vector and simply iterate over it. Thus,
Question 1: What is the fastest way to merge sorted vectors into a sorted vector?
On the other hand, I'm sure there are faster / cleverer ways to accomplish this without merging and re-sorting the whole thing -- perhaps popping the smallest integer iteratively from this collection of sorted vectors, without merging them first. So:
Question 2: What is the fastest / best way to pop the least element from a bunch of sorted vector<int>'s?
Based on replies below, and the comments to the question I've implemented an approach where I make a priority queue of iterators for the sorted vectors. I'm not sure if this is performance-efficient, but it seems to be very memory-efficient. I consider the question still open, since I'm not sure we've established the fastest way yet.
#include <iostream>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// compare (current, end) iterator pairs by the integer currently pointed at
struct cmp_seeds {
    bool operator () (const pair< vector<int>::iterator, vector<int>::iterator >& p1,
                      const pair< vector<int>::iterator, vector<int>::iterator >& p2) const {
        return *(p1.first) > *(p2.first);
    }
};

int pq_heapsort_trial() {
    /* Set up the Sorted Vectors */
    int a1[] = { 2, 10, 100 };
    int a2[] = { 5, 15, 90, 200 };
    int a3[] = { 12 };
    vector<int> v1 (a1, a1 + sizeof(a1) / sizeof(int));
    vector<int> v2 (a2, a2 + sizeof(a2) / sizeof(int));
    vector<int> v3 (a3, a3 + sizeof(a3) / sizeof(int));
    vector< vector<int> * > sorted_vectors;
    sorted_vectors.push_back(&v1);
    sorted_vectors.push_back(&v2);
    sorted_vectors.push_back(&v3);
    /* the above simulates the "for" I have in my own code that gives me sorted vectors */

    pair< vector<int>::iterator, vector<int>::iterator > c_lead;
    cmp_seeds mycompare;
    priority_queue< pair< vector<int>::iterator, vector<int>::iterator >,
                    vector< pair< vector<int>::iterator, vector<int>::iterator > >,
                    cmp_seeds > cluster_feeder(mycompare);

    // seed the queue with one (current, end) iterator pair per vector
    for (vector< vector<int> * >::iterator k = sorted_vectors.begin(); k != sorted_vectors.end(); ++k) {
        cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
    }

    // repeatedly pop the smallest head element, advance its iterator, re-insert
    while ( !cluster_feeder.empty() ) {
        c_lead = cluster_feeder.top();
        cluster_feeder.pop();
        cout << *(c_lead.first) << endl;   // sorted output
        ++c_lead.first;
        if (c_lead.first != c_lead.second) {
            cluster_feeder.push(c_lead);
        }
    }
    return 0;
}
3 Answers
One option is to use a std::priority_queue to maintain a heap of iterators, where the iterators bubble up the heap depending on the values they point at.

You could also consider using repeated applications of std::inplace_merge. This would involve appending all the data together into a big vector, remembering the offsets at which each distinct sorted block begins and ends, and then passing those into inplace_merge. This would probably be faster than the heap solution, although I think fundamentally the complexity is equivalent.

Update: I've implemented the second algorithm just described, repeatedly doing a merge sort in place. This code is on ideone.

This works by first concatenating all the sorted lists together into one long list. If there are three source lists, this means there are four 'offsets': four points in the full list between which the elements are sorted. The algorithm then pulls off three of these at a time, merging the two corresponding adjacent sorted lists into one sorted list, and remembering two of those three offsets to be used in new_offsets.

This repeats in a loop, with pairs of adjacent sorted ranges merged together, until only one sorted range remains.

Ultimately, I think the best algorithm would involve first merging together the shortest pairs of adjacent ranges.
The first thing that springs to mind is to make a heap structure containing iterators into each vector, ordered by the value they currently point at. (Each entry would need to contain the end iterator too, of course.)

The current element is at the root of the heap, and to advance, you simply either pop it or increase its key. (The latter can be done by popping, incrementing, then pushing.)

I believe this should have asymptotic complexity O(E log M), where E is the total number of elements and M is the number of vectors.

If you are really popping everything out of the vectors, you could make a heap of pointers to your vectors; you may want to treat those vectors as heaps too, to avoid the performance penalty of erasing from the front of a vector. (Or you could copy everything into deques first.)

Merging them all together by merging pairs at a time has the same asymptotic complexity if you're careful about the order. If you arrange all of the vectors in a full, balanced binary tree and then merge pairwise as you go up the tree, each element will be copied log M times, also leading to an O(E log M) algorithm.

For extra practical efficiency, instead of the tree, you should repeatedly merge the two smallest vectors until only one is left. (Again, putting pointers to the vectors in a heap is the way to go, but this time ordered by length.)

(Really, you want to order by "cost to copy" instead of length: an extra thing to optimize for certain value types.)

If I had to guess, the fastest way would be to use the second idea, but with an N-ary merge instead of a pairwise merge, for some suitable N (which I'm guessing would be either a small constant or roughly the square root of the number of vectors), performing the N-ary merge with the first algorithm above to enumerate the contents of N vectors at once.
I've used the algorithm given here and did a little abstracting, converting it to templates. I've coded this version in VS2010, using a lambda function instead of the functor. I don't know if this is in any sense 'better' than the previous version, but maybe it will be useful to someone?

The function priority_queue_sort::value_vectors sorts vectors containing values only, whereas priority_queue_sort::pair_vectors sorts vectors containing pairs of data according to the first data element. Hope someone can use this someday :-)