Fast set union of integers

Posted 2024-12-08 13:05:34

I need to make lots of unions of ordered sets of integers (I would like to avoid duplicates, but it is okay if there are some).

This is the code with the best performance so far:

// some code added for better understanding
std::vector<std::pair<std::string, std::vector<unsigned int> > > vec_map;
vec_map.push_back(std::make_pair("hi", std::vector<unsigned int>({1, 12, 1450})));
vec_map.push_back(std::make_pair("stackoverflow", std::vector<unsigned int>({42, 1200, 14500})));

std::vector<unsigned int> match(const std::string & token){
    auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
    auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());

    std::vector<unsigned int> result;

    for(; lower != upper; ++lower){
        std::vector<unsigned int> other = lower->second;
        result.insert(result.end(), other.begin(), other.end());
    }
    std::sort(result.begin(), result.end()); // This function eats 99% of my running time

    return result;
}

valgrind (using the tool callgrind) tells me that I spend 99% of the time doing the sort.

This is what I have tried so far:

Using std::set (very bad performance)
Using std::set_union (bad performance)
Maintaining a heap with std::push_heap (50% slower)

Is there any hope of gaining some performance? I can change my containers and use boost, and maybe some other library (depending on its licence).

EDIT: integers can be as big as 10 000 000
EDIT 2: added an example of how I use it, because there was some confusion


多情出卖 2024-12-15 13:05:34

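This looks like an instance of multi-way merge. Depending on the input (profile and time it!), the best algorithm might be the one you already have, one that builds the result incrementally by selecting the smallest integer from all the containers, or something more complex.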
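
As a rough illustration of the incremental approach, here is a minimal sketch of a multi-way merge driven by a min-heap. The function name kway_union and the vector-of-vectors input are assumptions made for this example, not part of the original code. It assumes every input vector is already sorted. It costs O(N log k) for N total elements across k inputs, but the asker measured a heap-based attempt as slower than sorting, so it is worth profiling on real data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): merges k sorted vectors into one
// sorted vector without duplicates, using a min-heap of (value, input index).
std::vector<unsigned int> kway_union(const std::vector<std::vector<unsigned int> >& inputs) {
    typedef std::pair<unsigned int, std::size_t> entry;  // (current value, which input)
    std::priority_queue<entry, std::vector<entry>, std::greater<entry> > heap;
    std::vector<std::size_t> pos(inputs.size(), 0);      // index of the current element per input

    std::size_t total = 0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        assert(std::is_sorted(inputs[i].begin(), inputs[i].end()));  // precondition
        total += inputs[i].size();
        if (!inputs[i].empty())
            heap.push(entry(inputs[i][0], i));
    }

    std::vector<unsigned int> result;
    result.reserve(total);  // upper bound; duplicates make the real size smaller
    while (!heap.empty()) {
        const entry smallest = heap.top();
        heap.pop();
        // skip the value if it repeats the one emitted last
        if (result.empty() || result.back() != smallest.first)
            result.push_back(smallest.first);
        // refill the heap from the input the value came from
        const std::size_t i = smallest.second;
        if (++pos[i] < inputs[i].size())
            heap.push(entry(inputs[i][pos[i]], i));
    }
    return result;
}
```

With only a handful of inputs, a linear scan over the current positions may beat the heap because it avoids the heap bookkeeping.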


说谎友 2024-12-15 13:05:34

A custom merge sort may give a tiny amount of help.

#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

typedef std::multimap<std::string, std::vector<unsigned int> > vec_map_type;
vec_map_type vec_map;

std::vector<unsigned int> match(const std::string & token){
    // equal_range finds every entry for the token in O(log n); std::lower_bound
    // over multimap iterators would have to walk the map node by node
    auto range = vec_map.equal_range(token);
    auto lower = range.first;
    auto upper = range.second;

    unsigned int num_vecs = std::distance(lower, upper);
    typedef std::vector<unsigned int>::const_iterator iter_type;
    std::vector<iter_type> curs;
    curs.reserve(num_vecs);
    std::vector<iter_type> ends;
    ends.reserve(num_vecs);
    std::vector<unsigned int> result;
    unsigned int result_count = 0;

    // keep track of the current position and the end of every input vector
    for(; lower != upper; ++lower){
        const std::vector<unsigned int> &other = lower->second;
        curs.push_back(other.cbegin());
        ends.push_back(other.cend());
        result_count += other.size();
    }
    result.reserve(result_count);
    // multi-way merge; relies on every input vector being sorted
    if (result_count) {
        while(true) {
            // find which current position points to the lowest number
            unsigned int least = 0;
            for(unsigned int i = 0; i < num_vecs; ++i){
                if (curs[i] != ends[i] && (curs[least] == ends[least] || *curs[i] < *curs[least]))
                    least = i;
            }
            if (curs[least] == ends[least])
                break; // every input is exhausted
            // push back the lowest number unless it repeats the previous one,
            // then advance that vector's current position
            if (result.empty() || *curs[least] != result.back())
                result.push_back(*curs[least]);
            ++curs[least];
        }
    }
    return result;
}

int main() {
    vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>(10, 10)));
    std::vector<unsigned int> t;
    t.push_back(1); t.push_back(2); t.push_back(11); t.push_back(12);
    vec_map.insert(vec_map_type::value_type("apple", t));
    vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>()));
    std::vector<unsigned int> res = match("apple");
    for(unsigned int i = 0; i < res.size(); ++i)
        std::cout << res[i] << ' ';   // prints: 1 2 10 11 12
    return 0;
}

http://ideone.com/1rYTi


江南月 2024-12-15 13:05:34

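Alternative solution:

std::sort (if it is based on quicksort) sorts an unsorted vector in O(N log N) on average and does even better on an already sorted one, but a plain quicksort can degrade to O(N^2) when the vector is sorted in reverse. (Since C++11 the standard requires std::sort to be O(N log N) even in the worst case, so this mostly concerns older implementations.)
When you perform the union you may have many operands where the first contains larger values than the ones that follow.

I would try the following (I assume the elements of the input vectors are already sorted):

As others have suggested, reserve the required size in the result vector before you start filling it.
If std::distance(lower, upper) == 1 there is nothing to merge; just copy the contents of the single operand.
Sort the operands of the union, perhaps by size (largest first) or, if the ranges do not overlap or only partially overlap, by first value, so as to maximize the number of elements that are already in order for the next step. The best strategy probably takes both the size and the range of each operand into account. A lot depends on the real data.
If there are few operands with many elements each, keep appending elements to the back of the result vector, but after appending each vector (from the second on) try merging the old contents with the newly appended part using std::inplace_merge.
If the number of operands is large (compared with the total number of elements), keep your original sort-afterwards strategy, but call std::unique after sorting to remove duplicates. In this case you should order the operands by the range of elements they contain.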
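
The append-and-merge variant for a few large operands could look roughly like the sketch below. The function name union_by_inplace_merge and the vector-of-vectors parameter are assumptions made for illustration. It reserves once, takes the single-operand shortcut, merges each appended run with std::inplace_merge, and drops duplicates with std::unique at the end. Sorting the operands by size or range beforehand is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): unions sorted vectors by appending
// each one and merging it into the already-sorted part of the result.
std::vector<unsigned int> union_by_inplace_merge(
        const std::vector<std::vector<unsigned int> >& operands) {
    if (operands.size() == 1)
        return operands[0];  // single operand: nothing to merge, just copy it

    std::size_t total = 0;
    for (std::size_t i = 0; i < operands.size(); ++i)
        total += operands[i].size();

    std::vector<unsigned int> result;
    result.reserve(total);  // reserve once so the appends never reallocate

    for (std::size_t i = 0; i < operands.size(); ++i) {
        const std::vector<unsigned int>& v = operands[i];
        assert(std::is_sorted(v.begin(), v.end()));  // precondition of this sketch
        const std::ptrdiff_t old_size = static_cast<std::ptrdiff_t>(result.size());
        result.insert(result.end(), v.begin(), v.end());
        // [begin, begin + old_size) and [begin + old_size, end) are both sorted
        std::inplace_merge(result.begin(), result.begin() + old_size, result.end());
    }

    // inplace_merge keeps duplicates; std::unique removes adjacent equal values
    result.erase(std::unique(result.begin(), result.end()), result.end());
    return result;
}
```

Each std::inplace_merge call is linear in the current result size when it can allocate a temporary buffer, so the total work grows with the number of operands. That is why the answer reserves this approach for the few-operands case.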


荒路情人 2024-12-15 13:05:34

If the number of elements is a relatively large percentage of the range of possible ints, you might get decent performance out of what is essentially a simplified "hash join" (to use DB terminology).

(If there is a relatively small number of integers compared to the range of possible values, this is probably not the best approach.)

Essentially, we make a giant bitmap, set flags only at the indexes corresponding to the input ints, and finally reconstruct the result from these flags:

#include <algorithm>
#include <ctime>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

template <typename ForwardIterator>
std::vector<int> IntSetUnion(
    ForwardIterator begin1,
    ForwardIterator end1,
    ForwardIterator begin2,
    ForwardIterator end2
) {

    // Pass 1: find the range of values covered by both inputs.
    int min = std::numeric_limits<int>::max();
    int max = std::numeric_limits<int>::min();

    for (auto i = begin1; i != end1; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    for (auto i = begin2; i != end2; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    // Both inputs are empty exactly when no element lowered min or raised max.
    if (min > max)
        return std::vector<int>();

    // Pass 2: set one flag per distinct value. The width is computed in
    // long long so that max - min + 1 cannot overflow int.
    const long long width = static_cast<long long>(max) - min + 1;
    std::vector<int>::size_type result_size = 0;
    std::vector<bool> bitmap(static_cast<std::vector<bool>::size_type>(width), false);

    for (auto i = begin1; i != end1; ++i) {
        const std::vector<bool>::size_type index = static_cast<long long>(*i) - min;
        if (!bitmap[index]) {
            ++result_size;
            bitmap[index] = true;
        }
    }

    for (auto i = begin2; i != end2; ++i) {
        const std::vector<bool>::size_type index = static_cast<long long>(*i) - min;
        if (!bitmap[index]) {
            ++result_size;
            bitmap[index] = true;
        }
    }

    // Pass 3: rebuild the union, in ascending order, from the flags.
    std::vector<int> result;
    result.reserve(result_size);
    for (std::vector<bool>::size_type index = 0; index != bitmap.size(); ++index)
        if (bitmap[index])
            result.push_back(static_cast<int>(static_cast<long long>(index) + min));

    return result;

}

int main() {

    // Basic sanity test...
    {

        std::vector<int> v1;
        v1.push_back(2);
        v1.push_back(2000);
        v1.push_back(229013);
        v1.push_back(-2243);
        v1.push_back(-530);

        std::vector<int> v2;
        v2.push_back(2);
        v2.push_back(243);
        v2.push_back(90120);
        v2.push_back(329013);
        v2.push_back(-530);

        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());

        for (auto i = result.begin(); i != result.end(); ++i)
            std::cout << *i << std::endl;

    }

    // Benchmark...
    {

        const auto count = 10000000;

        std::vector<int> v1(count);
        std::vector<int> v2(count);

        for (int i = 0; i != count; ++i) {
            v1[i] = i;
            v2[i] = i - count / 2;
        }

        // std::random_shuffle was removed in C++17; std::shuffle replaces it.
        std::mt19937 rng(12345);
        std::shuffle(v1.begin(), v1.end(), rng);
        std::shuffle(v2.begin(), v2.end(), rng);

        auto start_time = std::clock();
        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());
        auto end_time = std::clock();
        std::cout << "Time: " << (((double)end_time - start_time) / CLOCKS_PER_SEC) << std::endl;
        std::cout << "Union element count: " << result.size() << std::endl;

    }

    return 0;

}

This prints...

Time: 0.402

...on my machine.

If you want to get your input ints from something other than std::vector<int>, you can implement your own iterator and pass it to IntSetUnion.
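
Because the question needs the union of many vectors of unsigned int bounded by 10 000 000, the same bitmap idea can be sketched for any number of inputs. The function name bitmap_union and the max_value parameter are assumptions made for this example. It is a simplified variant, not the IntSetUnion interface above, and it requires every value to be at most max_value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): bitmap union of any number of
// vectors whose values are all <= max_value. Inputs need not be sorted.
std::vector<unsigned int> bitmap_union(const std::vector<std::vector<unsigned int> >& inputs,
                                       unsigned int max_value) {
    // one flag per possible value in [0, max_value]
    std::vector<bool> seen(static_cast<std::size_t>(max_value) + 1, false);
    std::size_t distinct = 0;

    for (std::size_t i = 0; i < inputs.size(); ++i) {
        for (std::size_t j = 0; j < inputs[i].size(); ++j) {
            const unsigned int x = inputs[i][j];
            assert(x <= max_value);  // precondition assumed by this sketch
            if (!seen[x]) {
                seen[x] = true;
                ++distinct;
            }
        }
    }

    // walking the flags in index order yields the union already sorted
    std::vector<unsigned int> result;
    result.reserve(distinct);
    for (std::size_t value = 0; value < seen.size(); ++value)
        if (seen[value])
            result.push_back(static_cast<unsigned int>(value));
    return result;
}
```

Every call scans the whole bitmap, so the cost is proportional to max_value even for tiny inputs. That matches the caveat above: this approach only pays off when the inputs cover a large share of the value range.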
