整数的快速集合并集

发布于 2024-12-08 13:05:34 字数 1319 浏览 3 评论 0原文

我需要创建大量有序整数集的并集(我想避免重复,但如果有的话也没关系)。

这是迄今为止性能最佳的代码:

// some code added for better understanding
std::vector< std::pair<std::string, std::vector<unsigned int> > vec_map;
vec_map.push_back(std::make_pair("hi", std::vector<unsigned int>({1, 12, 1450});
vec_map.push_back(std::make_pair("stackoverflow", std::vector<unsigned int>({42, 1200, 14500});

std::vector<unsigned int> match(const std::string & token){
    auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
    auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());

    std::vector<unsigned int> result;

    for(; lower != upper; ++lower){
        std::vector<unsigned int> other = lower->second;
        result.insert(result.end(), other.begin(), other.end());
    }
    std::sort(result.begin(), result.end()); // This function eats 99% of my running time

    return result;
}

valgrind(使用工具 callgrind)告诉我,我花了 99% 的时间进行排序。

这是我到目前为止尝试过的:

  • 使用 std::set (性能非常差)
  • 使用 std::set_union (性能差) 使用
  • std::push_heap 维护堆(慢 50%)

是否有希望以某种方式获得一些性能?我可以更改我的容器并使用 boost,也许还有其他一些库(取决于其许可证)。

编辑整数可以大到 10 000 000 编辑2给出了一些我如何使用它的例子,因为有些混乱

I need to make lots of unions of ordered set of integers (I would like to avoid duplicates, but it is okay if there are).

This is the code with the best performance so far :

// some code added for better understanding
std::vector< std::pair<std::string, std::vector<unsigned int> > vec_map;
vec_map.push_back(std::make_pair("hi", std::vector<unsigned int>({1, 12, 1450});
vec_map.push_back(std::make_pair("stackoverflow", std::vector<unsigned int>({42, 1200, 14500});

std::vector<unsigned int> match(const std::string & token){
    auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
    auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());

    std::vector<unsigned int> result;

    for(; lower != upper; ++lower){
        std::vector<unsigned int> other = lower->second;
        result.insert(result.end(), other.begin(), other.end());
    }
    std::sort(result.begin(), result.end()); // This function eats 99% of my running time

    return result;
}

valgrind (using the tool callgrind) tells me that I spend 99% of the time doing the sort.

This is what I tried so far :

  • Using std::set (very bad performance)
  • Using std::set_union (bad performance)
  • maintaining a heap with std::push_heap (50% slower)

Is there any hope to gain somehow some performance? I can change my containers and use boost, and maybe some other lib (depending on its licence).

EDIT integers can be as big a 10 000 000
EDIT 2 gave some example of how I use it, because of some confusion

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

多情出卖 2024-12-15 13:05:34

这看起来像一个多路合并的实例。根据输入(配置文件和时间!),最好的算法可能是您拥有的算法,或者通过从所有容器中选择最小整数或更复杂的算法来逐步构建结果的算法。

This looks like an instance of multi-way merge. Depending on the input (profile and time!), the best algorithm might be what you have or something that builds the result incrementally by selecting the smallest integer from all the containers or something more complex.

说谎友 2024-12-15 13:05:34

自定义合并排序可能提供一点帮助。

#include <string>
#include <vector>
#include <algorithm>
#include <map>
#include <iostream>
#include <climits>

typedef std::multimap<std::string, std::vector<unsigned int> > vec_map_type;
vec_map_type vec_map;
struct comp {
    bool operator()(const std::string& lhs, const std::pair<std::string, std::vector<unsigned int> >& rhs) const
    { return lhs < rhs.first; }
    bool operator()(const std::pair<std::string, std::vector<unsigned int> >& lhs, const std::string& rhs) const
    { return lhs.first < rhs; }
};
typedef comp comp2;

    std::vector<unsigned int> match(const std::string & token){
        auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
        auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());

        unsigned int num_vecs = std::distance(lower, upper);
        typedef std::vector<unsigned int>::const_iterator iter_type;
        std::vector<iter_type> curs;
        curs.reserve(num_vecs);
        std::vector<iter_type> ends;
        ends.reserve(num_vecs);
        std::vector<unsigned int> result;
        unsigned int result_count = 0;

        //keep track of current position and ends
        for(; lower != upper; ++lower){
            const std::vector<unsigned int> &other = lower->second;
            curs.push_back(other.cbegin());
            ends.push_back(other.cend());
            result_count += other.size();
        }
        result.reserve(result_count);
        //merge sort
        unsigned int last = UINT_MAX;
        if (result_count) {
            while(true) {
                //find which current position points to lowest number
                unsigned int least=0;
                for(unsigned int i=0; i< num_vecs; ++i ){
                    if (curs[i] != ends[i] && (curs[least]==ends[least] || *curs[i]<*curs[least]))
                        least = i;
                } 
                if (curs[least] == ends[least])
                    break;
                //push back lowest number and increase that vectors current position
                if( *curs[least] != last || result.size()==0) {
                    last = *curs[least];
                    result.push_back(last);
                            }
                ++curs[least];
            }
        }
        return result;
    }

    int main() {
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>(10, 10)));
        std::vector<unsigned int> t;
        t.push_back(1); t.push_back(2); t.push_back(11); t.push_back(12);
        vec_map.insert(vec_map_type::value_type("apple", t));
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>()));
        std::vector<unsigned int> res = match("apple");
        for(unsigned int i=0; i<res.size(); ++i)
            std::cout << res[i] << ' ';
        return 0;
    }

http://ideone.com/1rYTi

A custom merge sort may give a tiny amount of help.

#include <string>
#include <vector>
#include <algorithm>
#include <map>
#include <iostream>
#include <climits>

typedef std::multimap<std::string, std::vector<unsigned int> > vec_map_type;
vec_map_type vec_map;
struct comp {
    bool operator()(const std::string& lhs, const std::pair<std::string, std::vector<unsigned int> >& rhs) const
    { return lhs < rhs.first; }
    bool operator()(const std::pair<std::string, std::vector<unsigned int> >& lhs, const std::string& rhs) const
    { return lhs.first < rhs; }
};
typedef comp comp2;

    std::vector<unsigned int> match(const std::string & token){
        auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
        auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());

        unsigned int num_vecs = std::distance(lower, upper);
        typedef std::vector<unsigned int>::const_iterator iter_type;
        std::vector<iter_type> curs;
        curs.reserve(num_vecs);
        std::vector<iter_type> ends;
        ends.reserve(num_vecs);
        std::vector<unsigned int> result;
        unsigned int result_count = 0;

        //keep track of current position and ends
        for(; lower != upper; ++lower){
            const std::vector<unsigned int> &other = lower->second;
            curs.push_back(other.cbegin());
            ends.push_back(other.cend());
            result_count += other.size();
        }
        result.reserve(result_count);
        //merge sort
        unsigned int last = UINT_MAX;
        if (result_count) {
            while(true) {
                //find which current position points to lowest number
                unsigned int least=0;
                for(unsigned int i=0; i< num_vecs; ++i ){
                    if (curs[i] != ends[i] && (curs[least]==ends[least] || *curs[i]<*curs[least]))
                        least = i;
                } 
                if (curs[least] == ends[least])
                    break;
                //push back lowest number and increase that vectors current position
                if( *curs[least] != last || result.size()==0) {
                    last = *curs[least];
                    result.push_back(last);
                            }
                ++curs[least];
            }
        }
        return result;
    }

    int main() {
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>(10, 10)));
        std::vector<unsigned int> t;
        t.push_back(1); t.push_back(2); t.push_back(11); t.push_back(12);
        vec_map.insert(vec_map_type::value_type("apple", t));
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>()));
        std::vector<unsigned int> res = match("apple");
        for(unsigned int i=0; i<res.size(); ++i)
            std::cout << res[i] << ' ';
        return 0;
    }

http://ideone.com/1rYTi

江南月 2024-12-15 13:05:34

替代解决方案

方法 std::sort (如果它基于快速排序)非常适合对未排序的向量进行排序O(logN),对于排序的向量甚至更好,但是如果你的向量是倒置的排序的时间复杂度为 O(N^2)。
当您执行并集时,可能会出现很多操作数,其中第一个操作数包含比跟随者更大的值。

我会尝试以下操作(我想输入向量中的元素已经排序):

  • 按照其他人的建议,您应该在开始填充结果向量之前在结果向量上重新保留所需的大小。

  • 如果 std::distance(lower, upper) == 1 则没有理由合并,只需复制单个操作数的内容。

  • 对联合的操作数进行排序,可能是按大小(较大的优先),或者如果范围不重叠或仅按第一个值部分重叠),以便最大化下一个已排序的元素数量步。也许最好的策略是同时考虑联合中每个操作数的大小和范围。很大程度上取决于实际数据。

  • 如果每个操作数都包含很多元素,请继续将元素附加到结果向量的后面,但是在附加每个向量(从第二个开始)之后,您可以尝试合并(std::inplace_merge)旧向量 如果操作数很少,

  • 如果操作数的数量很大(与元素总数相比),那么您应该保留之前的排序策略,但在排序后调用 std::unique 来消除重复。在这种情况下,您应该按包含的元素范围进行排序。

ALTERNATIVE SOLUTION:

Method std::sort (if it is based on quick-sort) is very good sorting non-sorted vectorsO(logN), is even better with sorted vector, but if your vector is inverted sorted have O(N^2).
It may happen that when you perform the union you have a lot of operands, with the first one containing bigger values than the followers.

I would try the following (I suppose elements from input vectors are already sorted):

  • As suggested by other people here you should reseve the required size on the results vector before start filling it.

  • In case std::distance(lower, upper) == 1 there is no reason to union, just copy the contento of the single operand.

  • Sort the operands of your union, maybe by size (bigger first), or if the ranges don't overlap or partially overlap just by first value), so that you maximize the number of elements that are already sorted on the next step. Probably the best is a strategy that keep in consideration both things SIZE and RANGE of every operand of your union. A lot depends on the real data.

  • If there are few operands with a lot of elements each, keep appending the elements on the back of the result vector, but after appending each vector (from the second) you can try to merge (std::inplace_merge) the old content with the appendend one, this also deduplicates elements for you.

  • If the number of operands is big (comparing to the total number of elements) then you should stay with your former strategy of sorting afterward but call std::unique to deduplicate after sort. In this case you should sort by the range of elements contained.

荒路情人 2024-12-15 13:05:34

如果元素的数量是可能的int范围的相对较大的百分比,您可能会从本质上是简化的“散列连接”(使用数据库术语)。

(如果与可能值的范围相比,整数的数量相对较少,这可能不是最好的方法。)

本质上,我们制作一个巨大的位图,然后仅在与该位图对应的索引上设置标志输入 int 并最终根据这些标志重建结果:

#include <vector>
#include <algorithm>
#include <iostream>
#include <time.h>

template <typename ForwardIterator>
std::vector<int> IntSetUnion(
    ForwardIterator begin1,
    ForwardIterator end1,
    ForwardIterator begin2,
    ForwardIterator end2
) {

    int min = std::numeric_limits<int>::max();
    int max = std::numeric_limits<int>::min();

    for (auto i = begin1; i != end1; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    for (auto i = begin2; i != end2; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    if (min < std::numeric_limits<int>::max() && max > std::numeric_limits<int>::min()) {

        std::vector<int>::size_type result_size = 0;
        std::vector<bool> bitmap(max - min + 1, false);

        for (auto i = begin1; i != end1; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }

        for (auto i = begin2; i != end2; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }

        std::vector<int> result;
        result.reserve(result_size);
        for (std::vector<bool>::size_type index = 0; index != bitmap.size(); ++index)
            if (bitmap[index])
                result.push_back(index + min);

        return result;

    }

    return std::vector<int>();

}

void main() {

    // Basic sanity test...
    {

        std::vector<int> v1;
        v1.push_back(2);
        v1.push_back(2000);
        v1.push_back(229013);
        v1.push_back(-2243);
        v1.push_back(-530);

        std::vector<int> v2;
        v1.push_back(2);
        v2.push_back(243);
        v2.push_back(90120);
        v2.push_back(329013);
        v2.push_back(-530);

        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());

        for (auto i = result.begin(); i != result.end(); ++i)
            std::cout << *i << std::endl;

    }

    // Benchmark...
    {

        const auto count = 10000000;

        std::vector<int> v1(count);
        std::vector<int> v2(count);

        for (int i = 0; i != count; ++i) {
            v1[i] = i;
            v2[i] = i - count / 2;
        }

        std::random_shuffle(v1.begin(), v1.end());
        std::random_shuffle(v2.begin(), v2.end());

        auto start_time = clock();
        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());
        auto end_time = clock();
        std::cout << "Time: " << (((double)end_time - start_time) / CLOCKS_PER_SEC) << std::endl;
        std::cout << "Union element count: " << result.size() << std::endl;

    }

}

Time: 0.402

在我的机器上打印... ...。

如果您想从 std::vector 以外的其他地方获取输入 int,您可以实现自己的迭代器并将其传递给 IntSetUnion

If the number of elements is relatively large percentage of range of possible ints, you might get a decent performance out of what is essentially a simplified "hash join" (to use DB terminology).

(If there is a relatively small number of integers compared to range of possible values, this is probably not the best approach.)

Essentially, we make a giant bitmap, then set flags only on indexes corresponding to the input ints and finally reconstruct the result based on these flags:

#include <vector>
#include <algorithm>
#include <iostream>
#include <time.h>

template <typename ForwardIterator>
std::vector<int> IntSetUnion(
    ForwardIterator begin1,
    ForwardIterator end1,
    ForwardIterator begin2,
    ForwardIterator end2
) {

    int min = std::numeric_limits<int>::max();
    int max = std::numeric_limits<int>::min();

    for (auto i = begin1; i != end1; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    for (auto i = begin2; i != end2; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }

    if (min < std::numeric_limits<int>::max() && max > std::numeric_limits<int>::min()) {

        std::vector<int>::size_type result_size = 0;
        std::vector<bool> bitmap(max - min + 1, false);

        for (auto i = begin1; i != end1; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }

        for (auto i = begin2; i != end2; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }

        std::vector<int> result;
        result.reserve(result_size);
        for (std::vector<bool>::size_type index = 0; index != bitmap.size(); ++index)
            if (bitmap[index])
                result.push_back(index + min);

        return result;

    }

    return std::vector<int>();

}

void main() {

    // Basic sanity test...
    {

        std::vector<int> v1;
        v1.push_back(2);
        v1.push_back(2000);
        v1.push_back(229013);
        v1.push_back(-2243);
        v1.push_back(-530);

        std::vector<int> v2;
        v1.push_back(2);
        v2.push_back(243);
        v2.push_back(90120);
        v2.push_back(329013);
        v2.push_back(-530);

        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());

        for (auto i = result.begin(); i != result.end(); ++i)
            std::cout << *i << std::endl;

    }

    // Benchmark...
    {

        const auto count = 10000000;

        std::vector<int> v1(count);
        std::vector<int> v2(count);

        for (int i = 0; i != count; ++i) {
            v1[i] = i;
            v2[i] = i - count / 2;
        }

        std::random_shuffle(v1.begin(), v1.end());
        std::random_shuffle(v2.begin(), v2.end());

        auto start_time = clock();
        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());
        auto end_time = clock();
        std::cout << "Time: " << (((double)end_time - start_time) / CLOCKS_PER_SEC) << std::endl;
        std::cout << "Union element count: " << result.size() << std::endl;

    }

}

This prints...

Time: 0.402

...on my machine.

If you want to get your input ints from something else other than std::vector<int>, you can implement your own iterator and pass it to IntSetUnion.

橘味果▽酱 2024-12-15 13:05:34

您正在对有限范围内的整数进行排序,这是基数排序的罕见情况之一可以使用。不幸的是,知道这是否优于广义排序的唯一方法就是尝试一下。

You're sorting integers with a limited range, this is one of those rare circumstances where a radix sort could be used. Unfortunately the only way to know if this beats the generalized sort is to try it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文