How to reduce 1D/2D data in Thrust

Posted on 2025-01-10 09:26:29

I'm new to CUDA and the Thrust library. I'm trying to implement a function that calls a Thrust algorithm inside a for loop. Is there a way to replace this loop with a single Thrust call, or should I write a CUDA kernel instead?

Here is the code I have come up with:

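#include <cstdio>

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// thrust functor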
struct GreaterthanX
{
    const float _x;
    GreaterthanX(float x) : _x(x) {}

    __host__ __device__ bool operator()(const float &a) const
    {
        return a > _x;
    }
};

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int row = 3;
    int col = 4;
    thrust::device_vector<int> vec(row * col);
    thrust::device_vector<int> count(row);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the elements greater than 2 in each row
    for (int i = 0; i < row; i++)
    {
        // row i spans [i * col, (i + 1) * col) in the flattened vector
        count[i] = thrust::count_if(vec.begin() + i * col, vec.begin() + (i + 1) * col, GreaterthanX(2));
    }

    thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
    int max_val = *result;
    int position = static_cast<int>(result - count.begin());

    printf("result = %d at position %d\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}

My goal is to find the row that has the most elements greater than 2. I'm struggling with how to do this without a loop. Any suggestions would be much appreciated. Thanks.

め七分饶幸 2025-01-17 09:26:30

Solution using Thrust

Here is an implementation using thrust::reduce_by_key in conjunction with multiple "fancy iterators".

I also took the liberty of sprinkling in some const, auto and lambdas for elegance and readability. Because of the lambdas, you will need to pass the -extended-lambda flag to nvcc.

#include <cassert>
#include <cstdio>

#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/extrema.h> // thrust::max_element
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

int main(void)
{
    // fill a device_vector with
    // 3 2 4 5
    // 0 -2 3 1
    // 9 8 7 6
    int const row = 3;
    int const col = 4;
    thrust::device_vector<int> vec(row * col);
    vec[0] = 3;
    vec[1] = 2;
    vec[2] = 4;
    vec[3] = 5;
    vec[4] = 0;
    vec[5] = -2;
    vec[6] = 3;
    vec[7] = 1;
    vec[8] = 9;
    vec[9] = 8;
    vec[10] = 7;
    vec[11] = 6;
    thrust::device_vector<int> count(row);

    // Goal: For each row, count the number of elements greater than 2. 
    // And then find the row with the max count

    // count the element greater than 2 in vec

    // counting iterator avoids read from global memory, gives index into vec
    auto keys_in_begin = thrust::make_counting_iterator(0);
    auto keys_in_end = thrust::make_counting_iterator(row * col);
    
    // transform vec on the fly
    auto vals_in_begin = thrust::make_transform_iterator(
        vec.cbegin(), 
        [] __host__ __device__ (int val) { return val > 2 ? 1 : 0; });
    
    // discard to avoid write to global memory
    auto keys_out_begin = thrust::make_discard_iterator();
    
    auto vals_out_begin = count.begin();
    
    // transform keys (indices) into row indices and then compare
    // the divisions are one reason one might rather
    // use MatX for higher dimensional data
    auto binary_predicate = [col] __host__ __device__ (int i, int j){
        return i / col == j / col;
    };
    
    // this function returns a new end for count 
    // b/c the final number of elements is often not known beforehand
    auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
                                         vals_in_begin,
                                         keys_out_begin,
                                         vals_out_begin,
                                         binary_predicate);
    // make sure that we didn't provide too small of an output vector
    assert(thrust::get<1>(new_ends) == count.end());

    auto const result = thrust::max_element(count.begin(), count.end());
    int const max_val = *result;
    // thrust::distance returns a wide signed type; cast so it matches %d below
    auto const position = static_cast<int>(thrust::distance(count.begin(), result));

    std::printf("result = %d at position %d\r\n", max_val, position);
    // result = 4 at position 2

    return 0;
}
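
To make it easier to see what the reduce_by_key call above computes, here is a host-only sketch in plain C++. It is not Thrust's implementation, only an illustration of the semantics: consecutive elements whose keys compare equal under the predicate are summed into one output value. The helper names row_counts_greater_than and first_max_index are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Host-only illustration of a segmented sum over consecutive equal keys.
// The key of element i is its row index i / col, as in the answer's predicate.
// (Hypothetical helper, not part of Thrust.)
std::vector<int> row_counts_greater_than(const std::vector<int>& vec,
                                         int col, int threshold)
{
    assert(col > 0);
    const std::size_t ucol = static_cast<std::size_t>(col);
    std::vector<int> counts;
    for (std::size_t i = 0; i < vec.size(); ++i) {
        // Role of the transform_iterator: map each value to 1 or 0 on the fly.
        const int value = vec[i] > threshold ? 1 : 0;
        // Role of binary_predicate: do neighbours i - 1 and i share a row?
        const bool same_row = i > 0 && (i - 1) / ucol == i / ucol;
        if (same_row) {
            counts.back() += value;   // continue the current segment
        } else {
            counts.push_back(value);  // key changed: start a new segment
        }
    }
    return counts;
}

// Index of the first maximum, like max_element followed by distance.
// (Hypothetical helper, not part of Thrust.)
std::size_t first_max_index(const std::vector<int>& counts)
{
    assert(!counts.empty());
    return static_cast<std::size_t>(std::distance(
        counts.begin(), std::max_element(counts.begin(), counts.end())));
}
```

Calling row_counts_greater_than on the 3 x 4 example data with col = 4 and threshold = 2 gives one count per row, and first_max_index then picks the row. A host reference like this can also be handy for checking the device result on small inputs.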

Bonus solution using MatX

As mentioned in the comments, NVIDIA has released a new high-level C++17 library called MatX, which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries such as cuFFT, cuSOLVER and CUTLASS behind one Python-/MATLAB-like interface. At the time of writing (v0.2.2), the library is still in initial development and therefore probably does not guarantee a stable API. Because of this, because its performance is not yet as optimized as that of the more mature Thrust library, and because the documentation and samples are not yet exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug, which was fixed right away. As a result, this code only works on the main branch, not with the current release v0.2.2, and some of the features used might not appear in the documentation yet.

A solution using MatX looks like this:

#include <iostream>
#include <matx.h>

int main(void)
{
    int const row = 3;
    int const col = 4;
    auto tensor = matx::make_tensor<int, 2>({row, col});
    tensor.SetVals({{3, 2, 4, 5},
                    {0, -2, 3, 1},
                    {9, 8, 7, 6}});
    // tensor.Print(0,0); // print full tensor

    auto count = matx::make_tensor<int, 1>({row});
    // count.Print(0); // print full count

    // Goal: For each row, count the number of elements greater than 2.
    // And then find the row with the max count

    // the kind of reduction is determined through the shapes of tensor and count
    matx::sum(count, matx::as_int(tensor > 2));

    // A single value (scalar) is a tensor of rank 0: 
    auto result_idx = matx::make_tensor<matx::index_t>();
    auto result = matx::make_tensor<int>();
    matx::argmax(result, result_idx, count);

    cudaDeviceSynchronize();
    std::cout << "result = " << result() 
              << " at position " << result_idx() << "\r\n";
    // result = 4 at position 2

    return 0;
}

As MatX employs deferred execution operators, matx::as_int(tensor > 2) is effectively fused into the kernel, achieving the same effect as a thrust::transform_iterator in Thrust.
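
The idea behind deferred execution can be sketched on the host in plain C++. This is only an illustration under my own assumptions, not how MatX or Thrust implement it; LazyTransform, make_lazy and sum_all are hypothetical names. The transform is stored as an object and only applied when the reduction reads an element, so no intermediate array of 0/1 values is ever written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view that applies `f` on every read,
// so the transformed values are never stored anywhere.
template <typename F>
struct LazyTransform {
    const std::vector<int>& source;  // must outlive the view
    F f;

    std::size_t size() const { return source.size(); }
    int operator[](std::size_t i) const { return f(source[i]); }
};

// Convenience factory so the functor type can be deduced.
template <typename F>
LazyTransform<F> make_lazy(const std::vector<int>& source, F f)
{
    return LazyTransform<F>{source, f};
}

// Sums any indexable expression; the transform runs inside this loop,
// which is the host analogue of fusing it into the reduction kernel.
template <typename Expr>
int sum_all(const Expr& expr)
{
    int total = 0;
    for (std::size_t i = 0; i < expr.size(); ++i) {
        total += expr[i];
    }
    return total;
}
```

For instance, make_lazy(v, [](int x) { return x > 2 ? 1 : 0; }) builds the expression without touching the data, and sum_all then evaluates it element by element while accumulating.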

Because MatX knows about the regularity of the problem while Thrust does not, the MatX solution could potentially be more performant than the Thrust solution. It certainly is more elegant. It is also possible to construct tensors in already allocated memory, so one can mix the libraries, e.g. by constructing a tensor in the memory of a thrust::device_vector named vec via passing thrust::raw_pointer_cast(vec.data()) to the constructor of the tensor.
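
What "constructing a tensor in already allocated memory" means can be sketched with a small non-owning view in plain C++. This is a host-only illustration, not MatX's tensor type or constructor; RowMajorView, view_of and count_per_row are hypothetical names. Since the view knows the shape, the per-row loop needs no index division, which hints at why knowing the regularity of the problem can help.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view over row-major memory.
struct RowMajorView {
    int* data;         // borrowed pointer, never freed by the view
    std::size_t rows;
    std::size_t cols;

    int& operator()(std::size_t r, std::size_t c) const
    {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Wraps a vector's storage without copying; the vector must outlive the view.
RowMajorView view_of(std::vector<int>& storage,
                     std::size_t rows, std::size_t cols)
{
    assert(storage.size() == rows * cols);
    return RowMajorView{storage.data(), rows, cols};
}

// Counts the elements greater than `threshold` in each row of the view.
std::vector<int> count_per_row(const RowMajorView& view, int threshold)
{
    std::vector<int> counts(view.rows, 0);
    for (std::size_t r = 0; r < view.rows; ++r) {
        for (std::size_t c = 0; c < view.cols; ++c) {
            if (view(r, c) > threshold) {
                ++counts[r];
            }
        }
    }
    return counts;
}
```

Writes through the view land in the original vector, which is the property that lets two libraries share one allocation.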
