Thrust inside user-written kernels

Posted on 2024-10-28 03:43:22


I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.

I would like to know if I can pass a device_vector to my own kernel? How?
If yes, what are the operations permitted on it inside kernel/device code?

Comments (5)

-柠檬树下少年和吉他 2024-11-04 03:43:22


As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

thrust::device_vector< Foo > fooVector;
// Do something thrust-y with fooVector

Foo* fooArray = thrust::raw_pointer_cast( fooVector.data() );

// Pass raw array and its size to kernel
someKernelCall<<< x, y >>>( fooArray, fooVector.size() );

You can also use device memory not allocated by Thrust within Thrust algorithms by wrapping the bare CUDA device memory pointer in a thrust::device_ptr.
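
For the reverse direction, here is a minimal sketch (the size N and the use of thrust::fill / thrust::reduce are illustrative, not taken from the original answer):

#include <stdio.h>

#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>

int main() {

    const int N = 1024;

    // Bare CUDA allocation, not owned by Thrust
    float *d_raw = NULL;
    cudaMalloc((void**)&d_raw, N * sizeof(float));

    // Wrap the raw pointer so Thrust algorithms can treat it as device memory
    thrust::device_ptr<float> d_begin(d_raw);
    thrust::fill(d_begin, d_begin + N, 1.0f);
    float sum = thrust::reduce(d_begin, d_begin + N);   // expected: 1024.0

    printf("sum = %f\n", sum);

    cudaFree(d_raw);

}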

Edited four and a half years later to add that, as per @JackOLantern's answer, Thrust 1.8 adds a sequential execution policy, which means you can run single-threaded versions of Thrust's algorithms on the device. Note that it still isn't possible to pass a Thrust device vector directly to a kernel, and device vectors can't be used directly in device code.

Note that it is also possible to use the thrust::device execution policy in some cases to have parallel Thrust execution launched by a kernel as a child grid. This requires separate compilation/device linkage and hardware that supports dynamic parallelism. I am not certain whether this is actually supported in all Thrust algorithms, but it certainly works with some.

揽清风入怀 2024-11-04 03:43:22


Edit: Dynamic parallelism in Thrust was deprecated with Thrust 1.15.0. See "Using thrust::device execution policy in device code should fail to compile" for the reasoning and alternatives.


This is an update to my previous answer.

Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel within a single CUDA thread exploiting CUDA dynamic parallelism. Below, an example is reported.

#include <stdio.h>

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

#define BLOCKSIZE_1D    256
#define BLOCKSIZE_2D_X  32
#define BLOCKSIZE_2D_Y  32

/*************************/
/* TEST KERNEL FUNCTIONS */
/*************************/
__global__ void test1(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {

    const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::seq, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);

}

__global__ void test2(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols) {

    const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::device, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);

}

/********/
/* MAIN */
/********/
int main() {

    const int Nrows = 64;
    const int Ncols = 2048;

    gpuErrchk(cudaFree(0));

//    size_t DevQueue;
//    gpuErrchk(cudaDeviceGetLimit(&DevQueue, cudaLimitDevRuntimePendingLaunchCount));
//    DevQueue *= 128;
//    gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, DevQueue));

    float *h_data       = (float *)malloc(Nrows * Ncols * sizeof(float));
    float *h_results    = (float *)malloc(Nrows *         sizeof(float));
    float *h_results1   = (float *)malloc(Nrows *         sizeof(float));
    float *h_results2   = (float *)malloc(Nrows *         sizeof(float));
    float sum = 0.f;
    for (int i=0; i<Nrows; i++) {
        h_results[i] = 0.f;
        for (int j=0; j<Ncols; j++) {
            h_data[i*Ncols+j] = i;
            h_results[i] = h_results[i] + h_data[i*Ncols+j];
        }
    }

    TimingGPU timerGPU;

    float *d_data;          gpuErrchk(cudaMalloc((void**)&d_data,     Nrows * Ncols * sizeof(float)));
    float *d_results1;      gpuErrchk(cudaMalloc((void**)&d_results1, Nrows         * sizeof(float)));
    float *d_results2;      gpuErrchk(cudaMalloc((void**)&d_results2, Nrows         * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_data, h_data, Nrows * Ncols * sizeof(float), cudaMemcpyHostToDevice));

    timerGPU.StartCounter();
    test1<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing approach nr. 1 = %f\n", timerGPU.GetCounter());

    gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<Nrows; i++) {
        if (h_results1[i] != h_results[i]) {
            printf("Approach nr. 1; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
            return 0;
        }
    }

    timerGPU.StartCounter();
    test2<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing approach nr. 2 = %f\n", timerGPU.GetCounter());

    gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<Nrows; i++) {
        if (h_results1[i] != h_results[i]) {
            printf("Approach nr. 2; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
            return 0;
        }
    }

    printf("Test passed!\n");

}

The above example performs reductions of the rows of a matrix in the same sense as "Reduce matrix rows with CUDA", but it is done differently from the above post, namely, by calling CUDA Thrust primitives directly from user-written kernels. Also, the above example serves to compare the performance of the same operations when done with two execution policies, namely, thrust::seq and thrust::device. Below are some graphs showing the difference in performance.

[Figure: Timings]

[Figure: Speedups]

The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.

墨离汐 2024-11-04 03:43:22


I would like to provide an updated answer to this question.

Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.

If you want parallel execution from within a thread, then you may consider using CUB, which provides reduction routines that can be called from within a thread block, provided that your card supports dynamic parallelism.
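
For reference, a minimal sketch of that CUB route (block-wide reduction; the kernel name, BLOCK_THREADS value and striding scheme are illustrative, not taken from the original answer):

#include <cub/cub.cuh>

// One thread block cooperatively sums N values with cub::BlockReduce
template <int BLOCK_THREADS>
__global__ void block_sum(const float * __restrict__ d_in, float * __restrict__ d_out, const int N) {

    using BlockReduce = cub::BlockReduce<float, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread first accumulates a strided partial sum
    float thread_sum = 0.f;
    for (int i = threadIdx.x; i < N; i += BLOCK_THREADS) thread_sum += d_in[i];

    // Cooperative block-wide reduction; the aggregate is only valid in thread 0
    float aggregate = BlockReduce(temp_storage).Sum(thread_sum);
    if (threadIdx.x == 0) *d_out = aggregate;

}

// Launch example: block_sum<256><<<1, 256>>>(d_A, d_sum, N);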

Here is the example with Thrust

#include <stdio.h>

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__ void test(float *d_A, int N) {

    float sum = thrust::reduce(thrust::seq, d_A, d_A + N);

    printf("Device side result = %f\n", sum);

}

int main() {

    const int N = 16;

    float *h_A = (float*)malloc(N * sizeof(float));
    float sum = 0.f;
    for (int i=0; i<N; i++) {
        h_A[i] = i;
        sum = sum + h_A[i];
    }
    printf("Host side result = %f\n", sum);

    float *d_A; gpuErrchk(cudaMalloc((void**)&d_A, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));

    test<<<1,1>>>(d_A, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());   // ensure the in-kernel printf is flushed before exit

}
我为君王 2024-11-04 03:43:22


If you mean to use the data allocated/processed by Thrust, yes you can; just get the raw pointer of the allocated data.

int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);

If you want to allocate Thrust vectors inside a kernel, I have never tried it, but I don't think it will work; and even if it does work, I don't think it would provide any benefit.

始终不够爱げ你 2024-11-04 03:43:22


Nowadays Thrust comes as a part of the CCCL (CUDA C++ Core Libraries), which also includes libcu++ with its non-owning cuda::std::span. Sadly, interfacing between a Thrust vector and a libcu++ span still needs that ugly thrust::raw_pointer_cast. But passing a cuda::std::span with pointer and size from a Thrust vector to a custom kernel or device functor is the closest we will probably get to passing the actual vector, because span has all the nice member functions we are used to, like .begin(), .end(), .size(), etc., and the cuda:: version marks them __host__ __device__ as well, so they can be used both in host and in device code. Unlike thrust::device_vector, which goes through thrust::device_ptr, the span does not mediate access to device memory from host code.

#include <thrust/device_vector.h>

#include <cuda/std/span>

int main() {
    thrust::device_vector<float> my_vec(10);
    cuda::std::span<float> my_span{thrust::raw_pointer_cast(my_vec.data()),
                                   my_vec.size()};
    // ... (use my_span in device code)
}

I hope that CCCL will be updated to allow directly initializing a cuda::std::span with a Thrust vector, i.e. cuda::std::span<float> my_span{my_vec};. While span already has a constructor taking a range (i.e. passing e.g. a std::vector should work), it still struggles with Thrust's wrappers like thrust::device_ptr.
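
In the meantime, a tiny helper can at least hide the cast. Here is a sketch, where make_device_span is a name of my own, not part of CCCL:

#include <thrust/device_vector.h>

#include <cuda/std/span>

// Hypothetical helper (not part of CCCL): view a device_vector's storage as a span
template <typename T>
cuda::std::span<T> make_device_span(thrust::device_vector<T>& v) {
    return {thrust::raw_pointer_cast(v.data()), v.size()};
}

int main() {
    thrust::device_vector<float> my_vec(10);
    auto my_span = make_device_span(my_vec);
    // ... (pass my_span to kernels or device functors)
}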
