CUDA: Sum of results

Posted on 2024-10-05 20:04:33

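I'm using CUDA to run a problem where I need to evaluate a complex equation over many input matrices. Each matrix has an ID that depends on the set it belongs to (IDs range from 1 to 30, and there are 100,000 matrices), and the result for each matrix is stored in a float[N] array, where N is the number of input matrices.

After this, the result I want is the sum of the floats in this array for each ID, so with 30 IDs there are 30 result floats.

Any suggestions on how I should do this?

Right now, I read the float array (400 KB) back from the device to the host and run this on the host: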

// Allocate result_array for 100,000 floats on the device
// CUDA process input matrices
// Read from the device back to the host into result_array
// IDs run from 1 to 30, so allocate 31 slots and leave index 0 unused
float result[31] = { 0 };
for (int i = 0; i < N; i++)
{
    result[input[i].ID] += result_array[i];
}

But I'm wondering if there's a better way.
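
One possible on-device alternative is sketched below, under a few assumptions that are not in the question: the IDs are available on the device as a separate int array (`ids`) rather than inside the `input` structs, the per-matrix results are already in device memory, and the GPU supports `atomicAdd` on `float` (compute capability 2.0 or later). Each block accumulates its own 30 partial sums in shared memory and then merges them into the global totals, so only 30 floats need to be copied back instead of 100,000. The names `sum_by_id_kernel` and `sum_by_id` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: IDs run from 1 to 30, as the question states.
constexpr int NUM_IDS = 30;

// Each block accumulates partial sums in shared memory, then merges
// them into the global per-ID totals with one atomicAdd per ID.
__global__ void sum_by_id_kernel(const float* values, const int* ids,
                                 int n, float* sums)
{
    __shared__ float block_sums[NUM_IDS];

    // Zero the per-block accumulators.
    for (int k = threadIdx.x; k < NUM_IDS; k += blockDim.x)
        block_sums[k] = 0.0f;
    __syncthreads();

    // Grid-stride loop over the per-matrix results.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        int id = ids[i];
        if (id >= 1 && id <= NUM_IDS)  // skip out-of-range IDs
            atomicAdd(&block_sums[id - 1], values[i]);
    }
    __syncthreads();

    // Merge this block's partial sums into the global totals.
    for (int k = threadIdx.x; k < NUM_IDS; k += blockDim.x)
        atomicAdd(&sums[k], block_sums[k]);
}

// Hypothetical host wrapper: d_values and d_ids are device pointers of
// length n. Returns 30 sums; element k holds the total for ID k + 1.
std::vector<float> sum_by_id(const float* d_values, const int* d_ids, int n)
{
    float* d_sums = nullptr;
    cudaMalloc(&d_sums, NUM_IDS * sizeof(float));
    cudaMemset(d_sums, 0, NUM_IDS * sizeof(float));

    const int threads = 256;
    const int blocks = 64;  // arbitrary; the grid-stride loop covers any n
    sum_by_id_kernel<<<blocks, threads>>>(d_values, d_ids, n, d_sums);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> sums(NUM_IDS);
    cudaMemcpy(sums.data(), d_sums, NUM_IDS * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_sums);
    return sums;
}
```

Because the order of atomic additions is not fixed and floating-point addition is not associative, the totals can differ in the last bits from run to run and from the host loop. Whether this is actually faster than copying 400 KB back and summing on the host would need to be measured for the specific hardware.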
