CUDA reduction using registers

Posted on 2024-11-25 07:59:45

I need to calculate the mean values of N signals using a reduction. The input is a 1D array of size M*N, where M is the length of each signal.

Originally I used additional shared memory: I first copied the data into it and then ran the reduction on each signal. However, the original data ended up corrupted.

My program tries to minimize shared-memory use, so I was wondering how I can use registers to do a sum reduction over N signals. I have N threads and one shared-memory array (float) s_m[N*M], where elements 0...M-1 are the first signal, and so on.

Do I need N registers (or just one) to store the mean values of the N different signals? (I know how to do it with sequential addition using multi-threaded programming and 1 register.) The next step I want to do is to subtract each signal's mean from every input value belonging to that signal.

Comments (2)

风月客 2024-12-02 07:59:45

Your problem is very small (N = 32 and M < 128). However, some guidelines:

Assuming you are reducing across M values for each of N signals:

  • If N is very large (more than tens of thousands), just do the reductions over M sequentially in each thread.
  • If N is less than tens of thousands, consider using one warp or one thread block to perform each of the N reductions.
  • If N is very small but M is very large, consider using multiple thread blocks for each of the N reductions.
  • If N is very small and M is very small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and/or consume the input/output of the reductions are also running on the GPU.
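
As a rough illustration of the second guideline (one warp per reduction), here is a minimal sketch that keeps every partial sum in a register and combines them with warp shuffles, so no shared memory is needed. It assumes CUDA 9 or later (for `__shfl_down_sync` and `__shfl_sync`) and a row-major [N*M] layout. The names `removeMeanWarpPerSignal` and `removeMeans` are made up for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One warp (32 threads) per signal. Each lane sums a strided slice of the
// signal in a register, then the lanes combine their partial sums with
// warp shuffles. Assumes the kernel is launched with blockDim.x == 32.
__global__ void removeMeanWarpPerSignal(float *signals, int M)
{
    const int lane = threadIdx.x;                         // 0..31
    float *signal = signals + (size_t)blockIdx.x * M;     // one block per signal

    // Per-lane partial sum, held in a register.
    float sum = 0.0f;
    for (int i = lane; i < M; i += 32)
        sum += signal[i];

    // Tree reduction across the warp; lane 0 ends up with the full sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    // Broadcast lane 0's total to every lane and turn it into the mean.
    const float mean = __shfl_sync(0xffffffffu, sum, 0) / M;

    // Subtract the mean from every sample of this signal.
    for (int i = lane; i < M; i += 32)
        signal[i] -= mean;
}

// Hypothetical host helper: copies N signals of length M to the device,
// removes each signal's mean, and returns the result (no error checking).
std::vector<float> removeMeans(std::vector<float> host, int N, int M)
{
    float *dev = nullptr;
    const size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    removeMeanWarpPerSignal<<<N, 32>>>(dev, M);           // N blocks, one warp each
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `removeMeans(data, N, M)` from host code returns the mean-removed signals. For sizes as small as N = 32 and M < 128 the launch and copy overhead will likely dominate, which is the point of the last guideline.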

酒解孤独 2024-12-02 07:59:45

Based on my understanding of the question, you don't need N registers to store the mean values of N different signals.

If you already have N threads (with each thread reducing exactly one signal), then each thread needs only one register to hold its signal's running sum and mean. Registers are private to each thread, so the N threads together hold the N means.

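// N and M are assumed to be compile-time constants.
// Launch one block of N threads; signals is the [N*M] float array on the device.
dim3 threads(N, 1);
reduction<<<1, threads>>>(signals);

__global__ void reduction(float *signals)
{
   int id = threadIdx.x;        // one thread per signal
   float meanValue = 0.0f;      // per-thread accumulator, kept in a register

   // Sequentially sum the M samples of this thread's signal
   for (int i = 0; i < M; i++)
          meanValue += signals[id*M + i];

   meanValue = meanValue / M;

   // Then subtract the mean from every sample of the signal
   for (int i = 0; i < M; i++)
          signals[id*M + i] -= meanValue;
}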

If you need some kind of global reduction over the mean values of all N signals, then you need two registers (one to store the local mean and another to store the global mean) plus shared memory:

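// N and M are assumed to be compile-time constants.
// Launch one block of N threads; signals is the [N*M] float array on the device.
dim3 threads(N, 1);
reduction<<<1, threads>>>(signals);

__global__ void reduction(float *signals)
{
   __shared__ float means[N];   // one mean per signal, visible to the whole block
   int id = threadIdx.x;        // one thread per signal
   float meanValue = 0.0f;      // local sum/mean, kept in a register
   float globalMean = 0.0f;     // mean of all N means, kept in a register

   // Sequentially sum the M samples of this thread's signal
   for (int i = 0; i < M; i++)
          meanValue += signals[id*M + i];

   means[id] = meanValue / M;

   // Wait until every thread has written its mean
   __syncthreads();

   // Do the global reduction (each thread redundantly sums all N means)
   for (int i = 0; i < N; i++)
          globalMean += means[i];

   globalMean = globalMean / N;

   // Then subtract the global mean from every sample
   for (int i = 0; i < M; i++)
          signals[id*M + i] -= globalMean;
}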
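
Since the goal is to minimize shared memory, here is a hedged variation on the kernel above for the special case where N is exactly 32, so all threads sit in a single warp. The global mean is then computed with a butterfly warp shuffle instead of a shared array. It assumes CUDA 9 or later for `__shfl_xor_sync`. The names `removeGlobalMean` and `removeGlobalMeanHost` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 32;   // number of signals; assumed to equal the warp size

// One thread per signal, one warp in total. Launch as <<<1, N>>>.
__global__ void removeGlobalMean(float *signals, int M)
{
    const int id = threadIdx.x;

    // Local mean of this thread's signal, accumulated in a register.
    float meanValue = 0.0f;
    for (int i = 0; i < M; i++)
        meanValue += signals[id * M + i];
    meanValue /= M;

    // Butterfly (XOR) reduction: after log2(32) = 5 steps every lane
    // holds the sum of all 32 local means, so no broadcast is needed.
    float globalMean = meanValue;
    for (int offset = 16; offset > 0; offset >>= 1)
        globalMean += __shfl_xor_sync(0xffffffffu, globalMean, offset);
    globalMean /= N;

    // Subtract the global mean from every sample of this signal.
    for (int i = 0; i < M; i++)
        signals[id * M + i] -= globalMean;
}

// Hypothetical host helper: runs the kernel on N signals of length M
// stored back to back in `host` (no error checking).
std::vector<float> removeGlobalMeanHost(std::vector<float> host, int M)
{
    float *dev = nullptr;
    const size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    removeGlobalMean<<<1, N>>>(dev, M);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

If N is larger than one warp, the shared-memory version above (or a per-warp shuffle followed by a small shared array of warp totals) is still needed, because shuffles only exchange data within a warp.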

I hope this helps. If anything is unclear, let me know.
