
Reduction in CUDA using registers

Posted 2024-11-25 07:59:45

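I need to use a reduction to compute the mean of each of N signals. The input is a 1D array of size M*N, where M is the length of each signal.

Initially, I used additional shared memory to first copy the data and then reduce each signal. However, the original data was corrupted.

My program tries to minimize shared memory use, so I would like to know how to use registers to sum the N signals. I have N threads and one shared-memory array (float) s_m[N*M], where elements 0 ... M-1 hold the first signal, and so on.

Do I need N registers (or just one) to store the means of the N different signals? (I know how to do sequential addition with multithreaded programming and 1 register.) The next step I want to take is to subtract from every value in the input the mean of its corresponding signal.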


风月客 2024-12-02 07:59:45

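Your problem is very small (N = 32 and M < 128). However, some guidelines:

Assume you are reducing M values for each of N signals.

If N is very large (more than tens of thousands), just do the reduction over M sequentially in each thread.
If N is less than tens of thousands, consider using one warp or one thread block to perform each of the N reductions.
If N is very small but M is very large, consider using multiple thread blocks for each of the N reductions.
If N is very small and M is very small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and/or consume the input/output of the reductions also run on the GPU.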
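
The "one warp per reduction" guideline can be sketched as follows. This is only an illustration under stated assumptions: the kernel name `subtractMeanWarp` and the host helper `subtractMeansWarp` are hypothetical, each block is assumed to contain exactly 32 threads, error checking is omitted, and the warp shuffle intrinsic `__shfl_xor_sync` is assumed to be available (CUDA 9 or later). Partial sums live in registers only, so no shared memory is needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one warp (one 32-thread block) reduces one signal.
__global__ void subtractMeanWarp(float *signals, int M)
{
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x;                               // 0..31 (assumes blockDim.x == 32)
    float *signal = signals + (size_t)blockIdx.x * M;     // this warp's signal

    // Each lane accumulates a strided partial sum in a register.
    float sum = 0.0f;
    for (int i = lane; i < M; i += 32)
        sum += signal[i];

    // Butterfly exchange: afterwards every lane holds the full sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_xor_sync(FULL_MASK, sum, offset);

    float mean = sum / M;

    // Subtract the mean from every sample of this signal.
    for (int i = lane; i < M; i += 32)
        signal[i] -= mean;
}

// Hypothetical host helper: copy in, launch N warps, copy back.
std::vector<float> subtractMeansWarp(std::vector<float> h, int N, int M)
{
    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    subtractMeanWarp<<<N, 32>>>(d, M);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Because consecutive lanes read consecutive elements, the loads within a warp are contiguous, which is usually friendlier to global memory than one thread walking a whole signal.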


酒解孤独 2024-12-02 07:59:45

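As I understand the question, you do not need N registers to store the means of N different signals.

If you already have N threads, with each thread reducing exactly one signal, then you do not need N registers to hold the reduction of one signal. Each thread needs only one register to store its mean.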

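// N (number of signals) and M (signal length) are assumed to be compile-time constants.
dim3 threads (N,1);
reduction<<<1,threads>>>(signals);  // one block of N threads; signals is the [N*M] array

__global__ void reduction (float *signals)
{
   int id = threadIdx.x;          // one thread per signal
   float meanValue = 0.0f;        // a single register accumulates the sum

   for(int i = 0; i < M; i++)
          meanValue += signals[id*M +i];

   meanValue =  meanValue/M;

   // Then do the subtraction
   for(int i = 0; i < M; i++)
          signals[id*M +i] -= meanValue;
}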
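
For completeness, a self-contained version of the one-thread-per-signal idea, including the host side, might look like the sketch below. The names `subtractMeanPerThread` and `subtractMeansPerThread`, the choice to pass N and M as runtime arguments, and the block size of 128 are all assumptions made for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread reduces one whole signal sequentially.
__global__ void subtractMeanPerThread(float *signals, int N, int M)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // signal index
    if (id >= N) return;                              // guard the last block

    float mean = 0.0f;                                // running sum kept in a register
    for (int i = 0; i < M; i++)
        mean += signals[id * M + i];
    mean /= M;

    // Subtract the mean from every sample of this thread's signal.
    for (int i = 0; i < M; i++)
        signals[id * M + i] -= mean;
}

// Hypothetical host helper: copy in, launch enough threads for N signals, copy back.
std::vector<float> subtractMeansPerThread(std::vector<float> h, int N, int M)
{
    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    int block = 128;                                  // assumed block size
    int grid = (N + block - 1) / block;
    subtractMeanPerThread<<<grid, block>>>(d, N, M);

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

Each thread touches only its own signal, so no synchronization or shared memory is required.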

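If you instead need a kind of global reduction over the means of all N signals, then you need two registers per thread (one for the local mean, one for the global mean) plus shared memory: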

// N (number of signals) and M (signal length) are assumed to be compile-time constants.
dim3 threads (N,1);
reduction<<<1,threads>>>(signals);  // one block of N threads; signals is the [N*M] array

__global__ void reduction (float *signals)
{
   __shared__ float means[N];      // one local mean per signal, visible to the whole block
   int id = threadIdx.x;
   float meanValue = 0.0f;
   float globalMean = 0.0f;

   for(int i = 0; i < M; i++)
          meanValue += signals[id*M +i];

   means[id] =  meanValue/M;

   __syncthreads();                // all local means must be written before they are read

   // do the global reduction (every thread sums all N means)
   for(int i = 0; i < N; i++)
          globalMean += means[i];

   globalMean = globalMean/N;

   // Then do the subtraction
   for(int i = 0; i < M; i++)
          signals[id*M +i] -= globalMean;
}
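
If the goal is to avoid shared memory entirely, the global mean can also be combined with warp shuffles instead of a shared array. The sketch below swaps in that different technique and is not the method shown above. It assumes exactly N = 32 signals handled by a single warp; the names `subtractGlobalMeanWarp` and `subtractGlobalMean32` are hypothetical, error checking is omitted, and `__shfl_xor_sync` is assumed to be available (CUDA 9 or later).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: 32 threads, one per signal, global mean kept in registers.
__global__ void subtractGlobalMeanWarp(float *signals, int M)
{
    int id = threadIdx.x;                 // assumes a single block of exactly 32 threads

    // Local mean of this thread's signal, accumulated in a register.
    float localMean = 0.0f;
    for (int i = 0; i < M; i++)
        localMean += signals[id * M + i];
    localMean /= M;

    // Butterfly exchange across the warp: every lane ends with the sum of all 32 means.
    float total = localMean;
    for (int offset = 16; offset > 0; offset >>= 1)
        total += __shfl_xor_sync(0xffffffffu, total, offset);

    float globalMean = total / 32.0f;

    // Subtract the global mean from every sample of this thread's signal.
    for (int i = 0; i < M; i++)
        signals[id * M + i] -= globalMean;
}

// Hypothetical host helper for 32 signals of length M.
std::vector<float> subtractGlobalMean32(std::vector<float> h, int M)
{
    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    subtractGlobalMeanWarp<<<1, 32>>>(d, M);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

For N larger than one warp, some cross-warp communication (shared memory or a second kernel) would still be needed.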

I hope this helps. Let me know if you have any questions.

