CUDA reduction using registers
I need to calculate the mean value of each of N signals using a reduction. The input is a 1D array of size M*N, where M is the length of each signal.
Originally, I used additional shared memory to first copy the data and then do the reduction on each signal. However, the original data ends up corrupted.
My program tries to minimize shared memory use, so I was wondering how I can use registers to do a sum reduction over N signals. I have N threads and a shared memory array (float) s_m[N*M], where elements 0 through M-1 hold the first signal, and so on.
Do I need N registers (or just one) to store the mean values of the N different signals? (I know how to do it with sequential addition using multiple threads and one register.) The next step is to subtract from every input value the mean of its corresponding signal.
Comments (2)
Your problem is very small (N = 32 and M < 128). However, some guidelines:
Assuming you are reducing across N values for each of N threads.
Based on my understanding of the question, you don't need N registers to store the mean values of N different signals.
If you already have N threads (with each thread reducing only one signal), then you don't need N registers to store the reduction of one signal. Each thread needs just one register to store its own signal's mean.
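The one-register-per-thread idea can be sketched as follows. This is only an illustration under stated assumptions, not the asker's actual code: the kernel name subtractSignalMeans and the host helper centerSignals are hypothetical, a single block of N threads is launched, and the N*M samples fit in dynamic shared memory (true for N = 32 and M < 128). Each thread accumulates its signal's sum in one local variable, which the compiler will normally keep in a register, and then subtracts the resulting mean in place.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one block, N threads, thread t owns signal t.
// Shared memory layout follows the question: s_m[0..M-1] is signal 0, etc.
__global__ void subtractSignalMeans(float *data, int N, int M)
{
    extern __shared__ float s_m[];              // N*M floats, sized at launch
    int t = threadIdx.x;

    // Cooperative copy from global to shared memory.
    for (int i = t; i < N * M; i += blockDim.x)
        s_m[i] = data[i];
    __syncthreads();

    if (t < N) {
        float sum = 0.0f;                       // per-thread accumulator
        for (int j = 0; j < M; ++j)
            sum += s_m[t * M + j];
        float mean = sum / M;                   // mean of signal t only

        // Next step from the question: subtract the mean from each sample.
        for (int j = 0; j < M; ++j)
            s_m[t * M + j] -= mean;
    }
    __syncthreads();

    // Write the centered signals back to global memory.
    for (int i = t; i < N * M; i += blockDim.x)
        data[i] = s_m[i];
}

// Hypothetical host helper: centers an N*M host array in place.
bool centerSignals(std::vector<float> &h, int N, int M)
{
    float *d = nullptr;
    size_t bytes = h.size() * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    subtractSignalMeans<<<1, N, bytes>>>(d, N, M);
    bool ok = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}
```

Because no thread writes to another thread's signal and the sum lives in a local variable, the input samples are only modified by the final subtraction. One caveat: threads read shared memory with a stride of M words, so an M that is a multiple of 32 would likely cause bank conflicts; padding each signal by one element is a common workaround.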
If you need a global reduction over the mean values of all N signals, then you need two registers per thread (one for the local mean and one for the global mean) plus shared memory.
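For the two-register case, here is a minimal sketch. Note that it swaps in a different technique from the one described above: instead of exchanging the partial results through shared memory, it uses warp shuffles (__shfl_xor_sync, available since CUDA 9), which works here only under the assumption that N is exactly 32, so that all threads form one warp. The kernel name meanOfSignalMeans and the host helper computeMeans are hypothetical, and the signals are read straight from global memory to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: launched with exactly one warp (32 threads).
// Thread t reduces signal t, then the warp combines the 32 means.
__global__ void meanOfSignalMeans(const float *data, int M,
                                  float *localMeans, float *globalMean)
{
    int t = threadIdx.x;

    float local = 0.0f;                 // register 1: mean of signal t
    for (int j = 0; j < M; ++j)
        local += data[t * M + j];
    local /= M;

    float global = local;               // register 2: becomes the global mean
    // Butterfly reduction: after 5 steps every lane holds the sum of all 32.
    for (int offset = 16; offset > 0; offset >>= 1)
        global += __shfl_xor_sync(0xffffffffu, global, offset);
    global /= 32.0f;

    localMeans[t] = local;
    if (t == 0) *globalMean = global;
}

// Hypothetical host helper: h holds 32 signals of length M, signal-major.
bool computeMeans(const std::vector<float> &h, int M,
                  std::vector<float> &localMeans, float &globalMean)
{
    float *dData = nullptr, *dLocal = nullptr, *dGlobal = nullptr;
    size_t bytes = h.size() * sizeof(float);
    if (cudaMalloc(&dData, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dLocal, 32 * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dGlobal, sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dData, h.data(), bytes, cudaMemcpyHostToDevice);
    meanOfSignalMeans<<<1, 32>>>(dData, M, dLocal, dGlobal);
    localMeans.resize(32);
    cudaMemcpy(localMeans.data(), dLocal, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = cudaMemcpy(&globalMean, dGlobal, sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dData); cudaFree(dLocal); cudaFree(dGlobal);
    return ok;
}
```

With more than one warp per block, the shuffle step would only combine values within each warp, and a small shared memory buffer (one float per warp) would still be needed to combine the per-warp results.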
I hope this helps. If anything is unclear, let me know.