CUDA: Streaming the same memory location to all threads

Posted on 2024-12-13 03:44:37

Here's my problem: I have a fairly large set of doubles (an array of 77,500 doubles) that needs to be stored somewhere on the CUDA device. I need a large number of threads to sequentially perform a series of operations on that array. Every thread has to read the SAME element of the array, perform its task, store the result in shared memory, and then read the next element. Note that all threads read (and only read) the same memory location at the same time. So I wonder: is there any way to broadcast the same double to all threads with a single memory read? Reading it many times seems wasteful. Any ideas?

Comments (1)

下雨或天晴 2024-12-20 03:44:37

This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:

// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
  __shared__ double window[block_size];

  // cooperate with my block to load block_size elements
  window[threadIdx.x] = ptr[threadIdx.x];

  // wait until the window is full
  __syncthreads();

  // operate on the data
  ...
}

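You can iteratively "slide" the window across the array, block_size elements at a time (or maybe some integer multiple of that), to consume the whole thing. Call __syncthreads() again before refilling the window so that no thread overwrites elements its blockmates are still reading. The same technique applies when you'd like to store the data back in a synchronized fashion.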
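To make the sliding concrete, here is a minimal sketch of the full loop. It is not the only way to do this, and several things in it are assumptions made for illustration: the names sliding_window_kernel, run_sliding_window and check are hypothetical, the per-thread "task" is just a weighted sum standing in for your real work, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block. Inside the inner loop every thread of a warp reads the same shared-memory address, which the hardware serves as a broadcast rather than as a bank conflict.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed block size; the kernel must be launched with exactly this many threads per block.
constexpr unsigned int BLOCK_SIZE = 256;

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err)
{
  if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Every thread visits every element of data in order. The placeholder task is
// out[tid] = sum_i (tid + 1) * data[i]; substitute the real per-element work.
__global__ void sliding_window_kernel(const double *data, size_t n, double *out)
{
  __shared__ double window[BLOCK_SIZE];

  const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const double weight = static_cast<double>(tid + 1);
  double acc = 0.0;

  // Slide the window across the array, BLOCK_SIZE elements at a time.
  for (size_t base = 0; base < n; base += BLOCK_SIZE) {
    // Cooperative, coalesced load: one element per thread, guarded at the tail.
    const size_t idx = base + threadIdx.x;
    window[threadIdx.x] = (idx < n) ? data[idx] : 0.0;

    // Wait until the window is full.
    __syncthreads();

    // Number of valid elements in this window (smaller for the last one).
    const size_t remaining = n - base;
    const unsigned int count =
        remaining < BLOCK_SIZE ? static_cast<unsigned int>(remaining) : BLOCK_SIZE;

    // All threads read window[j] for the same j: a shared-memory broadcast.
    for (unsigned int j = 0; j < count; ++j) {
      acc += weight * window[j];
    }

    // Make sure everyone is done reading before the window is overwritten.
    __syncthreads();
  }

  out[tid] = acc;
}

// Hypothetical host wrapper: copies the data over, runs the kernel, returns one result
// per thread. For brevity, device memory is not freed if an error is thrown.
std::vector<double> run_sliding_window(const std::vector<double> &host_data,
                                       unsigned int num_blocks)
{
  const size_t n = host_data.size();
  const size_t num_threads = static_cast<size_t>(num_blocks) * BLOCK_SIZE;

  double *d_data = nullptr;
  double *d_out = nullptr;
  check(cudaMalloc(&d_data, n * sizeof(double)));
  check(cudaMalloc(&d_out, num_threads * sizeof(double)));
  check(cudaMemcpy(d_data, host_data.data(), n * sizeof(double), cudaMemcpyHostToDevice));

  sliding_window_kernel<<<num_blocks, BLOCK_SIZE>>>(d_data, n, d_out);
  check(cudaGetLastError());

  // This blocking copy also waits for the kernel to finish.
  std::vector<double> result(num_threads);
  check(cudaMemcpy(result.data(), d_out, num_threads * sizeof(double), cudaMemcpyDeviceToHost));

  check(cudaFree(d_data));
  check(cudaFree(d_out));
  return result;
}
```

With this layout each block reads every array element from global memory once, and the per-thread reads all hit shared memory. If the real task keeps per-thread results in shared memory as described in the question, that storage comes on top of the window, so the block size may need adjusting to fit.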
