CUDA: streaming the same memory location to all threads

Posted on 2024-12-13 03:44:37

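Here is my problem: I have a fairly large set of doubles (an array of 77,500 doubles) to store somewhere in CUDA. I need a large number of threads to perform a series of operations on that array sequentially. Every thread has to read the same element of the array, perform its task, store the result in shared memory, and then read the next element of the array. Note that every thread has to read (read only) from the same memory location at the same time. So I wonder: is there a way to broadcast the same double to all threads with a single memory read? Reading it many times would be wasteful. Any ideas?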

下雨或天晴 2024-12-20 03:44:37

This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:

// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(const double *ptr)
{
  __shared__ double window[block_size];

  // cooperate with my block to load block_size elements
  // (assumes blockDim.x == block_size and that ptr holds at least block_size elements)
  window[threadIdx.x] = ptr[threadIdx.x];

  // wait until the window is full
  __syncthreads();

  // operate on the data
  // ...
}

You can iteratively "slide" the window across the array, block_size elements at a time (or some integer multiple of that), to consume the whole thing. Synchronize again before each reload so that no thread is still reading the previous window when it gets overwritten. The same technique applies when you'd like to store the data back in a synchronized fashion.
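To make the sliding step concrete, here is a minimal sketch of the whole loop. It is not the answerer's code: the kernel name `sliding_window_kernel`, the host helper `run_sliding_window`, the output array `out`, and the per-thread `weight` are all hypothetical stand-ins for whatever work each thread really does. It assumes the kernel is launched with exactly `block_size` threads per block, and it leaves out CUDA error checking for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int block_size = 256;

// Every thread visits every element of `data` in order. The block loads one
// window of block_size elements cooperatively, then each thread walks the
// window in shared memory. Assumes blockDim.x == block_size.
__global__ void sliding_window_kernel(const double *data, int n, double *out)
{
  __shared__ double window[block_size];

  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const double weight = tid + 1.0;  // hypothetical per-thread parameter
  double acc = 0.0;

  for (int base = 0; base < n; base += block_size) {
    const int idx = base + threadIdx.x;

    // one global load per thread; pad the last window with zeros
    window[threadIdx.x] = (idx < n) ? data[idx] : 0.0;
    __syncthreads();  // wait until the window is full

    const int count = min((int)block_size, n - base);
    for (int j = 0; j < count; ++j) {
      // all threads of the block read the same shared-memory element
      acc += weight * window[j];
    }
    __syncthreads();  // everyone is done before the window is overwritten
  }

  out[tid] = acc;
}

// Hypothetical host helper: runs the kernel on num_blocks blocks and returns
// one result per thread. Error checking is omitted to keep the sketch short.
std::vector<double> run_sliding_window(const std::vector<double> &host_data,
                                       int num_blocks)
{
  const int n = (int)host_data.size();
  const int num_threads = num_blocks * (int)block_size;

  double *d_data = nullptr;
  double *d_out = nullptr;
  cudaMalloc(&d_data, n * sizeof(double));
  cudaMalloc(&d_out, num_threads * sizeof(double));
  cudaMemcpy(d_data, host_data.data(), n * sizeof(double),
             cudaMemcpyHostToDevice);

  sliding_window_kernel<<<num_blocks, block_size>>>(d_data, n, d_out);

  std::vector<double> result(num_threads);
  cudaMemcpy(result.data(), d_out, num_threads * sizeof(double),
             cudaMemcpyDeviceToHost);

  cudaFree(d_data);
  cudaFree(d_out);
  return result;
}

int main()
{
  // 77,500 doubles, as in the question; all ones so the sums are easy to check
  std::vector<double> data(77500, 1.0);
  std::vector<double> r = run_sliding_window(data, 2);
  assert(r.size() == 2 * block_size);
  printf("thread 0: %.1f, thread 511: %.1f\n", r[0], r[511]);
  return 0;
}
```

With this layout each block reads every array element from global memory once, rather than once per thread, and the inner loop only touches shared memory. The second `__syncthreads()` is what makes the reload safe: without it, a fast thread could overwrite `window` while a slower one is still reading it.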
