Large constant arrays in global memory

Posted on 2024-11-29 11:10:16


Is it possible to increase performance by running an algorithm with the following properties on a GPU:

  1. There are hundreds or even thousands of independent threads, which do not require any synchronization during calculations
  2. Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data. Read/write
  3. Each thread accesses a large memory block (hundreds of megabytes or even gigabytes). This memory is read-only
  4. For each access to global memory, there will be at least two accesses to local memory
  5. There will be a lot of branches in the algorithm

Unfortunately, the algorithm is too complicated to show here.


Comments (2)

二智少女 2024-12-06 11:10:16


My instinct is to use texture memory aggressively. The caching benefits will beat uncoalesced global memory reads by a mile.
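A minimal sketch of that suggestion, assuming the read-only block is a plain float array. I'm using the texture-object API here (the era of this answer used texture references, but the idea is the same); all names and sizes are placeholders, not from the question:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(cudaTextureObject_t tex, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Reads go through the texture cache, which tolerates
        // uncoalesced access patterns better than plain global loads.
        out[i] = tex1Dfetch<float>(tex, i);
    }
}

int main() {
    const int n = 1 << 20;                // hypothetical array size
    float *d_data, *d_out;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_out,  n * sizeof(float));

    // Describe the device buffer as the texture's backing storage.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    kernel<<<(n + 255) / 256, 256>>>(tex, d_out, n);

    cudaDestroyTextureObject(tex);
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```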

For the writes, you may need to add some padding, etc., to avoid bank conflicts.
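As an illustration of the padding trick, here is the classic shared-memory tile with one extra column, assuming a 32×32 thread block and a matrix width divisible by 32; the +1 keeps column-wise accesses in distinct banks:

```cuda
#define TILE 32   // assumes a 32x32 thread block and width % 32 == 0

__global__ void transpose_tile(const float *in, float *out, int width) {
    // TILE + 1 columns: the extra column shifts each row by one bank,
    // so the column-wise reads below hit 32 distinct banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```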

The reliance on hundreds of megs or gigs of data is somewhat concerning. Can you carve it up somehow? Hope you have a big, beefy Tesla/Quadro with oodles of RAM.
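One hedged sketch of "carving it up": stream a dataset that is too big for the card through a fixed-size device buffer, one chunk at a time. The chunk size and the per-element doubling are placeholders:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

__global__ void process_chunk(const float *chunk, float *out, size_t count) {
    // Grid-stride loop, so the launch size need not match the chunk size.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = chunk[i] * 2.0f;          // placeholder per-element work
}

void process_all(const float *host_in, float *host_out, size_t total) {
    const size_t CHUNK = 16 * 1024 * 1024; // 64 MB of floats per slice
    float *d_in, *d_out;
    cudaMalloc(&d_in,  CHUNK * sizeof(float));
    cudaMalloc(&d_out, CHUNK * sizeof(float));

    for (size_t off = 0; off < total; off += CHUNK) {
        size_t count = (total - off < CHUNK) ? (total - off) : CHUNK;
        cudaMemcpy(d_in, host_in + off, count * sizeof(float),
                   cudaMemcpyHostToDevice);
        process_chunk<<<1024, 256>>>(d_in, d_out, count);
        cudaMemcpy(host_out + off, d_out, count * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
}
```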

That said, the name of the game for CUDA optimization is always to experiment, profile/measure, rinse and repeat.

北座城市 2024-12-06 11:10:16


Before I start, please remember that there are two layers of parallelism in CUDA: blocks and threads.

There are hundreds or even thousands of independent threads, which do not require any synchronization during calculations

Since you can launch as many as 65535 blocks per dimension, you can treat each block in CUDA as the equivalent of one of your "threads".
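A minimal sketch of that mapping (names are hypothetical): blockIdx.x picks the independent work item, so launching one block per item covers them all with no inter-block synchronization:

```cuda
__global__ void per_item(const float *big_readonly, float *out, int n_items) {
    int item = blockIdx.x;            // this block's logical "thread" id
    if (item < n_items && threadIdx.x == 0) {
        // Placeholder: real per-item work goes here, ideally spread
        // across the block's threads (see the last sketch in this answer).
        out[item] = big_readonly[item];
    }
}
// launch: per_item<<<n_items, 128>>>(d_big, d_out, n_items);
```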

Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data. Read/write

Unfortunately, most cards have a shared memory limit of 16 KB per block. So if you can figure out how to work within this lower limit, great. If not, you will need to use global memory accesses.
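If shared memory is too small, one common fallback, sketched here with hypothetical names, is to pre-allocate a global-memory scratch pool and give each block its own read/write slice. For simplicity this sizes the pool for every block; a real version might budget it more carefully, since 200 KB times thousands of blocks is substantial:

```cuda
#include <cstddef>

__global__ void with_scratch(float *scratch_pool, size_t scratch_len,
                             float *out) {
    // This block's private read/write region, carved out of one big
    // global allocation: pool slice number blockIdx.x.
    float *scratch = scratch_pool + (size_t)blockIdx.x * scratch_len;
    if (threadIdx.x == 0) {
        scratch[0] = 42.0f;              // placeholder per-item state
        out[blockIdx.x] = scratch[0];
    }
}
// host: cudaMalloc(&d_pool, (size_t)n_blocks * scratch_len * sizeof(float));
```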

Each thread accesses a large memory block (hundreds of megabytes or even gigabytes). This memory is read-only

You cannot bind such large arrays to textures or constant memory. So, within a given block, try to make the threads read contiguous chunks of data for the best performance.
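For what "contiguous" means in practice, a small sketch: consecutive threads read consecutive elements, so each warp's loads coalesce into a few memory transactions, while a strided pattern would not:

```cuda
#include <cstddef>

__global__ void coalesced_read(const float *big, float *out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = big[i];   // thread k of a warp reads element k: coalesced
    // A strided pattern such as big[i * 32] would instead scatter each
    // warp's loads across many memory segments and waste bandwidth.
}
```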

For each access to global memory, there will be at least two accesses to local memory

There will be a lot of branches in the algorithm

Since you are essentially replacing a single thread of your original implementation with a block in CUDA, you may want to revise the code a little to try to implement a parallel version of the "per thread code" too.
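As one hypothetical example of parallelizing the "per thread code": if each original thread summed over a slice of the read-only data, the block's threads can share that loop and combine partial sums with a standard tree reduction (this assumes a power-of-two block size of at most 128):

```cuda
#include <cstddef>

__global__ void item_sum(const float *data, float *results, int item_len) {
    __shared__ float partial[128];        // matches the 128-thread launch
    int item = blockIdx.x;                // one block per independent item

    // All threads of the block stride over this item's slice of the data.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < item_len; i += blockDim.x)
        sum += data[(size_t)item * item_len + i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        results[item] = partial[0];
}
// launch: item_sum<<<n_items, 128>>>(d_data, d_results, item_len);
```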

This may not be clear at first glance, but think it through a little. Any algorithm that has hundreds or thousands of independent parts with no synchronization needed is a great fit for a parallel implementation, even with CUDA.
