CUDA - operations on single elements of a matrix - getting ideas

Posted 2024-10-30 04:54:37

I'm about to write a CUDA kernel to perform a single operation on every element of a matrix (e.g. square-rooting every element, or exponentiation, or calculating the sine/cosine if all the numbers are in [-1;1], etc.).

I've chosen the block/thread grid dimensions, and I think the code is pretty straightforward and simple, but I'm asking myself... what can I do to maximize coalescing/SM occupancy?

My first idea was: have each half-warp (16 threads) load a batch of data from global memory and then set them all to computing, but it turns out that this gives no overlap between memory transfers and computation... I mean all threads load data, then compute, then load data again, then compute again... which sounds really poor in terms of performance.

I thought using shared memory would be great, maybe with some sort of locality trick where a thread loads more data than it actually needs in order to ease other threads' work, but that sounds stupid too, because the second thread would have to wait for the first to finish loading data before starting its own work.

I'm not really sure I've framed my problem correctly; I'm just gathering ideas before starting on something concrete.

Every comment/suggestion/criticism is welcome. Thanks.
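
For concreteness, here's a minimal sketch of the kind of kernel I have in mind (the name square_elements and the grid-stride loop are just illustrative, not my actual code):

    #include <cuda_runtime.h>

    // Minimal elementwise sketch: squares every element in place.
    // The grid-stride loop lets one launch configuration cover any matrix
    // size; consecutive threads touch consecutive addresses, so global
    // memory accesses coalesce.
    __global__ void square_elements(float* data, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            float v = data[i];
            data[i] = v * v;
        }
    }

    // Launch for an n-element matrix stored as one contiguous array:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   square_elements<<<blocks, threads>>>(d_data, n);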

3 Answers

勿忘心安 2024-11-06 04:54:37

If you have defined the grid so that threads read along the major dimension of the array containing your matrix, then you have already guaranteed coalesced memory access, and there is little else to be done to improve performance. These sorts of O(N)-complexity operations really do not contain sufficient arithmetic intensity to give a good parallel speed-up over an optimized CPU implementation. Often the best strategy is to fuse multiple O(N) operations together into a single kernel, to improve the ratio of FLOPs to memory transactions.
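
As a hedged sketch of what such fusion might look like (the operations and the kernel name fused_elementwise are illustrative only), three logical O(N) passes collapse into one read and one write per element:

    #include <cuda_runtime.h>

    // Illustrative fusion: instead of three separate kernels
    // (scale, add bias, square root), do all three per element in one
    // pass, so each element is read from and written to global memory
    // exactly once.
    __global__ void fused_elementwise(float* data, float scale, float bias, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = sqrtf(data[i] * scale + bias);  // one load, one store
    }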

新一帅帅 2024-11-06 04:54:37

In my eyes, your problem is this:

load data ensemble from global memory

It seems that your algorithm idea is:

  1. Do something on the CPU - have some matrix
  2. Transfer the matrix from host to device memory
  3. Perform your operation on every element
  4. Transfer the matrix back from device to host memory
  5. Do something else on the CPU - sometimes go back to 1.

This kind of computation is almost always I/O-bandwidth limited (IO = memory IO), not compute limited. GPGPU computation can sustain very high memory bandwidth - but only between device memory and the GPU - transfers from host memory always go over the very slow PCIe bus (slow compared to the device memory connection, which can deliver 160+ GB/s on fast cards). So the one main thing for getting good results is to keep the data (the matrix) in device memory - preferably even generate it there, if your problem allows. Never migrate data back and forth between CPU and GPU, as the transfer overhead eats up all your speedup. Also keep in mind that your matrix must have a certain size to amortize the transfer overhead, which you can't avoid (computing a matrix with 10 x 10 elements would gain almost nothing; heck, it would even cost more).

The interleaved transfer/compute/transfer pattern is completely fine - that's how such GPU algorithms work - but only if the transfers are from device memory.
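
A rough host-side sketch of that principle, reusing the illustrative kernels from the answers above (error checking omitted): the PCIe transfers happen once each way, no matter how many elementwise passes run in between.

    #include <cuda_runtime.h>

    // Illustrative kernels sketched earlier in this thread.
    __global__ void square_elements(float* data, int n);
    __global__ void fused_elementwise(float* data, float scale, float bias, int n);

    // Transfer once, run many kernels on the device-resident data,
    // transfer the result back once.
    void process_on_device(float* h_matrix, int n)
    {
        float* d_matrix;
        cudaMalloc(&d_matrix, n * sizeof(float));
        cudaMemcpy(d_matrix, h_matrix, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;

        // Chain as many elementwise passes as needed; the matrix never
        // leaves device memory in between.
        square_elements<<<blocks, threads>>>(d_matrix, n);
        fused_elementwise<<<blocks, threads>>>(d_matrix, 2.0f, 1.0f, n);

        cudaMemcpy(h_matrix, d_matrix, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_matrix);
    }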

要走就滚别墨迹 2024-11-06 04:54:37

Using the GPU for something this trivial is overkill and will be slower than just keeping it on the CPU. Especially if you have a multicore CPU.

I have seen many projects showing the "great" advantages of the GPU over the CPU. They rarely stand up to scrutiny. Of course, goofy managers who want to impress their own managers want to show how "leading edge" their group is.

Someone in the department toils for months on getting silly GPU code optimized (which is generally 8x harder to read than the equivalent CPU code), then has the "equivalent" CPU code written by some Indian sweat shop (a programmer whose last project was PGP), compiles it with the slowest version of gcc they can find, with no optimization, and then touts their 2x speed improvement. And BTW, many overlook I/O speed as if it were somehow unimportant.
