JNCI/JCOL kernel optimization

Posted on 2024-12-01 23:14:12

I have a kernel running in OpenCL (via a JOCL front end) that runs horribly slowly compared to my other kernels, and I'm trying to figure out why and how to speed it up. The kernel is very basic: its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.

The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5 it will skip one point, then two, then one, and so on, so that on average 1.5 points are skipped between kept points. The input array is already on the GPU (it was generated by an earlier kernel) and the output array stays on the GPU, so there is no cost for transferring data to or from the CPU.
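To make the skip pattern concrete, here is a minimal host-side sketch in Python (a model of the indexing, not the kernel itself). It assumes that output sample `i` reads input sample `floor(i * (1 + skip))`; that mapping, and the names `source_indices` and `decimate`, are illustrative assumptions, and the real kernel may round differently.

```python
import math


def source_indices(n_in, skip):
    """Input indices kept when `skip` points are dropped, on average,
    between consecutive kept points.

    Assumption: output i reads input floor(i * (1 + skip)).
    """
    stride = 1.0 + skip
    indices = []
    i = 0
    while True:
        src = math.floor(i * stride)
        if src >= n_in:
            break
        indices.append(src)
        i += 1
    return indices


def decimate(samples, skip):
    """Gather the kept samples into a new, smaller list."""
    return [samples[j] for j in source_indices(len(samples), skip)]
```

With `skip = 1.5` and 12 input samples this keeps indices 0, 2, 5, 7, 10, which skips one point, then two, then one, then two. A reference like this can also be used on the host to check the kernel's output.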

This kernel runs 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast ones. I realize that I'm paying a penalty for not coalescing my array accesses, but I can't believe that alone would make it this slow. After all, every other kernel touches every sample in the array, so I would think that touching every Xth sample, even if not coalesced, should be at least about as fast as touching every sample.
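One rough way to reason about the coalescing penalty is to count how many memory segments a group of adjacent work-items touches for a given stride. The sketch below is a simplified model, not a measurement: the group size of 32, 4-byte floats, and 128-byte segments are assumed figures that are common on some GPUs but may not match the hardware in question, and the function name `segments_touched` is hypothetical.

```python
import math


def segments_touched(stride, group_size=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments read by one group of adjacent work-items.

    Assumption: work-item `lane` reads input element floor(lane * stride),
    and the array starts on a segment boundary.
    """
    segments = set()
    for lane in range(group_size):
        src = math.floor(lane * stride)
        segments.add((src * elem_bytes) // segment_bytes)
    return len(segments)
```

Under these assumptions a contiguous read (`stride = 1.0`) touches 1 segment per group, a stride of 2.5 touches 3, and a stride of 32 touches 32, one per work-item. For small strides the total memory touched stays below what a kernel reading every sample would touch, which is consistent with the expectation stated above.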

The original kernel actually decimated two arrays at once, one for real and one for imaginary data. I tried splitting it into two kernel calls, one to decimate the real data and one to decimate the imaginary data, but this didn't help at all. Likewise, I tried 'unrolling' the kernel by making each thread responsible for decimating 3-4 points, but that didn't help either. I've tried varying the amount of data passed into each kernel call (i.e. one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points), which let me squeeze out small performance gains, but nowhere near the order of magnitude I need for this kernel to be worth implementing on the GPU.

Just to give a sense of scale, this kernel takes 98 ms per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel takes 5 ms or less. What else could cause such a simple kernel to run so absurdly slowly compared to the rest of the kernels we're running? Is it possible that I simply can't optimize this kernel enough to justify running it on the GPU? I don't need it to run faster than the CPU, just not so much slower than the CPU, so that I can keep all processing on the GPU.
