Confused by profiler results for an OpenCL kernel (DSP window function)

Posted 2024-10-01 09:34:26

I completed a window function kernel in OpenCL. Basically, a window function just applies a set of coefficients to another set of numbers piece by piece (Wikipedia explains it better). In most cases I was able to fit the window-coefficient float array into constant cache.
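For concreteness, here is a minimal, untested host-side C sketch of the real-number case of that operation; the function and parameter names are mine, not from the kernel below:

/* Reference version of the windowing operation (real-number case): each
   frame of the input is multiplied element-wise by the window coefficients,
   with the window repeating if the frame is longer than the window. */
void window_reference(float *input, const float *window,
                      int windowSize, int primitivesPerFrame, int nFrames)
{
    for (int frame = 0; frame < nFrames; frame++)
        for (int i = 0; i < primitivesPerFrame; i++)
            input[frame * primitivesPerFrame + i] *= window[i % windowSize];
}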

I expected my results from Compute Prof to show that the host-to-device and device-to-host memory transfers would take more than 95% of the processing time. For nearly all of my cases it is only about 80% of the processing time. I am writing one 4.2-million-float array to the board and reading it back, and writing another float array that generally stays well below a million floats.

Does anything in the kernel look fishy? Any opinions on whether this is a problem that should run faster on a GPU than on a CPU in the first place? (I am still not 100% sure of that.) I am a little stunned that my gld_efficiency and gst_efficiency hover between 0.1 and 0.2, since I wrote this kernel with G80 global memory coalescing in mind. My overall global memory throughput seems alright at about 40 GB/s. The kernel is pretty simple and is posted below.

__kernel void window(__global float* inputArray, // first frame to ingest starts at 0. Sized to nFramesToIngest*framesize samples
    __constant float* windowArray, // may already be partly filled
    int windowSize, // size of a window frame, in floats
    int primitivesPerDataFrame, // number of primitives in each frame of inputArray
    int nInFramesThisCall, // each thread handles one frame, so this is how many threads this kernel launches
    int isRealNumbers // 0 for complex, non-zero for real
)
{
    int gid = get_global_id(0) + get_global_size(0) * get_global_id(1);

    if (gid < nInFramesThisCall) // make sure we don't execute unnecessary threads
    {
        if (isRealNumbers)
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[i % windowSize];
            }
        }
        else // complex: consecutive real/imaginary pairs share one coefficient
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[(i % windowSize) / 2];
            }
        }
    }
}


Comments (1)

寂寞笑我太脆弱 2024-10-08 09:34:26

How many threads (the OpenCL term is work-items, by the way) are you using? You need at least something in the hundreds to load a big GPU efficiently.
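For scale, a hypothetical host-side launch for the kernel as written might look like the sketch below; queue and windowKernel are assumed to be set up elsewhere, and none of this is from the question. With one frame per work-item, the global size is only nInFramesThisCall:

#include <CL/cl.h>

/* Hypothetical launch matching the question's one-frame-per-work-item
   mapping. The global size is rounded up to a multiple of the work-group
   size; the kernel's gid < nInFramesThisCall guard discards the padding. */
cl_int launch_window(cl_command_queue queue, cl_kernel windowKernel,
                     int nInFramesThisCall)
{
    size_t localSize  = 64;
    size_t globalSize = ((size_t)nInFramesThisCall + localSize - 1)
                        / localSize * localSize;
    /* If nInFramesThisCall is small (tens or low hundreds), this launches
       far too few work-items to keep a G80-class GPU busy. */
    return clEnqueueNDRangeKernel(queue, windowKernel, 1, NULL,
                                  &globalSize, &localSize,
                                  0, NULL, NULL);
}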

You say you want to make use of coalesced memory access, but a load with an offset like

int inputArrayIndex = (gid*primitivesPerDataFrame)+i;

will not make this possible in most cases. NVIDIA's G80 has pretty severe restrictions when it comes to coalescing; see the "OpenCL Best Practices Guide" for more information. Basically, work-items from one warp have to access elements of a 64- or 128-byte-aligned block in a certain fashion at the same time for loads and stores to be coalesced.

Or to give you an example: if primitivesPerDataFrame is 16, loads and stores of a warp are done at offsets spaced 16 elements apart, making any efficient coalescing impossible.
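To make that concrete, here is one way the access pattern could be restructured so that a warp touches consecutive floats. This is a sketch only (untested); the kernel name and the nTotalPrimitives parameter are my additions, not anything from the question:

__kernel void window_coalesced(__global float* inputArray,
    __constant float* windowArray,
    int windowSize,
    int primitivesPerDataFrame,
    int nTotalPrimitives) // = nInFramesThisCall * primitivesPerDataFrame
{
    // One work-item per sample: consecutive work-items in a warp now read
    // and write consecutive floats, which is the pattern G80 can coalesce.
    int gid = get_global_id(0) + get_global_size(0) * get_global_id(1);
    if (gid < nTotalPrimitives)
    {
        int i = gid % primitivesPerDataFrame; // position within the frame
        // Real-number case shown; the complex case would index with
        // (i % windowSize) / 2 exactly as in the original kernel.
        inputArray[gid] *= windowArray[i % windowSize];
    }
}

The integer modulo per sample is not free on G80, but with this mapping the loads and stores should coalesce, and the launch has one work-item per sample rather than per frame, which also addresses the occupancy point above.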
