My OpenCL kernel is slower on faster hardware. But why?

Posted 2024-08-28 11:35:54


As I was finishing coding my project for a multicore programming class, I came upon something really weird that I wanted to discuss with you.

We were asked to create any program that would show a significant improvement from being programmed for a multi-core platform. I decided to try coding something on the GPU to try out OpenCL. I chose the matrix convolution problem since I'm quite familiar with it (I had already parallelized it with open_mpi, with great speedup for large images).

So here it is, I select a large GIF file (2.5 MB) [2816X2112] and I run the sequential version (original code) and I get an average of 15.3 seconds.

I then run the new OpenCL version I just wrote on my MBP's integrated GeForce 9400M and I get timings of 1.26 s on average. So far so good, it's a speedup of 12X!

But now I go into my energy saver panel and turn on the "Graphic Performance Mode". That mode turns off the GeForce 9400M and turns on the GeForce 9600M GT my system has. Apple says this card is twice as fast as the integrated one.

Guess what: my timings using the kick-ass graphics card are 3.2 seconds on average… My 9600M GT seems to be more than two times slower than the 9400M.

For those of you who are OpenCL-inclined: I copy all data to device buffers before starting, so the actual computation doesn't require a round trip to main RAM. Also, I let OpenCL determine the optimal local work size, since I've read that the implementations are pretty good at figuring that parameter out.
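(By "letting OpenCL determine the local work size" I just mean passing NULL for the local_work_size argument; a rough sketch with placeholder names, not my actual variables:)

    // Sketch only: local_work_size = NULL lets the implementation pick the work-group size.
    size_t global[2] = { imageWidth, imageHeight };   // placeholder dimensions
    err = clEnqueueNDRangeKernel(queue, convolutionKernel,
                                 2,        // work dimensions
                                 NULL,     // no global offset
                                 global,   // global work size
                                 NULL,     // local work size: let OpenCL decide
                                 0, NULL, NULL);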

Does anyone have a clue?

Edit: full source code with makefiles here: http://www.mathieusavard.info/convolution.zip

cd gimage
make
cd ../clconvolute
make
put a large input.gif in clconvolute and run it to see results


Comments (5)

浅沫记忆 2024-09-04 11:35:54


The 9400M is integrated into your memory controller, whereas the 9600M GT is a discrete card connected to your memory controller via the PCI-e bus. This means that when you transfer memory to the 9400M it just allocates it in system RAM. The 9600M, on the other hand, sends the data over PCI-e to the dedicated graphics memory on the card. This transfer is what makes your benchmark seem slower.

If you would like to compare the performance of the two graphics cards you should use the OpenCL profiling function instead of the clock function you are currently using.

cl_int clGetEventProfilingInfo (cl_event event, cl_profiling_info param_name,
size_t param_value_size, void *param_value, size_t *param_value_size_ret)

Pass the function the event that was created when you enqueued the kernel, with CL_PROFILING_COMMAND_START as the second argument to get the starting point of the kernel in nanoseconds, and CL_PROFILING_COMMAND_END to get its ending point. Make sure to call this AFTER the execution of the kernel has finished (the events hold their values until they go out of scope). You can also get the time it took to transfer the data to the device by applying the same function to the events from enqueueing the buffers. Here is an example:

        TRACE("Invoking the Kernel")
    cl::vector<cl::Event> matMultiplyEvent;
    cl::NDRange gIndex(32,64);
    cl::NDRange lIndex(16,16);

    err = queueList["GPU"]->enqueueNDRangeKernel(
                                                 matrixMultiplicationKernel, 
                                                 NULL, 
                                                 gIndex, 
                                                 lIndex, 
                                                 &bufferEvent,
                                                 matMultiplyEvent);
    checkErr(err, "Invoke Kernel");


    TRACE("Reading device data into array");
    err = queueList["GPU"]->enqueueReadBuffer(thirdBuff, 
                                              CL_TRUE,
                                              0,
                                              (matSize)*sizeof(float),
                                              testC,
                                              &matMultiplyEvent,
                                              bufferEvent);
    checkErr(err, "Read Buffer");
    matMultiplyEvent[0].wait();
    for (int i = 0; i < matSize; i++) {
        if (i%64 == 0) {
            std::cout << "\n";
        }
        std::cout << testC[i] << "\t";
    }
    long transferBackStart = bufferEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_START>();
    long transferBackEnd = bufferEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double transferBackSeconds = 1.0e-9 * (double)(transferBackEnd- transferBackStart);

    long matrixStart = matMultiplyEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_START>();
    long matrixEnd = matMultiplyEvent[0].getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double dSeconds = 1.0e-9 * (double)(matrixEnd - matrixStart);

This example uses the C++ wrapper but the concept should be the same.
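If you are calling the C API directly, the same measurement looks roughly like this (a sketch: it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that kernel_event is the event returned by clEnqueueNDRangeKernel):

    // Sketch: profiling a kernel with the plain C API.
    cl_ulong start = 0, end = 0;

    clWaitForEvents(1, &kernel_event);   // make sure the kernel has finished
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    double kernelSeconds = 1.0e-9 * (double)(end - start);   // values are in nanoseconds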

Hope this helps.

吻风 2024-09-04 11:35:54


I get the same results, and I'm unsure why. My kernel involves very minimal copying to or from the device (I pre-send all needed data for all kernel calls, and only return a 512x512 image). It's a raytracer, so the kernel work vastly outweighs the copy back (400+ ms vs. 10 ms). Still, the 9600M GT is about 1.5x-2x slower.

According to nVidia's listing, the 9600M GT should have 32 SPs (twice the number of the 9400M). It's presumably clocked higher too.

The 9600M GT does seem faster in some cases, e.g. games. See this comparison:
http://www.videocardbenchmark.net/video_lookup.php?cpu=GeForce+9600M+GT

According to ars technica:

Furthermore, an interesting tidbit about Snow Leopard's implementation is revealed by early tests. Though Snow Leopard doesn't seem to enable dual GPUs or on-the-fly GPU switching for machines using the NVIDIA GeForce 9400M chipset—a limitation carried over from Leopard—it does appear that the OS can use both as OpenCL resources simultaneously. So even if you have the 9600M GT enabled on your MacBook Pro, if OpenCL code is encountered in an application, Snow Leopard can send that code to be processed by the 16 GPU cores sitting pretty much dormant in the 9400M. The converse is not true, though—when running a MacBook Pro with just the 9400M enabled, the 9600M GT is shut down entirely to save power, and can't be used as an OpenCL resource.

This seems to be the opposite of what we are seeing. Also, I am explicitly setting up a CL context on only one device at a time.

There are some suggestions in the ars forums that the 9600M GT doesn't support doubles as well, which would explain this problem. I might try to write up a synthetic benchmark to test this hypothesis.
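Before writing the full benchmark, a quick way to check the doubles question is to look at the device's extension string for cl_khr_fp64 (a rough sketch; `device` is assumed to be the cl_device_id already in hand):

    // Sketch: check whether a device advertises double-precision support.
    char extensions[4096] = {0};
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                    sizeof(extensions), extensions, NULL);
    if (strstr(extensions, "cl_khr_fp64") != NULL)
        printf("device supports doubles\n");
    else
        printf("no cl_khr_fp64: doubles are unsupported on this device\n");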

北方。的韩爷 2024-09-04 11:35:54


I ran into the same issue when I was testing out OpenCL on my MacBook. I believe it's because the GeForce 9400M has a higher bus speed to the main memory bank than the GeForce 9600M GT. So even though the GeForce 9600M GT has much more power than the GeForce 9400M, the time required to copy the memory to the GPU is too long to see the benefit of the more powerful GPU in your situation. It could also be caused by inappropriate work-group sizes.
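If you suspect the work-group size, you can query what the device and the kernel actually allow and set the local size yourself instead of guessing (a sketch; `device` and `kernel` are assumed to already exist):

    // Sketch: query the limits that constrain a sensible local work size.
    size_t deviceMax = 0, kernelMax = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(deviceMax), &deviceMax, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelMax), &kernelMax, NULL);
    printf("device max work-group size: %zu, kernel max: %zu\n",
           deviceMax, kernelMax);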

Also I found this site very helpful in my OpenCL experience.

http://www.macresearch.org/opencl

祁梦 2024-09-04 11:35:54


Performance is not the only difference between a GeForce 9400M and a GeForce 9600M GT. A big one is that the latter is a discrete GPU. With this comes a slew of differences, amongst which the following can have an impact:

  • tendency of drivers to batch more commands
  • memory is not uniform; the GPU generally only accesses its own memory, and the driver moves memory back and forth over the PCI-E bus

I'm sure I'm missing some...

Here are a bunch of ideas you can try:

  • avoid calling clFinish. The way you call it between the memory load and the execution forces the driver to do more work than necessary, and it stalls the GPU (see the sketch after this list).
  • profile your code to see what is taking the time. I'm not aware of dedicated CL performance-analysis support yet, but with your clFinish calls you can get a first-order estimate by simply measuring on the CPU side. Note that it's hard in general to distinguish what is due to latency and what is due to throughput.
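For the first point, the idea is to express the upload-then-execute dependency with events rather than a clFinish in the middle; roughly (a sketch with made-up handles, not the poster's actual code):

    // Sketch: chain commands with events instead of an intermediate clFinish.
    cl_event writeDone, kernelDone;

    // Non-blocking upload; remember its completion event.
    clEnqueueWriteBuffer(queue, inputBuf, CL_FALSE, 0, bytes, hostData,
                         0, NULL, &writeDone);

    // The kernel waits on the upload through its event wait list.
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                           1, &writeDone, &kernelDone);

    // One blocking read at the end; it waits on the kernel.
    clEnqueueReadBuffer(queue, outputBuf, CL_TRUE, 0, bytes, hostResult,
                        1, &kernelDone, NULL);
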
为你鎻心 2024-09-04 11:35:54


I'm new to OpenCL, so I may be a bit naive, but I doubt you needed to go into the energy saver panel to switch the OpenCL compute device. I believe that you choose the device when setting up the OpenCL context in your code.

My hypothesis: 1) When you run your code without disabling your integrated GPU first, OpenCL chooses your discrete GPU as the compute device. Your code runs on the (fast) discrete GPU. 2) When you disable the integrated GPU first, you force the load of running the OS X GUI onto your discrete card. When you run your code, it runs on the discrete GPU, but it contends with your GUI for resources.
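A minimal sketch of what I mean by choosing the device in code (indices and names are just an example):

    // Sketch: enumerate GPU devices and build the context on one specific device.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id gpus[2];
    cl_uint numGpus = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, gpus, &numGpus);

    cl_device_id chosen = gpus[0];   // pick whichever GPU you want to benchmark
    cl_context ctx = clCreateContext(NULL, 1, &chosen, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, chosen,
                                                  CL_QUEUE_PROFILING_ENABLE, NULL);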

This answer is coming 11 months after the question was asked, but hopefully it'll be useful to someone...
