How long does a call to OpenCL take?
I'm currently implementing an algorithm that does a lot of linear algebra on small matrices and vectors. The code is fast, but I'm wondering whether it would make sense to implement it on a GPGPU instead of the CPU.
I'm able to store most of the matrices and vectors in GPU memory as a preprocessing step, and I have profiled the multiplication algorithms; they are, of course, way faster on the GPU.
But now for my real question:
how do I determine the overhead of making calls to the GPU from the CPU? How many cycles am I losing waiting for my code to be executed, and things like that?
I hope someone has some input.
4 Answers
It is hard to determine the exact "overhead" of calling OpenCL, because operations on the GPU can be done in parallel with whatever else is running on the CPU.
Depending on your application, you can, for example, transfer a chunk of data to the GPU while, in parallel, the CPU preprocesses the following chunk. Similarly, while code is executing on the GPU, you can do some prep work on the CPU on data that will be needed in the future.
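For example, a minimal sketch of that kind of overlap (it assumes a command queue, kernel, device buffer and host chunks already exist; prepare_chunk() and all the names here are placeholders, not from the question):

    #include <CL/cl.h>
    #include <stddef.h>

    /* Hypothetical CPU-side preparation of one chunk of data. */
    static void prepare_chunk(float *chunk) { (void)chunk; /* ... */ }

    /* Upload chunk 0 without blocking, prepare chunk 1 on the CPU in the
     * meantime, then launch the kernel for chunk 0 once its transfer is done. */
    static void overlap_example(cl_command_queue queue, cl_kernel kernel,
                                cl_mem dev_buf, float *host_chunk[2],
                                size_t chunk_bytes, size_t work_items)
    {
        cl_event upload;

        /* CL_FALSE = non-blocking: the call returns immediately and the DMA
         * transfer runs in the background. */
        clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, chunk_bytes,
                             host_chunk[0], 0, NULL, &upload);

        /* While chunk 0 is in flight, do CPU work on the next chunk. */
        prepare_chunk(host_chunk[1]);

        /* The kernel consuming chunk 0 waits on the upload event, so it only
         * starts once the transfer has finished. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_items, NULL,
                               1, &upload, NULL);
        clReleaseEvent(upload);
    }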
The transfers to the GPU will be done via DMA transfers, which are very fast in general.
In my experience, I was able to transfer around 4 MB of data to the GPU in the order of 4 milliseconds (modern GPU, modern motherboard), while doing some processing on the data that was sent previously.
From that, it seems safe to say you can upload and download on the order of 1 GB of data per second to the GPU and do some processing on that data.
In your case, either the GPU or the CPU side will be the bottleneck: the CPU side, if it cannot feed, say, 1 GB of prepared data to the GPU per second. That may very well be limited by your disk I/O.
To test your GPU path, set up a bunch of buffers of data ready to process. You would want to keep re-sending that data to the GPU, processing it, and downloading the results (which you will discard). Measure the throughput and compare it to the throughput of the CPU version of your application.
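A rough sketch of such a throughput test (placeholder names throughout; error checking is omitted, and it assumes an in-order command queue plus pre-created buffers and kernel):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <time.h>

    /* Repeatedly upload a chunk, run the kernel on it and download the
     * (discarded) result, then report how many MB/s went through the whole
     * upload + compute + download path. */
    static double gpu_throughput(cl_command_queue queue, cl_kernel kernel,
                                 cl_mem in_buf, cl_mem out_buf,
                                 void *host_in, void *host_out,
                                 size_t bytes, size_t work_items, int iterations)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (int i = 0; i < iterations; ++i) {
            clEnqueueWriteBuffer(queue, in_buf, CL_FALSE, 0, bytes, host_in,
                                 0, NULL, NULL);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_items, NULL,
                                   0, NULL, NULL);
            clEnqueueReadBuffer(queue, out_buf, CL_FALSE, 0, bytes, host_out,
                                0, NULL, NULL);
        }
        clFinish(queue);   /* wait for all queued work to drain */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double mb_per_s = (double)bytes * iterations / (1024.0 * 1024.0) / secs;
        printf("GPU path: %.1f MB/s (upload + kernel + download)\n", mb_per_s);
        return mb_per_s;
    }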
Don't measure just the GPU processing part, because transfers and processing on the GPU compete for GPU memory controller time and affect each other's pace.
Also, if what you want is very good response time on small pieces of data rather than good throughput, you probably won't benefit from going through the GPU, because it introduces a bit of latency into your processing.
The important thing to consider here is the time it takes to copy the data to the GPU and back. Even if the GPU implementation is much faster, the time spent doing transfers may wipe out any advantage.
Furthermore, if you are very serious about the accuracy of your algebra, you may want to consider that the operations you want to perform may not be available natively on the GPU in double precision.
Given that you say your matrices and vectors are small, I suggest checking out SIMD optimisations that may improve the performance of your algorithm on the CPU.
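As a generic illustration of what such a SIMD kernel can look like (SSE intrinsics; the 4x4 size and the names are illustrative only, not taken from the question):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Multiply a 4x4 row-major matrix by a 4-vector with SSE.
     * Purely illustrative; the poster's actual matrix sizes are not known. */
    static void mat4_vec4_mul(const float m[16], const float v[4], float out[4])
    {
        __m128 vec = _mm_loadu_ps(v);
        for (int row = 0; row < 4; ++row) {
            /* Element-wise multiply one matrix row with the vector... */
            __m128 prod = _mm_mul_ps(_mm_loadu_ps(&m[row * 4]), vec);
            /* ...then horizontally add the four products into one scalar. */
            __m128 sum = _mm_add_ps(prod, _mm_movehl_ps(prod, prod));
            sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 1));
            _mm_store_ss(&out[row], sum);
        }
    }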
You can use cl_event objects to track the time that the actual computations take (latency). If you actually mean CPU cycles, use RDTSC (or its intrinsic, __rdtsc in MSVC) for nanosecond-precise timing of the actual API calls. The RDTSC instruction (read time stamp counter) returns the number of clock cycles the CPU has completed since power-up.
If it really is that easy to upload, then you can batch up calls and perhaps add a dimension to your NDRange to do multiple computations in one call. Of course, the details depend on your kernel implementation.
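A sketch of the event-based timing (it assumes the command queue was created with the CL_QUEUE_PROFILING_ENABLE property; the names are placeholders):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Time one kernel launch on the device using OpenCL profiling events.
     * The queue must have been created with CL_QUEUE_PROFILING_ENABLE. */
    static void time_kernel(cl_command_queue queue, cl_kernel kernel,
                            size_t work_items)
    {
        cl_event evt;
        cl_ulong start = 0, end = 0;

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_items, NULL,
                               0, NULL, &evt);
        clWaitForEvents(1, &evt);

        /* Profiling timestamps are reported in nanoseconds of device time. */
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        printf("kernel took %.3f us on the device\n", (end - start) / 1000.0);

        clReleaseEvent(evt);
    }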
I suggest using the following to measure the number of CPU cycles:
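A typical rdtsc-based cycle count might look like this (a sketch using the compiler intrinsic; do_work() stands in for whatever code is being measured):

    #include <stdint.h>
    #include <stdio.h>

    #if defined(_MSC_VER)
    #include <intrin.h>      /* __rdtsc on MSVC */
    #else
    #include <x86intrin.h>   /* __rdtsc on GCC/Clang */
    #endif

    /* Placeholder workload; replace with the code you actually want to time. */
    static void do_work(void)
    {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i)
            x += i * 0.5;
    }

    int main(void)
    {
        /* Read the time stamp counter before and after the measured code. */
        uint64_t start = __rdtsc();
        do_work();
        uint64_t cycles = __rdtsc() - start;
        printf("do_work took about %llu cycles\n", (unsigned long long)cycles);
        return 0;
    }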