Can we benchmark how fast CUDA or OpenCL is compared to CPU performance?

How much faster can an algorithm run as CUDA or OpenCL code than on a single general-purpose processor core? (Assume the algorithm is written and optimized for both the CPU and the GPU target.)

I know it depends on both the graphics card and the CPU, but say, one of NVIDIA's fastest GPUs versus (a single core of) an Intel i7 processor?

And I know it also depends on the type of algorithm.

I do not need a strict answer, just examples from experience, like: an image-manipulation algorithm using double-precision floating point and 10 operations per pixel used to take 5 minutes and now runs in x seconds on this hardware.

Comments (7)

卷耳 2024-10-11 08:58:29

Your question is overly broad and very difficult to answer. Moreover, only a small percentage of algorithms (the ones that work without much shared state) are feasible on GPUs.

But I do want to urge you to be critical of claims. I work in image processing and have read many an article on the subject, and quite often, in the GPU case, the time to upload input data to the GPU and download the results back to main memory is not included in the calculation of the speedup factor.

While there are a few cases where this doesn't matter (both transfers are small, or a second-stage calculation further reduces the result in size), usually one does have to transfer the initial data and the results.

I've seen this turn a claimed plus into a negative, because the upload/download time alone was longer than the main CPU would need to do the calculation.
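
To make that concrete, here is a minimal host-side sketch of an end-to-end measurement, where the timer wraps the upload, the kernel launch, and the download together instead of the kernel alone. It assumes an OpenCL context, command queue, kernel and device buffer have already been created elsewhere; the function and variable names are illustrative only, and error checking is omitted for brevity.

/* Sketch only: time an OpenCL run *including* host<->device transfers.
   Setup (platform, context, queue, program, kernel, buffer) is assumed
   to have been done elsewhere; error checking is omitted for brevity. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <time.h>

static double now_sec(void)                 /* POSIX monotonic clock */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

double time_full_round_trip(cl_command_queue q, cl_kernel k, cl_mem dev_buf,
                            float *host_in, float *host_out, size_t n)
{
    size_t bytes = n * sizeof(float);
    double t0 = now_sec();

    /* 1. upload the input -- often left out of published speedup factors */
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, host_in, 0, NULL, NULL);

    /* 2. launch the kernel over n work items */
    clSetKernelArg(k, 0, sizeof(cl_mem), &dev_buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 3. download the result and wait for everything to drain */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);
    clFinish(q);

    return now_sec() - t0;                  /* end-to-end wall-clock seconds */
}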

Pretty much the same thing applies to combining results of different GPU cards.

Update: Newer GPUs seem to be able to upload/download and compute at the same time using ping-pong buffers. But the advice to check the boundary conditions thoroughly still stands. There is a lot of spin out there.

Update 2: Quite often, using a GPU that is shared with the video output for this is not optimal. Consider, e.g., adding a low-budget card for video and using the onboard video for GPGPU tasks.

旧伤还要旧人安 2024-10-11 08:58:29

I think that this video introduction to OpenCL gives a good answer to your question in the first or second episode (I do not remember which). I think it was at the end of the first episode...

In general it depends on how well you can "parallelize" the problem. The problem size itself is also a factor, because it costs time to copy the data to the graphics card.

诗酒趁年少 2024-10-11 08:58:29

It depends very much on the algorithm and how efficient the implementation can be.

Overall, it's fair to say that GPUs are better at raw computation than CPUs. Thus, one upper bound is the theoretical GFLOPS rating of a top-end GPU divided by that of a top-end CPU. You can do a similar computation for theoretical memory bandwidth.

For example, 1581.1 GFLOPS for a GTX 580 vs. 107.55 GFLOPS for an i7 980XE. Note that the GTX 580 rating is for single precision. I believe you need to cut that down by a factor of 4 for non-Tesla Fermi-class cards to get the double-precision rating, which leaves roughly 395 GFLOPS. So in this instance, you might expect roughly 4x.
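
As a back-of-the-envelope check of that estimate, here is the same arithmetic spelled out in a tiny illustrative C program, using only the peak figures quoted above:

#include <stdio.h>

int main(void)
{
    double gpu_sp_gflops = 1581.1;            /* GTX 580, single precision     */
    double gpu_dp_gflops = gpu_sp_gflops / 4; /* non-Tesla Fermi: roughly SP/4 */
    double cpu_gflops    = 107.55;            /* i7 980XE                      */

    /* prints about 3.7, i.e. the "roughly 4x" upper bound mentioned above */
    printf("theoretical double-precision upper bound: %.1fx\n",
           gpu_dp_gflops / cpu_gflops);
    return 0;
}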

Caveats on why you might do better (or see results which claim far bigger speedups):

  1. GPUs have better memory bandwidth than CPUs once the data is on the card. Sometimes, memory-bound algorithms can do well on the GPU.

  2. Clever use of caches (texture memory, etc.) can let you do better than the advertised bandwidth.

  3. Like Marco says, the transfer time often doesn't get included. I personally always include such time in my work and have found that the biggest speedups I've seen are in iterative algorithms where all the data fits on the GPU (I've personally gotten over 300x going from a midrange CPU to a midrange GPU).

  4. Apples-to-oranges comparisons. Comparing a top-end GPU vs. a low-end CPU is inherently unfair. The rebuttal is that a high-end CPU costs much more than a high-end GPU. Once you go to a GFLOPS/$ or GFLOPS/Watt comparison, it can look much more favorable to the GPU.

夏末 2024-10-11 08:58:29

// The kernel is deliberately (almost) empty: each work item only looks up
// its own global ID, so timing it measures the cost of spawning the threads
// rather than the cost of any real work.
__kernel void vecAdd(__global float* results )
{
   int id = get_global_id(0);
}

This kernel code can spawn 16M threads on a new $60 R7-240 GPU in 10 milliseconds.

That is equivalent to 16 thread creations or context switches every 10 nanoseconds. How does a $140 FX-8150 8-core CPU compare? About 1 thread per 50 nanoseconds per core.

Every instruction added to this kernel is a win for the GPU, until it starts branching.

请你别敷衍 2024-10-11 08:58:29

Your question is, in general, hard to answer; there are simply many different variables that make it hard to give answers that are either accurate or fair.

Notably, you are comparing 1) choice of algorithm, 2) relative performance of hardware, 3) compiler optimisation ability, 4) choice of implementation language, and 5) efficiency of the algorithm implementation, all at the same time...

Note that, for example, different algorithms may be preferable on a GPU vs. a CPU, and that data transfers to and from the GPU need to be accounted for in timings, too.

AMD has a case study (several, actually) on OpenCL performance for code executing on the CPU and on the GPU. Here is one with performance results for sparse matrix-vector multiply.
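
For a sense of what such a kernel looks like, here is a naive CSR sparse-matrix-vector-multiply kernel with one row per work item. This is only an illustrative sketch, not the implementation from the AMD case study:

/* Illustrative sketch only, not the AMD case-study code:
   y = A * x, with A stored in CSR format, one row per work item. */
__kernel void spmv_csr(__global const int   *row_ptr,  /* rows + 1 entries */
                       __global const int   *col_idx,  /* nnz entries      */
                       __global const float *values,   /* nnz entries      */
                       __global const float *x,        /* dense input      */
                       __global float       *y,        /* dense output     */
                       const int rows)
{
    int row = get_global_id(0);
    if (row >= rows)
        return;

    float sum = 0.0f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += values[j] * x[col_idx[j]];
    y[row] = sum;
}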

浮云落日 2024-10-11 08:58:29

I've seen figures ranging from 2x to 400x. I also know that midrange GPUs cannot compete with high-end CPUs in double-precision computation: MKL on an 8-core Xeon will be faster than CULA or CUBLAS on a $300 GPU.

OpenCL is anecdotally much slower than CUDA.

拥抱我好吗 2024-10-11 08:58:29

A new benchmark suite called SHOC (Scalable Heterogeneous Computing) from Oak Ridge National Lab and Georgia Tech has both OpenCL and CUDA implementations of many important kernels. You can download the suite from http://bit.ly/shocmarx. Enjoy.
