GPU vs. CPU performance comparison for common algorithms

Posted on 2024-08-04 03:41:16

Comments (5)

朦胧时间 2024-08-11 03:41:16

GPUs are highly specialized hardware designed to do a small set of tasks very well and highly parallelized. This is basically arithmetic (particularly single precision floating point math although newer GPUs do quite well with double precision). As such they're only suited to particular algorithms. I'm not sure if sorting fits that category (in the general case at least).

More common examples are pricing of financial instruments, large amounts of matrix maths, and even defeating encryption (by brute force). That being said, I did find "Fast parallel GPU-sorting using a hybrid algorithm".

Another commonly quoted example is running SETI@HOME on an Nvidia GPU but it's comparing apples to oranges. The units of work for GPUs are different (and highly limited) compared to what CPUs ordinarily do.
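
As a rough illustration of the "large amounts of matrix maths" case, a naive single-precision CUDA matrix multiply looks like the sketch below; the names, sizes, and launch parameters are illustrative only, and production code would use cuBLAS or a tiled shared-memory version.

    // Naive single-precision matrix multiply: C = A * B, all matrices N x N.
    // One thread computes one element of C.
    __global__ void matmul(const float* A, const float* B, float* C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Host-side launch (error checking omitted for brevity):
    //   dim3 block(16, 16);
    //   dim3 grid((N + 15) / 16, (N + 15) / 16);
    //   matmul<<<grid, block>>>(dA, dB, dC, N);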

同展鸳鸯锦 2024-08-11 03:41:16

BE WARY, VERY WARY of any performance numbers quoted for GPGPU. Lots of people like to post really impressive numbers that don't take into consideration the transfer time needed to get the input data from the CPU to the GPU and the output data back, both going over a PCIe bottleneck.
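
A simple way to keep such numbers honest is to time the whole round trip, PCIe copies included. A sketch using CUDA events; myKernel is a placeholder standing in for whatever workload is being measured.

    // Placeholder workload so the snippet is self-contained.
    __global__ void myKernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Measure host->device copy + kernel + device->host copy as one unit,
    // instead of timing only the kernel.
    float timeRoundTripMs(const float* h_in, float* h_out,
                          float* d_in, float* d_out,
                          int n, dim3 grid, dim3 block)
    {
        size_t bytes = n * sizeof(float);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // PCIe transfer in
        myKernel<<<grid, block>>>(d_in, d_out, n);                // actual GPU work
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // PCIe transfer out
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // end-to-end milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }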

┼── 2024-08-11 03:41:16

Have a look at Thrust:

Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity.
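
A minimal sketch of what using Thrust looks like, sorting a large array on the GPU (standard Thrust API, not code from the answer):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        // Fill a host vector with random keys.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = rand();

        // Copy to the device, sort there, copy back.
        thrust::device_vector<int> d = h;    // host -> device transfer
        thrust::sort(d.begin(), d.end());    // parallel sort on the GPU
        thrust::copy(d.begin(), d.end(), h.begin());
        return 0;
    }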

一抹微笑 2024-08-11 03:41:16

There are quite a few samples of this sort of thing on NVidia's website. Bear in mind that some things such as sorting need special algorithms for efficient parallelism and may not be quite as efficient as a non-threaded algorithm on a single core.

﹎☆浅夏丿初晴 2024-08-11 03:41:16

Image resizing must be common on many websites that accept image uploads.

Resizing a roughly 2600 x 2000, 2 MB JPEG image to 512x512 took 23.5 milliseconds in C# with the absolute lowest quality options and nearest-neighbour sampling. The function used was based on graphics.DrawImage(). CPU usage was 21.5%.

Getting "rgba byte array" extraction on C# side and sending it to GPU and resizing in GPU and getting results back into an image took 6.3 milliseconds and CPU usage was %12.7. This was done with a %55 cheaper gpu with just 320 cores.

That (23.5 ms vs 6.3 ms) is only a 3.73x speedup.

The limiting factor here was sending the extracted 20 MB of RGB data (the JPEG is only 2 MB!) to the GPU. That time-consuming part was nearly 90% of the total time, including the C#-side byte array extraction. So I guess there would be at least a 30x speedup if the extraction part could also be done on the GPU.

30X is not bad.

You could then pipeline the extraction stage with the resizing stage to hide the memory-copy latency and get even more speed; that could be 40x-50x.
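
That pipelining amounts to splitting the data into chunks and overlapping the PCIe copies with kernel execution on CUDA streams, which requires page-locked host memory (cudaMallocHost / cudaHostAlloc). A rough sketch, with processChunk standing in for whatever per-chunk work is done:

    // Placeholder per-chunk kernel so the sketch is self-contained.
    __global__ void processChunk(const unsigned char* in, unsigned char* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    void pipelined(const unsigned char* h_in, unsigned char* h_out,
                   unsigned char* d_in, unsigned char* d_out, int totalBytes)
    {
        const int nStreams = 4;
        int chunk = totalBytes / nStreams;
        cudaStream_t streams[nStreams];
        for (int i = 0; i < nStreams; ++i)
            cudaStreamCreate(&streams[i]);

        // Each chunk's copy-in, kernel, and copy-out go on its own stream,
        // so copies of one chunk overlap with compute on another.
        for (int i = 0; i < nStreams; ++i) {
            int offset = i * chunk;
            cudaMemcpyAsync(d_in + offset, h_in + offset, chunk,
                            cudaMemcpyHostToDevice, streams[i]);
            processChunk<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(
                d_in + offset, d_out + offset, chunk);
            cudaMemcpyAsync(h_out + offset, d_out + offset, chunk,
                            cudaMemcpyDeviceToHost, streams[i]);
        }
        cudaDeviceSynchronize();   // wait for every stream to finish

        for (int i = 0; i < nStreams; ++i)
            cudaStreamDestroy(streams[i]);
    }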

Then increase the sampling quality (bicubic instead of nearest neighbour, for example) and the GPU side has an even bigger advantage. Adding a 5x5 Gaussian filter added only 0.77 milliseconds; the CPU would add rather more time for that step, especially if the required Gaussian parameters differ from the C#/.NET implementation.
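
For reference, a direct (non-separable) 5x5 Gaussian pass like the one mentioned could be sketched as below; the single-channel float layout and the constant-memory weights are assumptions for illustration.

    // Naive 5x5 Gaussian blur on a single-channel float image.
    // 'weights' is a 25-entry filter in constant memory, normalised to sum to 1.
    __constant__ float weights[25];

    __global__ void gauss5x5(const float* src, float* dst, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float sum = 0.0f;
        for (int dy = -2; dy <= 2; ++dy)
            for (int dx = -2; dx <= 2; ++dx) {
                // Clamp to the image border.
                int sx = min(max(x + dx, 0), w - 1);
                int sy = min(max(y + dy, 0), h - 1);
                sum += weights[(dy + 2) * 5 + (dx + 2)] * src[sy * w + sx];
            }
        dst[y * w + x] = sum;
    }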


Even if you are not satisfied with the speedup ratio, offloading to the GPU and keeping a "free core" on the CPU is still advantageous for pushing more work onto that server.

Add in the GPU's power consumption (30 W vs 125 W in this example) and it is even more advantageous.


A CPU can hardly win a

 C[i]=A[i]+B[i]

benchmark when both sides run optimized code, and you can still offload half of the arrays to the GPU and finish sooner by using the CPU and GPU at the same time.
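
A sketch of that kind of split, with the GPU taking the first half of the arrays while the CPU adds the second half (names and buffer handling are illustrative):

    // C[i] = A[i] + B[i] on the GPU for the first half of the arrays.
    __global__ void addKernel(const float* A, const float* B, float* C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

    void addSplit(const float* A, const float* B, float* C, int n,
                  float* dA, float* dB, float* dC)   // device buffers for the first half
    {
        int half = n / 2;

        // Issue the GPU's half; copies and the kernel return control to the
        // host immediately (truly async copies need pinned host memory).
        cudaMemcpyAsync(dA, A, half * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyAsync(dB, B, half * sizeof(float), cudaMemcpyHostToDevice);
        addKernel<<<(half + 255) / 256, 256>>>(dA, dB, dC, half);

        // Meanwhile the CPU handles the second half.
        for (int i = half; i < n; ++i)
            C[i] = A[i] + B[i];

        // Collect the GPU's half (this call synchronizes with the kernel).
        cudaMemcpy(C, dC, half * sizeof(float), cudaMemcpyDeviceToHost);
    }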


GPUs are not built for non-uniform work. They have deep pipelines, so recovering from a stall caused by branching takes a long time. SIMD-style hardware also forces every work-item in a group to do the same thing; when one work-item diverges from the group, the hardware loses lock-step and inserts bubbles into the whole SIMD pipeline, or the other work-items simply wait at a sync point. So branching hurts both the deep and the wide parts of the pipeline and, under perfectly chaotic conditions, can make the GPU even slower than the CPU.
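
A small hypothetical kernel illustrating that divergence effect: neighbouring threads in the same warp can take different branches, so the hardware runs both paths one after the other with part of the warp masked off.

    __global__ void divergent(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Neighbouring threads can disagree here, so the warp executes
        // both branches serially, masking out the inactive lanes.
        if (in[i] > 0.0f)
            out[i] = sqrtf(in[i]);       // path A
        else
            out[i] = -in[i] * in[i];     // path B
    }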
