Computing performance of CUFFT
I am running CUFFT on chunks of size N*N/p divided across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
- Send an N*N/p chunk to each GPU
- Run a batched 1-D FFT over the rows of each chunk on the p GPUs (a sketch of steps 1-3 follows this list)
- Copy the N*N/p chunks back to the host and transpose the entire dataset
- Repeat step 1
- Repeat step 2
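For concreteness, here is a minimal sketch of what steps 1-3 could look like for one GPU's chunk with the cuFFT API (simplified, with no error checking; the names rowsPerGpu, h_chunk, and d_chunk are placeholders, with rowsPerGpu = N/p):

#include <cufft.h>
#include <cuda_runtime.h>

// Row pass for one GPU's chunk: copy in, batched 1-D C2C FFT over all rows, copy out.
void fft_rows_on_gpu(cufftComplex *h_chunk, int N, int rowsPerGpu)
{
    size_t bytes = sizeof(cufftComplex) * (size_t)N * rowsPerGpu;
    cufftComplex *d_chunk;

    cudaMalloc((void **)&d_chunk, bytes);
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);   // memcpyHtoD

    // One batched plan covers the whole chunk: rowsPerGpu transforms of length N,
    // stored contiguously in row-major order.
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, rowsPerGpu);
    cufftExecC2C(plan, d_chunk, d_chunk, CUFFT_FORWARD);           // kernel

    cudaMemcpy(h_chunk, d_chunk, bytes, cudaMemcpyDeviceToHost);   // memcpyDtoH

    cufftDestroy(plan);
    cudaFree(d_chunk);
}

Steps 4 and 5 then reuse the same routine on the transposed data for the column FFTs.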
I then compute the performance as
Gflops = ( 1e-9 * 5 * N * N * log2(N*N) ) / execution time
and the execution time is calculated as:
execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
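For example, with N = 8192 the operation count is 5 * 8192^2 * log2(8192^2) ≈ 8.7e9, so a summed execution time of 10 ms would report roughly 870 Gflops.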
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?
Thanks.
1 answer
If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

throughput = total operation count / wall clock time
In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) for the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than for the equivalent transform done on a single GPU.
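In code, the corrected calculation might look something like this (just a sketch; the per-GPU times and N here are made-up placeholders):

#include <math.h>
#include <stdio.h>

// Throughput for a complex N x N 2-D FFT spread over p GPUs.
// The execution time is either the measured wall-clock time for the whole
// operation or, failing that, the slowest GPU's total time -- not the sum.
double gflops_parallel(double N, const double *gpu_ms, int p, double wall_ms)
{
    double ops = 5.0 * N * N * log2(N * N);   // operation count from the question

    double max_ms = 0.0;
    for (int i = 0; i < p; ++i)
        if (gpu_ms[i] > max_ms) max_ms = gpu_ms[i];

    double exec_s = ((wall_ms > 0.0) ? wall_ms : max_ms) * 1e-3;
    return 1e-9 * ops / exec_s;
}

int main(void)
{
    // Illustrative numbers only: N = 8192, p = 4, per-GPU H2D+kernel+D2H times in ms.
    double gpu_ms[4] = { 9.2, 9.5, 9.1, 9.4 };
    printf("%.1f Gflops\n", gflops_parallel(8192.0, gpu_ms, 4, 0.0));
    return 0;
}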