Computing performance of CUFFT
I am running CUFFT on chunks of size N*N/p divided across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
- Send an N*N/p chunk to each GPU
- Run a batched 1-D FFT over the rows of each chunk on the p GPUs (a sketch of steps 1-3 follows this list)
- Copy the N*N/p chunks back to the host and transpose the entire dataset
- Repeat step 1
- Repeat step 2
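For concreteness, here is a minimal sketch of what steps 1-3 could look like for one GPU's chunk with the cuFFT API (simplified, with no error checking; the names rowsPerGpu, h_chunk, and d_chunk are placeholders, with rowsPerGpu = N/p):

#include <cufft.h>
#include <cuda_runtime.h>

// Row pass for one GPU's chunk: copy in, batched 1-D C2C FFT over all rows, copy out.
void fft_rows_on_gpu(cufftComplex *h_chunk, int N, int rowsPerGpu)
{
    size_t bytes = sizeof(cufftComplex) * (size_t)N * rowsPerGpu;
    cufftComplex *d_chunk;

    cudaMalloc((void **)&d_chunk, bytes);
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);   // memcpyHtoD

    // One batched plan covers the whole chunk: rowsPerGpu transforms of length N,
    // stored contiguously in row-major order.
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, rowsPerGpu);
    cufftExecC2C(plan, d_chunk, d_chunk, CUFFT_FORWARD);           // kernel

    cudaMemcpy(h_chunk, d_chunk, bytes, cudaMemcpyDeviceToHost);   // memcpyDtoH

    cufftDestroy(plan);
    cudaFree(d_chunk);
}

Steps 4 and 5 then reuse the same routine on the transposed data for the column FFTs.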
I then compute the performance as
Gflops = ( 1e-9 * 5 * N * N * log2(N*N) ) / execution time
and the execution time is calculated as:
execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
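For example, with N = 8192 the operation count is 5 * 8192^2 * log2(8192^2) ≈ 8.7e9, so a summed execution time of 10 ms would report roughly 870 Gflops.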
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?
Thanks.
1 answer
If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

throughput = total operation count / wall clock time
In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) for the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than for the equivalent transform done on a single GPU.
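In code, the corrected calculation might look something like this (just a sketch; the per-GPU times and N here are made-up placeholders):

#include <math.h>
#include <stdio.h>

// Throughput for a complex N x N 2-D FFT spread over p GPUs.
// The execution time is either the measured wall-clock time for the whole
// operation or, failing that, the slowest GPU's total time -- not the sum.
double gflops_parallel(double N, const double *gpu_ms, int p, double wall_ms)
{
    double ops = 5.0 * N * N * log2(N * N);   // operation count from the question

    double max_ms = 0.0;
    for (int i = 0; i < p; ++i)
        if (gpu_ms[i] > max_ms) max_ms = gpu_ms[i];

    double exec_s = ((wall_ms > 0.0) ? wall_ms : max_ms) * 1e-3;
    return 1e-9 * ops / exec_s;
}

int main(void)
{
    // Illustrative numbers only: N = 8192, p = 4, per-GPU H2D+kernel+D2H times in ms.
    double gpu_ms[4] = { 9.2, 9.5, 9.1, 9.4 };
    printf("%.1f Gflops\n", gflops_parallel(8192.0, gpu_ms, 4, 0.0));
    return 0;
}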