How to calculate the total time taken on the CPU + GPU
I am doing some computation on the CPU and then I transfer the numbers to the GPU and do some work there. I want to calculate the total time taken to do the computation on the CPU + the GPU. How do I do so?
2 Answers
When your program starts, in main(), use any system timer to record the time. When your program ends at the bottom of main(), use the same system timer to record the time. Take the difference between time2 and time1. There you go!
There are different system timers you can use, some with higher resolution than others. Rather than discuss those here, I'd suggest you search for "system timer" on the SO site. If you just want any system timer, gettimeofday() works on Linux systems, but it has been superseded by newer, higher-precision functions. As it is, gettimeofday() only measures time in microseconds, which should be sufficient for your needs.
If you can't get a timer with good enough resolution, consider running your program in a loop many times, timing the execution of the loop, and dividing the measured time by the number of loop iterations.
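For illustration, a minimal sketch of that approach on Linux using gettimeofday() might look like the following; do_cpu_work() is a hypothetical stand-in for whatever computation you are actually timing:

    #include <stdio.h>
    #include <sys/time.h>

    /* Hypothetical stand-in for the computation being timed. */
    static void do_cpu_work(void)
    {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i)
            x += i * 0.5;
    }

    int main(void)
    {
        const int iterations = 100;
        struct timeval t1, t2;

        gettimeofday(&t1, NULL);            /* time1: start of the timed region */
        for (int i = 0; i < iterations; ++i)
            do_cpu_work();
        gettimeofday(&t2, NULL);            /* time2: end of the timed region */

        double elapsed_us = (t2.tv_sec - t1.tv_sec) * 1e6 +
                            (t2.tv_usec - t1.tv_usec);
        printf("average time per iteration: %f us\n", elapsed_us / iterations);
        return 0;
    }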
EDIT:
System timers can be used to measure total application performance, including time used during the GPU calculation. Note that using system timers in this way applies only to real, or wall-clock, time, rather than process time. Measurements based on the wall-clock time must include time spent waiting for GPU operations to complete.
If you want to measure the time taken by a GPU kernel, you have a few options. First, you can use the Compute Visual Profiler to collect a variety of profiling information, and although I'm not sure that it reports time, it must be able to (that's a basic profiling function). Other profilers - PAPI comes to mind - offer support for CUDA kernels.
Another option is to use CUDA events to record times. Please refer to the CUDA 4.0 Programming Guide where it discusses using CUDA events to measure time.
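As a rough sketch of the event-based approach (not code taken from the Programming Guide), bracketing a kernel launch with CUDA events could look like this; my_kernel and its launch configuration are placeholders:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data, int n)     /* placeholder kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);                    /* record into the default stream */
        my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                   /* wait until the stop event has occurred */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);       /* elapsed time in milliseconds */
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }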
Yet another option is to use system timers wrapped around GPU kernel invocations. Note that, given the asynchronous nature of kernel launches, you will also need to follow the kernel invocation with a host-side GPU synchronization call such as cudaThreadSynchronize() for this method to be applicable. If you go with this option, I highly recommend calling the kernel in a loop, timing the loop plus one synchronization at the end (kernel calls issued to the same stream are serialized, so cudaThreadSynchronize() is not needed inside the loop), and dividing by the number of iterations.
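A sketch of that looping pattern is below, again with a placeholder my_kernel; cudaDeviceSynchronize() is used here as the current equivalent of cudaThreadSynchronize():

    #include <stdio.h>
    #include <sys/time.h>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data, int n)     /* placeholder kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main(void)
    {
        const int n = 1 << 20, iterations = 1000;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        struct timeval t1, t2;
        gettimeofday(&t1, NULL);
        for (int i = 0; i < iterations; ++i)
            my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);  /* same stream: launches are serialized */
        cudaDeviceSynchronize();              /* one synchronization after the loop */
        gettimeofday(&t2, NULL);

        double us = (t2.tv_sec - t1.tv_sec) * 1e6 + (t2.tv_usec - t1.tv_usec);
        printf("average kernel time: %f us\n", us / iterations);

        cudaFree(d_data);
        return 0;
    }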
The C timer keeps running regardless of whether the GPU is working or not. If you don't believe me, do this little experiment: make a for loop with 1000 iterations over GPU_Function_Call, and put any C timer around that for loop. When you run the program (supposing the GPU function takes substantial time, say 20 ms), you will see with the naked eye that it runs for a few seconds before it returns. But when you print the C time you'll notice it shows only a few milliseconds. This is because the C timer didn't wait for the 1000 MemcpyHtoD, 1000 MemcpyDtoH, and 1000 kernel calls.
What I suggest is to use the CUDA event timer, or even better the NVIDIA Visual Profiler, to time the GPU, and to use a stopwatch (increase the number of iterations to reduce human error) to measure the complete time. Then just subtract the GPU time from the total to get the CPU time.
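Assuming the GPU portion is bracketed with CUDA events, and substituting a wall-clock timer for the hand-held stopwatch, the subtraction could be sketched as follows; do_cpu_work() and my_kernel() are placeholders for the actual CPU and GPU work:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *d, int n)        /* placeholder GPU work */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    static void do_cpu_work(float *h, int n)          /* placeholder CPU work */
    {
        for (int i = 0; i < n; ++i) h[i] = (float)i;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t gstart, gstop;
        cudaEventCreate(&gstart);
        cudaEventCreate(&gstop);

        struct timeval t1, t2;
        gettimeofday(&t1, NULL);                      /* total (wall-clock) time starts */

        do_cpu_work(h, n);                            /* CPU part */

        cudaEventRecord(gstart, 0);                   /* GPU part, bracketed by events */
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        my_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(gstop, 0);
        cudaEventSynchronize(gstop);

        gettimeofday(&t2, NULL);                      /* total (wall-clock) time ends */

        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, gstart, gstop);
        double total_ms = (t2.tv_sec - t1.tv_sec) * 1e3 +
                          (t2.tv_usec - t1.tv_usec) / 1e3;

        printf("total: %f ms, GPU: %f ms, CPU: %f ms\n",
               total_ms, gpu_ms, total_ms - gpu_ms);

        cudaEventDestroy(gstart);
        cudaEventDestroy(gstop);
        cudaFree(d);
        free(h);
        return 0;
    }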