CPU and GPU timers in the CUDA Visual Profiler
So there are two timers in the CUDA Visual Profiler:

GPU Time: the execution time of the method on the GPU.
CPU Time: the sum of the GPU time and the CPU overhead to launch that method. At the driver-generated data level, CPU Time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but if any profiler counters are enabled, kernel launches become blocking. Asynchronous memory copy requests in different streams are non-blocking.

If I have a real program, what is its actual execution time? When I measure the time, there is a GPU timer and a CPU timer as well; what is the difference?
2 Answers
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no single right answer, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you're trying to optimize computation, but they may not include things like waiting, which can actually be quite important. You mention "the actual execution time" -- that's a start. Do you mean the complete execution time of the problem -- from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.

For numbers like that, on Unix-type systems I like to just measure the entire runtime of the program:

/bin/time myprog

Presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to the results. If you want the elapsed time of some set of computations, CUDA has the very handy cudaEvent* functions, which can be placed at various points in the code -- see the CUDA Best Practices Guide, section 2.1.2, Using CUDA GPU Timers. You can put these before and after important bits of code and print the results.
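The cudaEvent* pattern mentioned above can be sketched like this (a minimal example, assuming a trivial kernel named dummyKernel that exists only as something to time; error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only as something to time (hypothetical).
__global__ void dummyKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                 // enqueue the "start" marker
    dummyKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaEventRecord(stop);                  // enqueue the "stop" marker
    cudaEventSynchronize(stop);             // block until "stop" is reached on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Because the events are recorded in the same stream as the kernel, the elapsed time reflects GPU execution rather than launch overhead on the host.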
The GPU timer is event-based. That means that when an event is created, it is placed in a queue on the GPU to be serviced, so there is a small overhead there too. From what I have measured, though, the differences are of minor importance.