Timing kernel launches in CUDA when using Thrust



Kernel launches in CUDA are generally asynchronous, which (as I understand it) means that once a CUDA kernel is launched, control returns immediately to the CPU. The CPU then continues doing useful work while the GPU is busy number crunching, unless the CPU is forcibly stalled by a call such as cudaThreadSynchronize() or cudaMemcpy().
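For illustration, a minimal sketch of this launch-then-synchronize behavior (the kernel busyKernel and its launch configuration are hypothetical placeholders):

    #include <cuda_runtime.h>

    __global__ void busyKernel() { /* hypothetical long-running work */ }

    int main() {
        busyKernel<<<256, 256>>>();  // asynchronous: control returns to the CPU immediately
        // ... the CPU is free to do useful work here while the GPU runs ...
        cudaDeviceSynchronize();     // blocks the CPU until the kernel has finished
        return 0;
    }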

Now I have just started using the Thrust library for CUDA. Are the function calls in Thrust synchronous or asynchronous?

In other words, if I invoke thrust::sort(D.begin(), D.end()) (where D is a device vector), does it make sense to measure the sorting time using

    start = clock(); // Start

    thrust::sort(D.begin(), D.end());

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;

If the function call is asynchronous, then diff will be roughly 0 seconds for any vector (which is useless as a timing), but if it is synchronous I will indeed measure the real execution time.


韶华倾负 2024-12-21 21:53:21


Thrust calls which invoke kernels are asynchronous, just like the underlying CUDA APIs Thrust uses. Thrust calls which copy data are synchronous, again like the underlying CUDA APIs.

So your example is only measuring the kernel launch and Thrust's host-side setup overhead, not the operation itself. For timing, you can get around this by calling either cudaThreadSynchronize or cudaDeviceSynchronize (the latter in CUDA 4.0 or later) after the Thrust kernel launch. Alternatively, if you include a post-kernel-launch copy operation and record the stop time after that, your timing will include setup, execution, and copying time.

In your example this would look something like

    start = clock(); // Start

    thrust::sort(D.begin(), D.end());
    cudaThreadSynchronize(); // block until the sort kernel has finished

    diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken is: " << diff << std::endl;
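For the copy-based alternative mentioned above, a sketch assuming D is a thrust::device_vector<int> as in the question: the device-to-host thrust::copy is synchronous, so it blocks until the preceding sort has completed, and the measured interval covers setup, execution, and the copy.

    // assumes #include <thrust/host_vector.h>, <thrust/sort.h>, <thrust/copy.h>
    thrust::host_vector<int> H(D.size());

    clock_t start = clock(); // Start

    thrust::sort(D.begin(), D.end());
    thrust::copy(D.begin(), D.end(), H.begin()); // synchronous copy: blocks until the sort has finished

    double diff = (clock() - start) / (double)CLOCKS_PER_SEC;
    std::cout << "\nDevice Time taken (sort + copy) is: " << diff << std::endl;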