当前位置：文江博客话题详情

cuda busy-loop cuda-streams

摆脱异步 cuda 流执行期间的忙等待

发布于 2024-10-19 05:10:25 字数 1451 浏览 5 评论 0 原文

我正在寻找一种方法，如何在以下代码中摆脱主机线程中的忙等待（不要复制该代码，它仅显示我的问题的想法，它有许多基本错误）：

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     while (true) {
         if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
             cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
             kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
             break;
         }
         sid = ++sid % S_N;
     }

有没有

办法使主机线程空闲并以某种方式等待某个流完成，然后准备并运行另一个流？

编辑：我在代码中添加了 while(true) ，以强调忙等待。现在，我执行所有流，并检查其中哪些已完成以运行另一个新流。 cudaStreamSynchronize 等待特定流完成，但我想等待首先完成作业的任何流。

编辑2：我以如下方式摆脱了忙等待：

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

但它似乎比主机线程上忙等待的版本慢一点。我认为这是因为，现在我在流上静态分配作业，因此当一个流完成工作时，它会处于空闲状态，直到每个流完成工作为止。以前的版本动态地将工作分配到第一个空闲流，因此效率更高，但主机线程上存在忙等待。

原文

I looking for a way how to get rid of busy waiting in host thread in fallowing code (do not copy that code, it only shows an idea of my problem, it has many basic bugs):

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     while (true) {
         if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
             cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
             kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
             break;
         }
         sid = ++sid % S_N;
     }

}

Is there a way to idle host thread and wait somehow to some stream to finish, and then prepare and run another stream?

EDIT: I added while(true) into the code, to emphasize busy waiting. Now I execute all the streams, and check which of them finished to run another new one. cudaStreamSynchronize waits for particular stream to finish, but I want to wait for any of the streams which as a first finished the job.

EDIT2: I got rid of busy-waiting in fallowing way:

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

But it appears to be a little bit slower than the version with busy-waiting on host thread. I think it is because, now I statically distribute the jobs on streams, so when the one stream finishes work it is idle till each of the stream finishes the work. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy-waiting on the host thread.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

执笔绘流年 2024-10-26 05:10:25

真正的答案是使用cudaThreadSynchronize来等待所有之前的启动完成，cudaStreamSynchronize等待某个流中的所有启动完成，和cudaEventSynchronize等待仅记录特定流上的特定事件。

但是，您需要先了解流和同步的工作原理，然后才能在代码中使用它们。

如果根本不使用流会发生什么？考虑以下代码：

kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();

内核启动，主机继续同时执行 host_func1 和内核。然后，主机和设备同步，即主机等待内核完成，然后再继续执行 host_func2()。

现在，如果您有两个不同的内核怎么办？

kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);

kernel1 异步启动！主机继续前进，kernel2 在 kernel1 完成之前启动！然而，kernel2 直到 kernel1 完成之后才会执行，因为它们都已在流 0（默认流）上启动。考虑以下替代方案：

kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);

完全没有必要这样做，因为设备已经同步在同一流上启动的内核。

所以，我认为您正在寻找的功能已经存在......因为内核总是在启动之前等待同一流中的先前启动完成（即使主机经过）。也就是说，如果您想等待任何之前的启动完成，那么<em>不要使用流。这段代码可以正常工作：

for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
 }

现在，进入流。您可以使用流来管理并发设备执行。

将流视为队列。您可以将不同的 memcpy 调用和内核启动放入不同的队列中。然后，流 1 中的内核和流 2 中的启动是异步的！它们可以同时执行，也可以按任何顺序执行。如果您想确保设备上一次只执行一个 memcpy/kernel，那么不要使用流。同样，如果您希望内核按特定顺序执行，那么不要使用流。

也就是说，请记住，放入流 1 中的任何内容都是按顺序执行的，因此不必费心同步。同步用于同步主机和设备调用，而不是两个不同的设备调用。因此，如果您想同时执行多个内核，因为它们使用不同的设备内存并且彼此没有影响，那么请使用流。就像......

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
 }

不需要显式的设备同步。

The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.

However, you need to understand how streams and sychronization work before you will be able to use them in your code.

What happens if you do not use streams at all? Consider the following code:

kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();

The kernel is launched and the host moves on to execute host_func1 and kernel concurrently. Then, the host and the device are synchronized, ie the host waits for kernel to finish before moving on to host_func2().

Now, what if you have two different kernels?

kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);

kernel1 is launched asychronously! the host moves on, and kernel2 is launched before kernel1 finishes! however, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:

kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);

There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.

So, I think that the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host passes by). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:

for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
 }

Now, on to streams. you can use streams to manage concurrent device execution.

Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, kernels in stream 1 and launches in stream 2 are asynchronous! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.

That said, keep in mind that anything put into a stream 1, is executed in order, so don't bother synchronizing. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
 }

No explicit device synchronization necessary.

回复收藏 0 原文

a√萤火虫的光℡ 2024-10-26 05:10:25

我解决该问题的想法是每个流都有一个主机线程。该主机线程将调用 cudaStreamSynchronize 来等待流命令完成。
不幸的是，这在 CUDA 3.2 中是不可能的，因为它只允许一个主机线程处理一个 CUDA 上下文，这意味着每个启用 CUDA 的 GPU 都有一个主机线程。

希望在 CUDA 4.0 中这是可能的： CUDA 4.0 RC 新闻

编辑：我已经在 CUDA 4.0 RC 中使用 open mp 进行了测试。我为每个 cuda 流创建了一个主机线程。它开始起作用了。

回复收藏 0 原文

爱给你人给你 2024-10-26 05:10:25

有：cudaEventRecord(event,stream)和cudaEventSynchronize(event)。参考手册 http://developer.download.nvidia。 com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf 包含所有详细信息。

编辑：顺便说一句，流对于并发执行内核和内存传输很方便。为什么要通过等待当前流完成来序列化执行？

回复收藏 0 原文

姐不稀罕 2024-10-26 05:10:25

您需要 cudaStreamSynchronize 而不是 cudaStreamQuery

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaStreamSynchronize(streams[sid]);
     cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
}

（您还可以使用 cudaThreadSynchronize 等待所有流的启动，并使用 cudaEventSynchronize 等待事件以实现更高级的主机/设备同步。）

您可以进一步控制使用这些同步函数发生的等待类型。查看 cudaDeviceBlockingSync 标志和其他标志的参考手册。不过，默认值可能就是您想要的。

Instead of cudaStreamQuery, you want cudaStreamSynchronize

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaStreamSynchronize(streams[sid]);
     cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
}

(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)

You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.

回复收藏 0 原文

策马西风 2024-10-26 05:10:25

您需要复制数据块并在不同的for 循环中对该数据块执行内核。这样效率会更高。

像这样：

size = N*sizeof(float)/nStreams;

for (i=0; i<nStreams; i++){
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}


for (i=0; i<nStreams; i++){
    offset = i*N/nStreams;
    kernel<<<N(nThreads*nStreams), nThreads, 0, stream[i]>>> (a_d+offset);
}

通过这种方式，内存复制不必等待内核执行前一个流，反之亦然。

You need to copy the data-chunk and execute kernel on that data-chunk in different for loops. That'll be more efficient.

like this:

size = N*sizeof(float)/nStreams;

for (i=0; i<nStreams; i++){
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}


for (i=0; i<nStreams; i++){
    offset = i*N/nStreams;
    kernel<<<N(nThreads*nStreams), nThreads, 0, stream[i]>>> (a_d+offset);
}

In this way the memory copy doesn't have to wait for kernel execution of previous stream and vice versa.

回复收藏 0 原文

~没有更多了~