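CUDA 4.0 RC - many host threads per GPU - cudaStreamQuery and cudaStreamSynchronize behaviour
I wrote some code that uses many host (OpenMP) threads per GPU. Each thread has its own CUDA stream to order its requests. It looks very similar to the code below: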
#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    while (hasJob()) {
        //... code to prepare job - dData, hData, dataSize etc
        cudaError_t streamStatus = cudaStreamQuery(stream);
        if (streamStatus == cudaSuccess) {
            cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
            doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
            CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}
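Everything was fine until I got many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which I did not expect because I call cudaStreamSynchronize. Until now I thought that cudaStreamQuery would always return cudaSuccess when called after cudaStreamSynchronize. Unfortunately, it turned out that cudaStreamSynchronize may return even while cudaStreamQuery still returns cudaErrorNotReady.
I changed the code to the following and everything works correctly.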
#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    while (hasJob()) {
        //... code to prepare job - dData, hData, dataSize etc
        cudaError_t streamStatus;
        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
            cudaStreamSynchronize();
        }
        if (streamStatus == cudaSuccess) {
            cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
            doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        } else {
            CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}
So my question is: is it a bug or a feature?
EDIT: it is similar to this Java idiom:
synchronized (lock) {   // lock: placeholder monitor object
    while (waitCondition) {
        lock.wait();
    }
}
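A minimal sketch for checking the reported behaviour in isolation. It issues many small jobs per stream and counts how often cudaStreamQuery reports cudaErrorNotReady immediately after cudaStreamSynchronize has returned cudaSuccess. The kernel body, buffer size, stream count, job count and the CUDA_CHECK definition are assumptions made for illustration; the question does not show them.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the CUDA_CHECK macro used in the question.
#define CUDA_CHECK(expr)                                             \
    do {                                                             \
        cudaError_t err_ = (expr);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Placeholder kernel (assumption): adds 1 to every element.
__global__ void doTheJob(float* data, size_t count) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1.0f;
}

// Runs jobCount small jobs on one private stream and returns how many times
// the stream was reported as not ready right after a successful synchronize.
int countNotReadyAfterSync(int jobCount) {
    const size_t count = 256;
    const size_t bytes = count * sizeof(float);
    cudaStream_t stream;
    float* hData = NULL;
    float* dData = NULL;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMallocHost((void**)&hData, bytes));  // pinned host memory
    CUDA_CHECK(cudaMalloc((void**)&dData, bytes));
    for (size_t i = 0; i < count; ++i) hData[i] = 0.0f;

    int notReady = 0;
    for (int job = 0; job < jobCount; ++job) {
        CUDA_CHECK(cudaMemcpyAsync(dData, hData, bytes,
                                   cudaMemcpyHostToDevice, stream));
        doTheJob<<<1, 256, 0, stream>>>(dData, count);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaStreamSynchronize(stream));
        cudaError_t status = cudaStreamQuery(stream);
        if (status == cudaErrorNotReady) {
            ++notReady;           // the case described in the question
        } else {
            CUDA_CHECK(status);   // any other error is fatal here
        }
    }
    CUDA_CHECK(cudaFree(dData));
    CUDA_CHECK(cudaFreeHost(hData));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return notReady;
}

int main() {
    const int STREAM_NUMBER = 4;        // assumed value
    const int JOBS_PER_STREAM = 10000;  // assumed value
    int total = 0;
    // One host thread per stream, as in the question. Without OpenMP enabled
    // the pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for num_threads(STREAM_NUMBER) reduction(+ : total)
    for (int sid = 0; sid < STREAM_NUMBER; ++sid) {
        total += countNotReadyAfterSync(JOBS_PER_STREAM);
    }
    std::printf("cudaErrorNotReady after synchronize: %d of %d jobs\n",
                total, STREAM_NUMBER * JOBS_PER_STREAM);
    return 0;
}
```

With nvcc, OpenMP typically has to be enabled through the host compiler (for example by passing its OpenMP flag via -Xcompiler). A non-zero count would match the behaviour described above; a count of zero on a given toolkit and driver only means the effect was not observed in that run.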
Comments (2)
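What is under "//... code to prepare job"? Do you have any calls like cudaMemcpyAsync there, or is the only memory transfer the one in the code you have shown? These asynchronous functions may return early, before the data has reached its destination. When that happens, cudaStreamQuery will return cudaSuccess only once the memory transfers have finished successfully. Also, does hasJob() use any of the host CUDA functions?
If I am not mistaken, within a single stream a kernel and a memory transfer cannot execute at the same time. Therefore, calling cudaStreamQuery is necessary only when a kernel depends on data transferred by a different stream.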
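To illustrate the point about ordering within one stream, here is a small sketch: a host-to-device copy, a kernel and a device-to-host copy are all issued into the same stream with no query in between, followed by a single synchronize. Operations in one stream run in issue order, so the kernel sees the copied data. The kernel addOne, the helper name roundTripMismatches and the buffer size are made up for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro, defined here so the sketch is complete.
#define CUDA_CHECK(expr)                                             \
    do {                                                             \
        cudaError_t err_ = (expr);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Illustrative kernel: increments every element by one.
__global__ void addOne(int* data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1;
}

// Copies count ints to the device, increments them and copies them back,
// all in one stream. Returns the number of elements with an unexpected value.
int roundTripMismatches(int count) {
    const size_t bytes = (size_t)count * sizeof(int);
    cudaStream_t stream;
    int* hIn = NULL;
    int* hOut = NULL;
    int* dData = NULL;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMallocHost((void**)&hIn, bytes));   // pinned host buffers
    CUDA_CHECK(cudaMallocHost((void**)&hOut, bytes));
    CUDA_CHECK(cudaMalloc((void**)&dData, bytes));
    for (int i = 0; i < count; ++i) {
        hIn[i] = i;
        hOut[i] = -1;
    }

    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    // All three operations go into the same stream and run in issue order.
    CUDA_CHECK(cudaMemcpyAsync(dData, hIn, bytes,
                               cudaMemcpyHostToDevice, stream));
    addOne<<<blocks, threads, 0, stream>>>(dData, count);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(hOut, dData, bytes,
                               cudaMemcpyDeviceToHost, stream));
    // Only the host has to wait before reading hOut.
    CUDA_CHECK(cudaStreamSynchronize(stream));

    int mismatches = 0;
    for (int i = 0; i < count; ++i) {
        if (hOut[i] != i + 1) ++mismatches;
    }
    CUDA_CHECK(cudaFree(dData));
    CUDA_CHECK(cudaFreeHost(hOut));
    CUDA_CHECK(cudaFreeHost(hIn));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return mismatches;
}

int main() {
    std::printf("mismatches: %d\n", roundTripMismatches(1 << 16));
    return 0;
}
```

If the kernel instead consumed data copied in a different stream, some cross-stream dependency (for example an event, or a host-side wait) would be needed before the launch.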
I didn't notice this earlier: cudaStreamSynchronize() should take a parameter (stream). I am not sure which stream you are synchronising when the parameter is omitted; it could be that it defaults to stream 0.
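For reference, a sketch of the wait loop from the question with the stream passed explicitly. As far as I know, the runtime API declares cudaStreamSynchronize with a required cudaStream_t argument and no default, so the argument-less call would not be expected to compile as written; treat that as my reading rather than a confirmed fact about CUDA 4.0 RC. The helper name waitUntilIdle and the empty kernel are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: blocks until the given stream reports no pending work.
// Returns cudaSuccess once the stream is idle, or the first error seen.
cudaError_t waitUntilIdle(cudaStream_t stream) {
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        // Pass the stream explicitly; passing 0 would target the default stream.
        cudaError_t syncStatus = cudaStreamSynchronize(stream);
        if (syncStatus != cudaSuccess) return syncStatus;
    }
    return status;
}

// Empty placeholder kernel so that the stream has something to wait for.
__global__ void emptyJob() {}

int main() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        std::fprintf(stderr, "cudaStreamCreate failed\n");
        return EXIT_FAILURE;
    }
    emptyJob<<<1, 1, 0, stream>>>();
    cudaError_t status = waitUntilIdle(stream);
    std::printf("stream status after wait: %s\n", cudaGetErrorString(status));
    cudaStreamDestroy(stream);
    return status == cudaSuccess ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The loop keeps the shape of the workaround in the question: it re-checks the stream status after every synchronize instead of assuming that one synchronize is enough.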