Is there a way to block some blocks until a condition is met?
I want to block some blocks until a variable is set to a particular value, so I wrote this code to test whether a simple do-while loop would work.
    __device__ int tag = 0;

    __global__ void kernel() {
        if ( threadIdx.x == 0 ) {
            volatile int v;
            do {
                v = tag;
            }
            while ( v == 0 );
        }
        __syncthreads();
        return ;
    }
However, it doesn't work: no infinite loop occurs, which is very strange.
Is there any other way to block some blocks until a condition is satisfied, or would some change to this code make it work?
Comments (2)
There is currently no reliable way to perform inter-block synchronization in CUDA.
There are hacky ways to achieve some manner of locking or blocking between blocks when the total number of threads is modest, but they exploit undefined behaviour in the execution model, which is not guaranteed to be the same on all hardware or to keep working in the future. The only reliable way to ensure synchronization or blocking between blocks is to use separate kernel launches. If you can't make your algorithm work without inter-block synchronization, you either need a new algorithm, or your application is a very poor fit for the GPU architecture.
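As a minimal sketch of the "separate kernel launches" approach, the code below splits the work into two kernels so that the launch boundary acts as the barrier between blocks. The names `phase1`, `phase2` and `sumAcrossBlocks` are hypothetical and chosen only for illustration. The sketch relies on kernels issued to the same stream running in launch order, and on `cudaMemcpy` waiting for them to finish.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Phase 1: each block publishes one value (here just blockIdx.x + 1).
__global__ void phase1(int *partial) {
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = blockIdx.x + 1;
    }
}

// Phase 2: launched after phase 1 in the same stream, so it only starts
// once every block of phase 1 has completed and its writes are visible.
__global__ void phase2(const int *partial, int numBlocks, int *total) {
    int sum = 0;
    for (int i = 0; i < numBlocks; ++i) {
        sum += partial[i];
    }
    *total = sum;
}

// Hypothetical host helper: returns the sum, or -1 if a CUDA call fails.
int sumAcrossBlocks(int numBlocks) {
    int *partial = nullptr;
    int *total = nullptr;
    int hostTotal = -1;

    if (cudaMalloc(&partial, numBlocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&total, sizeof(int)) != cudaSuccess) {
        cudaFree(partial);
        return -1;
    }

    phase1<<<numBlocks, 128>>>(partial);
    phase2<<<1, 1>>>(partial, numBlocks, total);  // launch boundary = barrier

    // cudaMemcpy waits for the preceding kernels before copying.
    if (cudaMemcpy(&hostTotal, total, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        hostTotal = -1;
    }

    cudaFree(partial);
    cudaFree(total);
    return hostTotal;
}

int main() {
    // On success this prints 1 + 2 + ... + 64 = 2080.
    printf("total = %d\n", sumAcrossBlocks(64));
    return 0;
}
```

No block ever spins waiting on another block, so the result does not depend on how many blocks the hardware can keep resident at once.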
Here is a hackish approach I tried, to see whether it would work.
I cannot guarantee this will always work. On my card (320M) the breaking point seems to be num = 5796. Perhaps it is some kind of hardware limit that differs from card to card?
EDIT
The answer is that n * (n + 1) / 2 > 2^24 for n > 5795, and 2^24 is the limit up to which single precision can represent every integer exactly. Beyond that point, integer values are not guaranteed to be exact. Thanks to talonmies for pointing it out.
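The single-precision limit can be checked on the host without a GPU. This is a small sketch, not the original test code (which is not shown here), and the helper names `triangularExact` and `absorbsPlusOne` are hypothetical. It shows that a `float` cannot tell 2^24 apart from 2^24 + 1, and that the exact sum 1 + ... + 5796 is already above 2^24.

```cuda
#include <cstdio>

// Exact value of 1 + 2 + ... + n, computed in 64-bit integer arithmetic.
long long triangularExact(int n) {
    return (long long)n * (n + 1) / 2;
}

// True if adding 1.0f to x yields x again, i.e. the float spacing at x
// is already larger than 1. volatile forces rounding to float precision.
bool absorbsPlusOne(float x) {
    volatile float next = x + 1.0f;
    return next == x;
}

int main() {
    const float limit = 16777216.0f;  // 2^24

    // 2^24 + 1 is not representable in single precision.
    printf("2^24 + 1 == 2^24 in float: %s\n",
           absorbsPlusOne(limit) ? "yes" : "no");

    // The exact sum for num = 5796 is 16799706, which exceeds 2^24.
    printf("exact sum for 5796: %lld (2^24 = %lld)\n",
           triangularExact(5796), 1LL << 24);
    return 0;
}
```

Accumulating such a count in an `int` (or `unsigned int`) rather than a `float` avoids the rounding at this magnitude.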
--
I edited my earlier code, which used just one block. This version is more representative of a real-world threads/blocks configuration (the memory accesses are weird and will be very slow, but they were written that way to quickly port my old test code to multiple threads).
It looks like there are some cases where you can synchronize across blocks, but it mostly depends on knowing certain things beforehand (in this particular case, I was only syncing n - 1 blocks before performing an insanely useless count on the last block).
This is a proof of concept only; do not take the code seriously.