CUDA multiple memory access
Please give me some explanation how a memory access works in the following kernel:
__global__ void kernel(float4 *a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 reg1, reg2;
    reg1 = a[tid];   // each thread reads a unique memory location

    for (int i = 0; i < totalThreadsNumber; i++)
    {
        reg2 = a[i]; // all running threads start reading
                     // the same global memory location
        // some computations
    }

    for (int i = 0; i < totalThreadsNumber; i++)
    {
        a[i] = reg1; // all running threads start writing
                     // to the same global memory location
                     // race condition
    }
}
How does it work in the first loop? Is there some serialization? I assume that the second loop causes thread serialization (only within a warp?) and that the result is undefined.
Comments (1)
Keeping my explanation to Fermi (sm_2x); on older hardware, memory accesses are per half-warp instead.
In the first loop (reading), the whole warp reads from the same address into a local variable. This results in a "broadcast". Since Fermi has an L1 cache, either one cache line will be loaded or the data will be fetched directly from the cache (on subsequent iterations). In other words, there is no serialisation.
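The two read patterns in the original kernel can be sketched side by side. This is an illustrative kernel (the name `accessPatterns` and the `out` parameter are mine, not from the question); the comments describe the behaviour on Fermi-class and later hardware as explained above:

```cuda
__global__ void accessPatterns(const float4 *a, float4 *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read: consecutive threads touch consecutive float4
    // elements, so a warp's 32 requests collapse into a few wide
    // memory transactions.
    float4 mine = a[tid];

    // Broadcast read: every thread in the warp loads the same address.
    // The hardware serves this with a single cache-line load (or a hit
    // in the L1 cache on later iterations) -- no serialisation occurs.
    float4 first = a[0];

    out[tid] = make_float4(mine.x + first.x, mine.y + first.y,
                           mine.z + first.z, mine.w + first.w);
}
```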
In the second loop (writing), which thread wins is undefined. As in any multi-threaded programming model, if multiple threads write to the same location, the programmer is responsible for understanding the race conditions. You have no control over which warp in the block executes last, nor over which thread within that last warp completes the write, so you cannot predict what the final value will be.
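If a deterministic result is needed, the usual fix is to make the shared write unambiguous, for example by electing a single thread to perform it. A minimal sketch of that pattern (`writeOnce` and `n` are illustrative names, not part of the original kernel):

```cuda
__global__ void writeOnce(float4 *a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 reg1 = a[tid]; // each thread still reads its own element

    // Only thread 0 performs the shared writes, so no two threads
    // race on the same location and the final contents of a[] are
    // well defined (every element becomes thread 0's reg1).
    if (tid == 0) {
        for (int i = 0; i < n; i++)
            a[i] = reg1;
    }
}
```

Atomics (`atomicExch` and friends) are another option when several threads genuinely must update one location, though they serialise the conflicting accesses rather than eliminate them.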