CUDA multiple memory access
Please give me some explanation how a memory access works in the following kernel:
__global__ void kernel(float4 *a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 reg1, reg2;
    reg1 = a[tid];   // each thread reads a unique memory location

    for (int i = 0; i < totalThreadsNumber; i++)
    {
        reg2 = a[i]; // all running threads start reading
                     // the same global memory location
        // some computations
    }

    for (int i = 0; i < totalThreadsNumber; i++)
    {
        a[i] = reg1; // all running threads start writing
                     // to the same global memory location
                     // race condition
    }
}
How does it work in the first loop? Is there some serialization? I assume that the second loop causes thread serialization (only within a warp?) and that the result is undefined.
Comments (1)
Keeping my explanation to Fermi (sm_2x); on older hardware, memory accesses are per half-warp instead.
In the first loop (reading), the whole warp reads from the same address into a local variable. This results in a "broadcast". Since Fermi has an L1 cache, either one cache line will be loaded or the data will be fetched directly from the cache (on subsequent iterations). In other words, there is no serialisation.
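The two read patterns in the original kernel can be sketched side by side. This is an illustrative kernel (the name `accessPatterns` and the `out` parameter are mine, not from the question); the comments describe the behaviour on Fermi-class and later hardware as explained above:

```cuda
__global__ void accessPatterns(const float4 *a, float4 *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read: consecutive threads touch consecutive float4
    // elements, so a warp's 32 requests collapse into a few wide
    // memory transactions.
    float4 mine = a[tid];

    // Broadcast read: every thread in the warp loads the same address.
    // The hardware serves this with a single cache-line load (or a hit
    // in the L1 cache on later iterations) -- no serialisation occurs.
    float4 first = a[0];

    out[tid] = make_float4(mine.x + first.x, mine.y + first.y,
                           mine.z + first.z, mine.w + first.w);
}
```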
In the second loop (writing), which thread wins is undefined. As in any multi-threaded programming model, if multiple threads write to the same location, the programmer is responsible for understanding the race conditions. You have no control over which warp in the block executes last, nor over which thread within that last warp completes the write, so you cannot predict what the final value will be.
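If a deterministic result is needed, the usual fix is to make the shared write unambiguous, for example by electing a single thread to perform it. A minimal sketch of that pattern (`writeOnce` and `n` are illustrative names, not part of the original kernel):

```cuda
__global__ void writeOnce(float4 *a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 reg1 = a[tid]; // each thread still reads its own element

    // Only thread 0 performs the shared writes, so no two threads
    // race on the same location and the final contents of a[] are
    // well defined (every element becomes thread 0's reg1).
    if (tid == 0) {
        for (int i = 0; i < n; i++)
            a[i] = reg1;
    }
}
```

Atomics (`atomicExch` and friends) are another option when several threads genuinely must update one location, though they serialise the conflicting accesses rather than eliminate them.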