CUDA kernel for Conway's Game of Life
I'm trying to count the transitions that occur when Conway's Game of Life (GOL) runs on a p x q matrix for n iterations. For instance, given 1 iteration with an initial state of a single blinker, there would be 5 transitions (2 births, 1 survival, 2 deaths from underpopulation). I've already got this working, but I'd like to convert this logic to run using CUDA. Below is what I want to port to CUDA.
Code:
static void gol() // call this once per iteration
{
    // grid holds the current state (the initial conditions on the first call);
    // tempGrid starts as a copy of it and receives the next generation.
    int[] tempGrid = new int[rows * cols];
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            tempGrid[i * cols + j] = grid[i * cols + j];
        }
    }
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            int numNeighbors = neighbors(i, j); // number of live neighbors
            if (grid[i * cols + j] == 1 && numNeighbors > 3)
            {
                tempGrid[i * cols + j] = 0; // dies: more than 3 neighbors
                overcrowding++;
            }
            else if (grid[i * cols + j] == 1 && numNeighbors < 2)
            {
                tempGrid[i * cols + j] = 0; // dies: fewer than 2 neighbors
                underpopulation++;
            }
            else if (grid[i * cols + j] == 1 && numNeighbors > 1)
            {
                tempGrid[i * cols + j] = 1; // lives on: 2 or 3 neighbors
                survival++;
            }
            else if (grid[i * cols + j] == 0 && numNeighbors == 3)
            {
                tempGrid[i * cols + j] = 1; // born: exactly 3 neighbors
                birth++;
            }
        }
    }
    grid = tempGrid;
}
Comments (2)
Your main slowdown is going to be main (global) memory access, so I'd suggest that you pick a largish thread block size based on the hardware you have available. 256 threads (16x16) is a good choice for cross-hardware compatibility. Each of those thread blocks is going to calculate the results for a slightly smaller section of the board: if you use 16x16, each block calculates the results for a 14x14 section, since there is a one-element border on every side. (The reason to use a 16x16 block to calculate a 14x14 chunk rather than a 16x16 chunk is memory read coalescing.)
Divide the board up into (say) 14x14 chunks; that is your grid (organized however you see fit, but most likely something like board_width / 14 by board_height / 14 blocks, rounded up). Within the kernel, have each thread load its element into shared memory, then call __syncthreads(). Then have the middle 14x14 threads calculate the new values (using the values stored in shared memory) and write them back into global memory. The use of shared memory helps minimize global reads and writes. This is also the reason to have your thread block size as big as possible: the edges and corners are "wasted" global memory accesses, since the values fetched there only get used 1 or 3 times, not 9 times.
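The tiling scheme described above could be sketched roughly as follows. This is only an illustration under several assumptions, not a tuned implementation: the names `golStep`, `golStepHost`, `TILE`, `INNER` and the `counts` array are made up for this sketch, cells outside the board are treated as dead (the question's `neighbors` function is not shown, so its boundary handling is unknown), and the four counters are updated with plain global `atomicAdd` calls for simplicity.

```cuda
#include <cuda_runtime.h>

#define TILE 16            // threads per block edge (16x16 = 256 threads)
#define INNER (TILE - 2)   // cells per block edge that get updated (14x14)

// Hypothetical counter slots mirroring the counters in the question.
enum { OVERCROWDING = 0, UNDERPOPULATION = 1, SURVIVAL = 2, BIRTH = 3 };

__global__ void golStep(const int* in, int* out, int rows, int cols,
                        unsigned int* counts)
{
    __shared__ int tile[TILE][TILE];

    // Global cell for this thread; neighboring blocks overlap by the border.
    int col = blockIdx.x * INNER + threadIdx.x - 1;
    int row = blockIdx.y * INNER + threadIdx.y - 1;
    bool inside = row >= 0 && row < rows && col >= 0 && col < cols;

    // Every thread loads one element; off-board cells count as dead (assumption).
    tile[threadIdx.y][threadIdx.x] = inside ? in[row * cols + col] : 0;
    __syncthreads();

    // Only the middle 14x14 threads compute; border threads were loaders only.
    bool interior = threadIdx.x >= 1 && threadIdx.x <= INNER &&
                    threadIdx.y >= 1 && threadIdx.y <= INNER;
    if (!interior || !inside) return;

    int n = 0;  // live neighbors, read from shared memory
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dy != 0 || dx != 0)
                n += tile[threadIdx.y + dy][threadIdx.x + dx];

    int alive = tile[threadIdx.y][threadIdx.x];
    int next = alive;
    if (alive && n > 3)      { next = 0; atomicAdd(&counts[OVERCROWDING], 1u); }
    else if (alive && n < 2) { next = 0; atomicAdd(&counts[UNDERPOPULATION], 1u); }
    else if (alive)          { next = 1; atomicAdd(&counts[SURVIVAL], 1u); }
    else if (n == 3)         { next = 1; atomicAdd(&counts[BIRTH], 1u); }
    out[row * cols + col] = next;
}

// Host helper (hypothetical): runs one generation and returns the four counts.
// Error checking is omitted to keep the sketch short.
void golStepHost(const int* h_in, int* h_out, int rows, int cols,
                 unsigned int h_counts[4])
{
    size_t bytes = (size_t)rows * cols * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    unsigned int* d_counts = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_counts, 4 * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, 4 * sizeof(unsigned int));

    dim3 block(TILE, TILE);
    dim3 grid((cols + INNER - 1) / INNER, (rows + INNER - 1) / INNER);
    golStep<<<grid, block>>>(d_in, d_out, rows, cols, d_counts);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_counts, d_counts, 4 * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_counts);
}
```

For n iterations, one option is to keep both buffers on the device, launch the kernel n times while swapping the `in` and `out` pointers between launches, and copy the counters back once at the end so they accumulate across generations. If the atomics turn out to be a bottleneck, a per-block reduction in shared memory followed by one `atomicAdd` per counter per block would likely reduce contention.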
Here's one way you could proceed: