CUDA kernel for Conway's Game of Life

Posted 2024-10-07 08:56:59

I'm trying to calculate the number of transitions made in a run of Conway's Game of Life (GOL) on a p x q matrix over n iterations. For instance, given 1 iteration with the initial state being a single blinker (three live cells in a row), there would be 5 transitions (2 births, 1 survival, 2 deaths from underpopulation). I already have this working, but I'd like to convert the logic to run on CUDA. Below is what I want to port to CUDA.

Code:

    static void gol() // call once per iteration
    {
        int[] tempGrid = new int[rows * cols]; // grid holds init conditions
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                tempGrid[i * cols + j] = grid[i * cols + j];
            }
        }

        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                int numNeighbors = neighbors(i, j); // finds # of neighbors

                if (grid[i * cols + j] == 1 && numNeighbors > 3)
                {
                    tempGrid[i * cols + j] = 0;
                    overcrowding++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors < 2)
                {
                    tempGrid[i * cols + j] = 0;
                    underpopulation++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors > 1)
                {
                    tempGrid[i * cols + j] = 1;
                    survival++;
                }
                else if (grid[i * cols + j] == 0 && numNeighbors == 3)
                {
                    tempGrid[i * cols + j] = 1;
                    birth++;
                }
            }
        }

        grid = tempGrid;
    }

葵雨 2024-10-14 08:56:59

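Your main slowdown is going to be main (global) memory access, so I'd suggest choosing a large thread-block size based on the hardware you have available. 256 threads (16x16) is a good choice for compatibility across hardware. Each of those thread blocks computes the results for a slightly smaller section of the board: with 16x16 threads, a block computes a 14x14 section, because of the one-element border around it. (The reason to use a 16x16 block to compute a 14x14 chunk, rather than a 16x16 chunk, is memory read coalescing.)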

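Divide the board into (say) 14x14 chunks; that's your grid. Organize it however you see fit, but most likely it will be something like board_width / 14 by board_height / 14 blocks (rounded up if the board size is not a multiple of 14).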
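To make the grid sizing concrete, here is a minimal sketch in CUDA C++. The names `kBlockDim`, `kTileDim`, `tilesNeeded` and `golLaunchGrid` are made up for illustration, and rounding up (so that a board whose size is not a multiple of 14 is still fully covered) is an assumption layered on the plain division above.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockDim = 16;            // threads per block side (16x16 = 256 threads)
constexpr int kTileDim = kBlockDim - 2;  // 14: cells per side left after the one-cell border

// Number of 14-cell tiles needed to cover `cells` cells, rounding up.
inline unsigned int tilesNeeded(int cells)
{
    return static_cast<unsigned int>((cells + kTileDim - 1) / kTileDim);
}

// Grid dimensions (in blocks) for a board of boardWidth x boardHeight cells.
inline dim3 golLaunchGrid(int boardWidth, int boardHeight)
{
    return dim3(tilesNeeded(boardWidth), tilesNeeded(boardHeight));
}
```

With these helpers, a 100x30 board would launch an 8x3 grid of 16x16 blocks; the blocks along the right and bottom edges are only partly filled, so the kernel has to bounds-check.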

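Within the kernel, have each thread load its element into shared memory, then call __syncthreads(). Then have the middle 14x14 threads compute the new value (using the values stored in shared memory) and write it back to global memory. Using shared memory this way minimizes global reads and writes. It is also the reason to make the thread-block size as large as possible: the edges and corners are "wasted" global memory accesses, since the values fetched there are only used 1 or 3 times instead of 9.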
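Putting those steps together, a kernel along these lines might work. Treat it as a sketch under stated assumptions rather than a tuned implementation: the names `golStep`, `golStepOnDevice` and the layout of `counts` are hypothetical; cells outside the board are treated as dead (the question's `neighbors()` is not shown, so its edge handling may differ); and the four counters are accumulated with `atomicAdd`, which is simple but not the fastest way to reduce them.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 16;         // threads per block side
constexpr int TILE  = BLOCK - 2;  // 14x14 cells updated per block

// Slots in the counter array (naming is this sketch's own).
enum { OVERCROWDING = 0, UNDERPOPULATION = 1, SURVIVAL = 2, BIRTH = 3 };

// Computes one generation from `in` into `out` and tallies transitions.
__global__ void golStep(const int* in, int* out, int rows, int cols,
                        unsigned int* counts)
{
    __shared__ int tile[BLOCK][BLOCK];

    // Blocks advance in steps of TILE so neighbouring blocks overlap by the border.
    int col = blockIdx.x * TILE + threadIdx.x - 1;
    int row = blockIdx.y * TILE + threadIdx.y - 1;
    bool onBoard = row >= 0 && row < rows && col >= 0 && col < cols;

    // Every thread loads one cell, border included; off-board cells count as dead.
    tile[threadIdx.y][threadIdx.x] = onBoard ? in[row * cols + col] : 0;
    __syncthreads();

    // Only the interior 14x14 threads compute a result.
    bool interior = threadIdx.x >= 1 && threadIdx.x <= TILE &&
                    threadIdx.y >= 1 && threadIdx.y <= TILE;
    if (!interior || !onBoard) return;

    int n = 0;  // live neighbours, read from shared memory only
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dy != 0 || dx != 0)
                n += tile[threadIdx.y + dy][threadIdx.x + dx];

    int alive = tile[threadIdx.y][threadIdx.x];
    int next = alive;
    if (alive && n > 3)      { next = 0; atomicAdd(&counts[OVERCROWDING], 1u); }
    else if (alive && n < 2) { next = 0; atomicAdd(&counts[UNDERPOPULATION], 1u); }
    else if (alive)          { next = 1; atomicAdd(&counts[SURVIVAL], 1u); }
    else if (n == 3)         { next = 1; atomicAdd(&counts[BIRTH], 1u); }
    out[row * cols + col] = next;
}

// Host helper: runs one generation on hostGrid in place and adds the
// transition counts into hostCounts[4]. Error checking is omitted for brevity.
void golStepOnDevice(int* hostGrid, int rows, int cols, unsigned int hostCounts[4])
{
    size_t bytes = sizeof(int) * rows * cols;
    int *dIn, *dOut;
    unsigned int* dCounts;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dCounts, 4 * sizeof(unsigned int));
    cudaMemcpy(dIn, hostGrid, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dCounts, hostCounts, 4 * sizeof(unsigned int), cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    golStep<<<grid, block>>>(dIn, dOut, rows, cols, dCounts);

    cudaMemcpy(hostGrid, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hostCounts, dCounts, 4 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dCounts);
}
```

For n iterations you would probably keep both buffers on the device and swap `dIn` and `dOut` between launches, copying the grid and the counters back only once at the end.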
