CUDA kernel for Conway's Game of Life

Posted 2024-10-07 08:56:59

I'm trying to count the transitions that occur when Conway's Game of Life (GOL) is run on a p x q matrix for n iterations. For instance, given 1 iteration with an initial state of one blinker (as below), there would be 5 transitions (2 births, 1 survival, 2 deaths from underpopulation). I already have this working on the CPU, but I'd like to convert the logic to run using CUDA. Below is what I want to port.

[Image: the blinker's initial state]

Code:

    static void gol() // call this once per iteration
    {
        int[] tempGrid = new int[rows * cols]; // next generation; grid holds the current state
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                tempGrid[i * cols + j] = grid[i * cols + j];
            }
        }

        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                int numNeighbors = neighbors(i, j); // finds # of neighbors

                if (grid[i * cols + j] == 1 && numNeighbors > 3)
                {
                    tempGrid[i * cols + j] = 0;
                    overcrowding++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors < 2)
                {
                    tempGrid[i * cols + j] = 0;
                    underpopulation++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors > 1) // i.e. 2 or 3; >3 and <2 were handled above
                {
                    tempGrid[i * cols + j] = 1;
                    survival++;
                }
                else if (grid[i * cols + j] == 0 && numNeighbors == 3)
                {
                    tempGrid[i * cols + j] = 1;
                    birth++;
                }
            }
        }

        grid = tempGrid;
    }

Comments (2)

葵雨 2024-10-14 08:56:59

Your main slowdown is going to be main memory access, so I'd suggest picking a largish thread block size based on the hardware you have available. 256 threads (16x16) is a good choice for cross-hardware compatibility. Each thread block calculates the results for a slightly smaller section of the board: with a 16x16 block, it computes a 14x14 section, because the outermost ring of threads only loads the one-element border that the inner cells need as neighbors. (The reason to use a 16x16 block to calculate a 14x14 chunk rather than a 16x16 chunk is for memory read coalescing.)

Divide the board up into (say) 14x14 chunks; that is your grid (organized however you see fit, but most likely something like board_width / 14 by board_height / 14, rounded up if the board size is not a multiple of 14).
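As a rough illustration of that launch configuration, here is a minimal sketch. The names `BLOCK`, `TILE`, and `makeGrid` are hypothetical, and the round-up division is an assumption so that boards whose sides are not multiples of 14 are still fully covered.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 16;         // threads per block side (16x16 = 256 threads)
constexpr int TILE  = BLOCK - 2;  // 14: cells each block actually computes per side

// Hypothetical helper: number of blocks needed to cover the board.
// Rounds up so partial tiles at the right and bottom edges are included.
inline dim3 makeGrid(int board_width, int board_height)
{
    return dim3((board_width  + TILE - 1) / TILE,
                (board_height + TILE - 1) / TILE);
}
```

A kernel would then be launched with `makeGrid(board_width, board_height)` blocks of `dim3(BLOCK, BLOCK)` threads.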

Within the kernel, have each thread load its element into shared memory, then call __syncthreads(). Then have the middle 14x14 threads calculate the new value (using the values stored in shared memory) and write it back into global memory. Using shared memory minimizes global reads and writes. This is also the reason to make your thread block as big as possible: the edges and corners are "wasted" global memory accesses, since the values fetched there are used by the block only 1 or 3 times, not 9 times.
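The approach above might look roughly like the following sketch. It is not a tuned implementation: the kernel name `golStep`, the host wrapper `golStepOnDevice`, and the `counts` layout are hypothetical, cells outside the board are assumed to be dead (the question does not show how `neighbors` treats the border), and the transition counters use plain `atomicAdd` on global memory for simplicity. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 16;         // 16x16 threads load a 16x16 patch
constexpr int TILE  = BLOCK - 2;  // the middle 14x14 threads compute results

// counts[0..3] = overcrowding, underpopulation, survival, birth (assumed layout)
__global__ void golStep(const int* grid, int* next, int rows, int cols,
                        unsigned int* counts)
{
    __shared__ int tile[BLOCK][BLOCK];
    int tx = threadIdx.x, ty = threadIdx.y;

    // Shift by -1 so the outer ring of threads covers the one-cell border.
    int col = (int)blockIdx.x * TILE + tx - 1;
    int row = (int)blockIdx.y * TILE + ty - 1;
    bool inBoard = row >= 0 && row < rows && col >= 0 && col < cols;

    // Assumption: anything outside the board counts as dead.
    tile[ty][tx] = inBoard ? grid[row * cols + col] : 0;
    __syncthreads();

    bool interior = tx >= 1 && tx <= TILE && ty >= 1 && ty <= TILE;
    if (!interior || !inBoard) return;

    int n = 0;  // live neighbors, read from shared memory only
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0) n += tile[ty + dy][tx + dx];

    int alive = tile[ty][tx];
    int result = alive;
    if (alive && n > 3)      { result = 0; atomicAdd(&counts[0], 1u); }
    else if (alive && n < 2) { result = 0; atomicAdd(&counts[1], 1u); }
    else if (alive)          {             atomicAdd(&counts[2], 1u); }
    else if (n == 3)         { result = 1; atomicAdd(&counts[3], 1u); }
    next[row * cols + col] = result;
}

// Hypothetical host wrapper: runs one generation and fetches the counters.
void golStepOnDevice(const int* h_grid, int* h_next, int rows, int cols,
                     unsigned int h_counts[4])
{
    size_t bytes = sizeof(int) * rows * cols;
    int *d_grid, *d_next;
    unsigned int* d_counts;
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_next, bytes);
    cudaMalloc(&d_counts, 4 * sizeof(unsigned int));
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, 4 * sizeof(unsigned int));

    dim3 threads(BLOCK, BLOCK);
    dim3 blocks((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    golStep<<<blocks, threads>>>(d_grid, d_next, rows, cols, d_counts);

    cudaMemcpy(h_next, d_next, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_counts, d_counts, 4 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_grid);
    cudaFree(d_next);
    cudaFree(d_counts);
}
```

For n iterations, the host would keep both buffers on the device and swap the `grid` and `next` pointers between launches instead of copying back each time. If the atomics become a bottleneck, summing the counters per block in shared memory before a single global `atomicAdd` is one common alternative.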

梦初启 2024-10-14 08:56:59

Here's one way you could proceed:

  1. Each thread performs the computation for 1 element of the grid
  2. Each thread first loads one element from the main grid into shared memory
  3. Threads on the edge of the thread block also need to load the neighboring boundary elements
  4. Each thread can then make its survival computation based on the contents of shared memory
  5. Each thread then writes its result back to main memory
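The five steps above could be sketched as follows. This is only an illustration under stated assumptions: the names `B`, `cellAt`, `golStepHalo`, and `golStepHaloOnDevice` are hypothetical, cells outside the board are assumed dead, and transition counting and error checking are left out to keep the halo loading visible.

```cuda
#include <cuda_runtime.h>

constexpr int B = 16;  // block side; every thread computes exactly one cell

// Hypothetical helper: bounds-checked read, treating off-board cells as dead.
__device__ int cellAt(const int* grid, int rows, int cols, int r, int c)
{
    return (r >= 0 && r < rows && c >= 0 && c < cols) ? grid[r * cols + c] : 0;
}

__global__ void golStepHalo(const int* grid, int* next, int rows, int cols)
{
    __shared__ int tile[B + 2][B + 2];  // B x B cells plus a one-cell border
    int lx = threadIdx.x, ly = threadIdx.y;
    int tx = lx + 1, ty = ly + 1;       // position inside the padded tile
    int col = (int)blockIdx.x * B + lx;
    int row = (int)blockIdx.y * B + ly;

    // Step 2: every thread loads its own element.
    tile[ty][tx] = cellAt(grid, rows, cols, row, col);

    // Step 3: threads on the block edge also load the adjacent border cells.
    if (lx == 0)     tile[ty][0]     = cellAt(grid, rows, cols, row, col - 1);
    if (lx == B - 1) tile[ty][B + 1] = cellAt(grid, rows, cols, row, col + 1);
    if (ly == 0)     tile[0][tx]     = cellAt(grid, rows, cols, row - 1, col);
    if (ly == B - 1) tile[B + 1][tx] = cellAt(grid, rows, cols, row + 1, col);
    if (lx == 0 && ly == 0)         tile[0][0]         = cellAt(grid, rows, cols, row - 1, col - 1);
    if (lx == B - 1 && ly == 0)     tile[0][B + 1]     = cellAt(grid, rows, cols, row - 1, col + 1);
    if (lx == 0 && ly == B - 1)     tile[B + 1][0]     = cellAt(grid, rows, cols, row + 1, col - 1);
    if (lx == B - 1 && ly == B - 1) tile[B + 1][B + 1] = cellAt(grid, rows, cols, row + 1, col + 1);
    __syncthreads();

    if (row >= rows || col >= cols) return;

    // Step 4: count live neighbors from shared memory.
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0) n += tile[ty + dy][tx + dx];

    // Step 5: write the next state back to global memory.
    int alive = tile[ty][tx];
    next[row * cols + col] = (n == 3 || (alive && n == 2)) ? 1 : 0;
}

// Hypothetical host wrapper: runs one generation on the device.
void golStepHaloOnDevice(const int* h_grid, int* h_next, int rows, int cols)
{
    size_t bytes = sizeof(int) * rows * cols;
    int *d_grid, *d_next;
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_next, bytes);
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

    dim3 threads(B, B);
    dim3 blocks((cols + B - 1) / B, (rows + B - 1) / B);
    golStepHalo<<<blocks, threads>>>(d_grid, d_next, rows, cols);

    cudaMemcpy(h_next, d_next, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_grid);
    cudaFree(d_next);
}
```

Compared with the 16x16-loads-14x14 layout, every thread here produces a result, at the cost of extra, less regular loads by the edge and corner threads.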