How to optimize Conway's Game of Life for CUDA?

Posted on 2024-10-10 11:57:45

I've written this CUDA kernel for Conway's game of life:

__global__ void gameOfLife(float* returnBuffer, int width, int height) {  
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;  
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;  
    float p = tex2D(inputTex, x, y);  
    float neighbors = 0;  
    neighbors += tex2D(inputTex, x+1, y);  
    neighbors += tex2D(inputTex, x-1, y);  
    neighbors += tex2D(inputTex, x, y+1);  
    neighbors += tex2D(inputTex, x, y-1);  
    neighbors += tex2D(inputTex, x+1, y+1);  
    neighbors += tex2D(inputTex, x-1, y-1);  
    neighbors += tex2D(inputTex, x-1, y+1);  
    neighbors += tex2D(inputTex, x+1, y-1);  
    __syncthreads();  
    float final = 0;  
    if(neighbors < 2) final = 0;  
    else if(neighbors > 3) final = 0;  
    else if(p != 0) final = 1;  
    else if(neighbors == 3) final = 1;  
    __syncthreads();  
    returnBuffer[x + y*width] = final;  
}

I am looking for errors/optimizations.
Parallel programming is quite new to me and I am not sure if I get how to do it right.

The rest is a memcpy from an input array to the 2D texture inputTex bound to a CUDA array. Output is memcpy-ed from global memory to host and then dealt with.
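
For context, the host-side setup being described might look roughly like the sketch below, assuming the legacy texture reference API that the tex2D calls imply (the function name runStep and the variable names are placeholders, not from the original post):

texture<float, 2, cudaReadModeElementType> inputTex;  // file-scope texture reference used by the kernel

void runStep(const float* h_input, float* d_returnBuffer, int width, int height) {
    // Copy the host grid into a CUDA array and bind it to the texture.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_input, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(inputTex, cuArray, desc);

    // One thread per pixel, as in the question.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    gameOfLife<<<grid, block>>>(d_returnBuffer, width, height);

    cudaUnbindTexture(inputTex);
    cudaFreeArray(cuArray);
}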

As you can see a thread deals with a single pixel. I am unsure if that is the fastest way as some sources suggest doing a row or more per thread. If I understand correctly NVidia themselves say that the more threads, the better. I would love advice on this from someone with practical experience.

3 Answers

流心雨 2024-10-17 11:57:45

My two cents.

The whole thing looks likely to be bounded by the latency of communication between multiprocessors and the GPU memory. You have code that should take something like 30-50 clock ticks to execute on its own, and it generates at least 3 memory accesses which take 200+ clock ticks each if the requisite data is not in the cache.

Using texture memory is a good way to address that, but it is not necessarily the optimal way.

At the very least, try to do 4 pixels at a time (horizontally) per thread. Global memory can be accessed 128 bytes at a time (as long as you have a warp trying to access any byte in a 128-byte interval, you might as well pull in the whole cache line at almost no additional cost). Since a warp is 32 threads, having each thread work on 4 pixels should be efficient.
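
As a rough, hypothetical sketch of that layout (assuming one byte per cell in plain global memory rather than the question's float texture; only interior cells are handled, borders are omitted for brevity):

__global__ void lifeStep4(const unsigned char* in, unsigned char* out,
                          int width, int height) {
    // Each thread owns 4 horizontally adjacent cells, so a 32-thread warp
    // touches a contiguous 128-byte span of each row it reads.
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    int y  =  blockIdx.y * blockDim.y + threadIdx.y;
    if (y == 0 || y >= height - 1 || x0 == 0 || x0 + 4 >= width)
        return;  // skip the outermost cells; real code would handle borders

    for (int i = 0; i < 4; ++i) {
        int x = x0 + i;
        int n = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx || dy) n += in[(y + dy) * width + (x + dx)];
        unsigned char alive = in[y * width + x];
        out[y * width + x] = (n == 3) || (alive && n == 2);  // standard Life rule
    }
}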

Furthermore, you want to have vertically-adjacent pixels worked on by the same multiprocessor. The reason is that adjacent rows share the same input data. If you have the pixel (x=0,y=0) worked on by one MP and the pixel (x=0,y=1) is worked on by a different MP, both MPs have to issue three global memory requests each. If they are both worked on by the same MP and the results are properly cached (implicitly or explicitly), you only need a total of four. This can be done by having each thread work on several vertical pixels, or by having blockDim.y>1.

More generally, you'd probably want to have each 32-thread warp load as much memory as you have available on the MP (16-48 kb, or at least a 128x128 block), and then process all pixels within that window.

On devices of compute capability before 2.0, you'll want to use shared memory. On devices of compute capability 2.0 and 2.1, caching capabilities are much improved, so global memory may be fine.
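
A minimal sketch of that shared-memory variant (illustrative, not the answer's exact scheme), assuming a 16x16 block and one byte per cell: each block stages its tile plus a one-cell halo once, then every thread reads all eight neighbours from shared memory.

#define TILE 16  // expects a launch with dim3 block(TILE, TILE)

__global__ void lifeStepShared(const unsigned char* in, unsigned char* out,
                               int width, int height) {
    // Tile plus a one-cell halo on every side.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the (TILE+2) x (TILE+2) window, treating
    // out-of-range cells as dead.
    for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE) {
            int gx = blockIdx.x * TILE + tx - 1;
            int gy = blockIdx.y * TILE + ty - 1;
            tile[ty][tx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                               ? in[gy * width + gx] : 0;
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx || dy) n += tile[threadIdx.y + 1 + dy][threadIdx.x + 1 + dx];

    unsigned char alive = tile[threadIdx.y + 1][threadIdx.x + 1];
    out[y * width + x] = (n == 3) || (alive && n == 2);
}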

Some nontrivial savings could be had by making sure that each warp only accesses two cache lines in each horizontal row of input pixels, instead of three, as would happen in a naive implementation that works on 4 pixels per thread, 32 threads per warp.

There's no good reason to use float as the buffer type. Not only do you end up using four times the memory bandwidth you need, but the code becomes unreliable and bug-prone. (For example, are you sure that if(neighbors == 3) works correctly, since you're comparing a float and an integer?) Use unsigned char. Better yet, use uint8_t and typedef it to unsigned char if it's not defined.
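
For instance, with integer cells the update rule reduces to exact comparisons (a hypothetical helper, not code from the post):

#include <stdint.h>

typedef uint8_t cell_t;  // uint8_t is unsigned char on any common platform

__device__ cell_t nextState(cell_t alive, int neighbors) {
    // Exact integer comparisons; no float-vs-int equality to worry about.
    return (neighbors == 3) || (alive && neighbors == 2);
}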

Finally, don't underestimate the value of experimenting. Quite often performance of CUDA code can't be easily explained by logic and you have to resort to tweaking parameters and seeing what happens.

深居我梦 2024-10-17 11:57:45

TL;DR: see: http://golly.sourceforge.net

The problem is that most CUDA implementations follow the brain-dead idea of manually counting the neighbors. This is so slow that any smart serial CPU implementation will outperform it.

The only sensible way to do GoL calculations is using a lookup table.
The currently fastest CPU implementations look up a square 4x4 = 16-bit block to get the future 2x2 cells inside.

In this setup the cells are laid out like so:

 01234567
0xxxxxxxx //byte0
1xxxxxxxx //byte1 
2  etc
3
4
5
6
7

Some bit-shifting is employed to get a 4x4 block to fit into a word, and that word is looked up using a lookup table. The lookup table holds words as well; this way 4 different versions of the outcome can be stored in the table, so you can minimize the amount of bit-shifting needed on the input and/or the output.
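
As a hypothetical illustration of the table itself, here is a simplified builder that stores just the 4 result bits per entry (not the word-sized variants described above; all names are made up). Each 16-bit index encodes a 4x4 block row-major, and the entry holds the next-generation state of the inner 2x2 cells:

#include <cstdint>

static uint8_t lifeTable[1 << 16];

static int cellAt(uint32_t block, int r, int c) {
    return (block >> (r * 4 + c)) & 1;  // bit (r*4 + c) holds cell (row r, col c)
}

void buildLifeTable() {
    for (uint32_t b = 0; b < (1u << 16); ++b) {
        uint8_t result = 0;
        // The inner 2x2 cells sit at rows 1-2, columns 1-2 of the 4x4 block.
        for (int r = 1; r <= 2; ++r)
            for (int c = 1; c <= 2; ++c) {
                int n = 0;
                for (int dr = -1; dr <= 1; ++dr)
                    for (int dc = -1; dc <= 1; ++dc)
                        if (dr || dc) n += cellAt(b, r + dr, c + dc);
                int next = (n == 3) || (cellAt(b, r, c) && n == 2);
                result |= next << ((r - 1) * 2 + (c - 1));
            }
        lifeTable[b] = result;
    }
}

Stepping the inner 2x2 of any packed 4x4 block is then a single lookup: uint8_t inner = lifeTable[packed4x4];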

In addition the different generations are staggered, so that you only have to look at 4 neighboring slabs, instead of 9.
Like so:

AAAAAAAA 
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
           BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixel below and to the left of A.

This alone results in a 1000x+ speed-up compared to silly counting implementations.

Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.

And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time of normal implementations.
This allows you to calculate generation 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately HashLife requires recursion, and thus is difficult on CUDA.

分开我的手 2024-10-17 11:57:45

Have a look at this thread, we did a lot of improvements over there ...

http://forums.nvidia.com/index.php?showtopic=152757&st=60
