What is an efficient implementation of Conway's Game of Life for low memory usage?
I'm looking for a fast and memory efficient approach for implementing Conway's Game of Life.
Constraints: a 96x128 board, approximately 2kB of RAM available, and a 52MHz processor (see the tech specs here: http://www.getinpulse.com/features).
My current naive solution that represents each cell as a single bit in a matrix (96*128/8=1,536 bytes) works but is too slow. What tricks can be used to improve performance?
Storing the coordinates of live cells (for example in this implementation http://dotat.at/prog/life/life.html) would use too much memory.
Looks like a fun piece of hardware.
Storing 1 bit per pixel of a 96x128 pixel display gives 12,288 bits.
That's already over half of the 16,384 bits you say are "available".
If you can't even store 2 bits per cell, there's not a lot of room to do much.
A few ideas:
You have a 32-bit processor. There are several bit-twiddling tricks to take a bitmap and calculate the number of neighbors of several cells in parallel on such a processor.
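One such trick, sketched here under the assumption that each board row is packed into 32-bit words (carries across word boundaries are ignored for brevity): shift-and-mask full-adder logic computes, for all 32 columns of a word at once, the two-bit sum of each bit and its two horizontal neighbors.

```c
#include <stdint.h>

/* For all 32 columns of a packed row at once, compute the 2-bit sum
 * (bit n-1) + (bit n) + (bit n+1): `low` holds the parity of the three
 * bits, `high` the majority ("two or more set"). Carries across word
 * boundaries are ignored in this sketch. */
static void horizontal_sum(uint32_t row, uint32_t *high, uint32_t *low)
{
    uint32_t l = row << 1;  /* bit n of l is bit n-1 of row */
    uint32_t r = row >> 1;  /* bit n of r is bit n+1 of row */
    *low  = l ^ row ^ r;
    *high = (l & row) | (l & r) | (row & r);
}
```

Combining three such pairs (from the rows above, at, and below a cell) then yields every cell's full neighbor count in a handful of word-wide operations.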
It's often faster to store the neighbor counts, incrementing all 8 neighbors at birth and decrementing all 8 neighbors at death, rather than recalculating the number of neighbors from scratch every generation -- but it doesn't look like you have enough memory for this approach.
Perhaps you could use 2x2 pixels per cell (so only 3,072 cells visible) or 3x3 pixels per cell (so only 1,376 cells visible), so your software does less work and so gives the illusion of running faster. (This also frees up enough RAM that you can do the neighbor count mentioned above).
Since many life patterns have large areas of dead space, perhaps have a small bitmap of "live regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region is entirely dead, and "1" if any of the cells in the corresponding region is alive. The next-generation update only needs to look at live regions and the 1-cell-wide border of dead regions; you can skip checking the 6x6 cell core of dead regions. Also, you can completely skip the entire 8x8 cell region if its 4 nearest neighboring regions are also dead.
Since many life patterns have large areas of static unchanging "still life" patterns, perhaps have a small bitmap of "dynamic regions" -- say, a 12x16 array of bits, where each bit is "0" if the corresponding 8x8 cell region had no changes in the last generation update, and "1" if any of the cells in the corresponding region changed in the last update. The next generation update only needs to look at dynamic regions and the 1-cell-wide border of static regions; you can skip checking the 6x6 cell core of static regions, since it will be the same in the next generation.
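A minimal sketch of the bookkeeping such a region bitmap needs, assuming 8x8 regions over the 96x128 board (the array costs only 24 bytes) and hypothetical helper names; for simplicity this version rescans a whole region whenever it or one of its 4 nearest neighbors changed, rather than only the 1-cell-wide border described above:

```c
#include <stdint.h>
#include <stdbool.h>

#define REG_ROWS 12   /* 96 board rows / 8 */
#define REG_COLS 16   /* 128 board columns / 8 */

/* one bit per 8x8 region: set if anything in it changed last generation */
static uint16_t changed[REG_ROWS];

/* hypothetical helper: call when cell (r, c) flips during an update */
static void mark_changed(int r, int c)
{
    changed[r >> 3] |= (uint16_t)(1u << (c >> 3));
}

/* a region must be rescanned if it, or one of its 4 nearest neighbour
 * regions, changed last generation */
static bool region_needs_update(int rr, int rc)
{
    if (changed[rr] & (1u << rc)) return true;
    if (rr > 0            && (changed[rr - 1] & (1u << rc))) return true;
    if (rr < REG_ROWS - 1 && (changed[rr + 1] & (1u << rc))) return true;
    if (rc > 0            && (changed[rr] & (1u << (rc - 1)))) return true;
    if (rc < REG_COLS - 1 && (changed[rr] & (1u << (rc + 1)))) return true;
    return false;
}
```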
If a pattern is "large enough", a representation based on Gosper's Hashlife may be able to store it in less RAM than storing a bitmap directly. Alas, I suspect you're far below the "large enough" threshold.
Look at the chapter on the Game of Life in "The Zen of Code Optimization" by Michael Abrash. There was an implementation that coded the current and previous states of 3 cells into a single word, and used tricks with lookup tables and carry bits to generate the next states. It was blazingly fast.
It is quite common with small life universes to make them wrap round on all sides (to make a toroidal universe) but this requires double buffering. In your case this requires 3KB RAM but you only have 2KB.
If you don't wrap then you don't need to double-buffer the whole universe. You just need to avoid overwriting cells before you have finished using them as input to the calculation.
Say your universe is laid out as a conventional bitmap. We are going to treat it as a series of rows arranged sequentially in memory. Say the universe has four rows, numbered 0 to 3.
When you calculate the next generation, the new version of row 3 is calculated using rows 2, 3, and 4 (which is blank). You can write the new version of row 3 on top of row 4. Similarly, calculate the new row 2 from rows 1,2,3 and write it on top of row 3, since that data is no longer needed after calculating row 2. The new row 1 is calculated from rows 0,1,2 and written over row 2.
This means you only need memory for one extra row. 97*128 bits is 1552 bytes.
The disadvantage is that your universe scrolls through memory and you have to have some mechanism to deal with this. Either copy it back to its original position after each calculation (which is slow) or arrange to be able to display it from any position in memory, and ensure that your calculation and display systems can wrap around from high to low addresses.
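The scheme above might be sketched like this; `life_row` here is a deliberately naive stand-in for whatever optimized kernel you actually use, and the names are illustrative. The interesting parts are writing each new row one slot below its inputs and the copy-back at the end (the slow option mentioned above):

```c
#include <stdint.h>
#include <string.h>

#define ROWS  96
#define COLS  128
#define WORDS (COLS / 32)   /* 32-bit words per row */

/* one spare row at the bottom: generation n lives in row slots
 * 0..ROWS-1, generation n+1 gets written into slots 1..ROWS */
static uint32_t board[(ROWS + 1) * WORDS];

static uint32_t get(const uint32_t *row, int c)
{
    return (row[c >> 5] >> (c & 31)) & 1u;
}

/* naive stand-in kernel: next state of one row from the rows around it */
static void life_row(const uint32_t *above, const uint32_t *row,
                     const uint32_t *below, uint32_t *dst)
{
    uint32_t out[WORDS] = {0};
    for (int c = 0; c < COLS; c++) {
        int n = 0;
        for (int dc = -1; dc <= 1; dc++) {
            int cc = c + dc;
            if (cc < 0 || cc >= COLS) continue;
            n += (int)(get(above, cc) + get(below, cc));
            if (dc != 0) n += (int)get(row, cc);
        }
        if (n == 3 || (n == 2 && get(row, c)))
            out[c >> 5] |= 1u << (c & 31);
    }
    memcpy(dst, out, sizeof out);  /* dst aliases `below`, so buffer locally */
}

void step_in_place(void)
{
    static const uint32_t blank[WORDS] = {0};
    /* bottom row first: each new row lands one slot below its inputs,
     * so the old data is still intact when the rows above need it */
    for (int r = ROWS - 1; r >= 0; r--) {
        const uint32_t *above = (r > 0) ? &board[(r - 1) * WORDS] : blank;
        const uint32_t *row   = &board[r * WORDS];
        const uint32_t *below = (r < ROWS - 1) ? &board[(r + 1) * WORDS] : blank;
        life_row(above, row, below, &board[(r + 1) * WORDS]);
    }
    /* the new generation now sits in slots 1..ROWS; copy it back
     * (the slow option -- the alternative is displaying from the offset) */
    memmove(board, board + WORDS, sizeof(uint32_t) * ROWS * WORDS);
}
```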
I would suggest starting with a routine that reads a row of the board and generates two new row buffers, H and L, such that bit 'n' of H will be set if two or more of (bit n+1, bit n, bit n-1) were set in the original row, and bit 'n' of L will be set if an odd number of (bit n+1, bit n, bit n-1) were set in the original row.
Allocate a total of three pairs of row buffers (call them H0..H2 and L0..L2). Take each row of the source bitmap, compute its pair of buffers, and store them in an HL pair, keeping that pair and the previous two. Examining a word from all six buffers will reveal which of 32 cells in the original matrix should be live if and only if they were previously live, and which should be alive regardless of previous state.
For optimal performance in machine code, the three pairs of buffers may be interleaved; this may make it possible to achieve an execution speed of less than two cycles per pixel. Perhaps overkill at 52MHz (with only 12,288 pixels, two cycles per pixel would still be a frame rate of roughly 2,000fps), but speed is cool.
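Assuming rows packed into 32-bit words and ignoring carries across word boundaries (a real implementation would patch those up), the H/L computation and the combining step might look like the following. Since the center row's bit triple includes the cell itself, the six buffers give S = neighbours + self, and a cell is alive next generation iff S == 3, or S == 4 while it is currently alive:

```c
#include <stdint.h>

/* H/L pair for one packed row: bit n of L is the parity, and bit n of H
 * the majority, of (bit n-1, bit n, bit n+1) of the row */
static void row_hl(uint32_t row, uint32_t *h, uint32_t *l)
{
    uint32_t a = row << 1, c = row >> 1;
    *l = a ^ row ^ c;
    *h = (a & row) | (a & c) | (row & c);
}

/* Next state of 32 cells at once, from the cells themselves plus the
 * (H, L) pairs of the rows above (0), at (1), and below (2). With
 * S = (L0+L1+L2) + 2*(H0+H1+H2) = neighbours + self, a cell is alive
 * next generation iff S == 3, or S == 4 while currently alive. */
static uint32_t next_word(uint32_t cells,
                          uint32_t h0, uint32_t l0,
                          uint32_t h1, uint32_t l1,
                          uint32_t h2, uint32_t l2)
{
    uint32_t s_low = l0 ^ l1 ^ l2;                      /* bit 0 of L0+L1+L2 */
    uint32_t s_car = (l0 & l1) | (l0 & l2) | (l1 & l2); /* bit 1 of L0+L1+L2 */
    uint32_t h_low = h0 ^ h1 ^ h2;                      /* bit 0 of H0+H1+H2 */
    uint32_t h_car = (h0 & h1) | (h0 & h2) | (h1 & h2); /* bit 1 of H0+H1+H2 */
    uint32_t m = s_car ^ h_low;  /* bit 1 of S */
    uint32_t k = s_car & h_low;  /* contributes 4 to S */
    /* so S = s_low + 2*m + 4*(k + h_car) */
    uint32_t is3 = s_low & m & ~(k | h_car);
    uint32_t is4 = ~s_low & ~m & (k ^ h_car);
    return is3 | (cells & is4);
}
```

As described above, each (H, L) pair is reused as the three-row window slides down the board, so only three pairs need to be live at a time.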
If you have license to abuse the rest of the 30KB on the device (i.e. the flash memory), you could always store the board there; not ideal, but it is a potential way to work around the limited RAM.
Efficiency-wise, you're going to be trading CPU and memory performance against each other:
For memory efficiency, an array of bits is likely the optimal solution, but you lose CPU efficiency by iterating through that grid.
Depending on the capabilities of the processor and the size of memory addresses, you could use a linked list of cells to avoid iterating through the entire grid: it would definitely save you time scanning through huge regions of dead cells.
Stick with the bit array and skip the neighbor counts with this trick: create a lookup table with the possible bit patterns of neighboring cell blocks that create or maintain a live cell.
The "index" pattern can be packed into 8 bits if you want to minimize memory: 3 bits from the row above, two bits from the columns to either side, and 3 bits from the row below. You can encode the output as a single bit in a lookup table, taking up a total of only 256 bits. Use the index as a bit-shift count for your lookup array to "calculate" the result. Bit shifts and arithmetic OR operations are still very fast, and this technique eliminates counting neighboring live cells on the fly - or rather, the lookup table pre-processes the count and calculation.
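One subtlety: survival depends on the cell's own current state, so a single 256-bit table can't express the rule by itself. A sketch under that reading uses two 256-bit tables (64 bytes total), one for birth and one for survival; the names are illustrative:

```c
#include <stdint.h>

/* two 256-bit lookup tables, 32 bytes each: bit i of `birth` is set when
 * the 8 packed neighbour bits of index i contain exactly 3 live cells,
 * bit i of `survive` when they contain 2 or 3 */
static uint32_t birth[8], survive[8];

static int popcount8(unsigned v)
{
    int n = 0;
    while (v) { n += (int)(v & 1u); v >>= 1; }
    return n;
}

static void build_tables(void)
{
    for (unsigned i = 0; i < 256; i++) {
        int n = popcount8(i);
        if (n == 3)           birth[i >> 5]   |= 1u << (i & 31u);
        if (n == 2 || n == 3) survive[i >> 5] |= 1u << (i & 31u);
    }
}

static unsigned lookup(const uint32_t *table, unsigned idx)
{
    return (table[idx >> 5] >> (idx & 31u)) & 1u;
}

/* idx = the 8 neighbour bits packed as described above:
 * 3 from the row above, 2 from either side, 3 from the row below */
static unsigned next_state(unsigned cell, unsigned idx)
{
    return lookup(birth, idx) | (cell & lookup(survive, idx));
}
```

The tables themselves could live in flash rather than RAM, since they never change after being built.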
The top processing bottlenecks should be: checking for board edge conditions, i.e. end of a row; word boundaries; extracting and packing neighbor bits as an index value. With a 32-bit processor, you should be able to cycle through 30 cells very quickly before hitting a word boundary. Addressing cell rows could simply be a matter of adding the number of columns/32 if you pack the bits into word-sized ints. Cache the results into two spare new-life rows and copy a whole row when you're done processing one.
There may be some pattern symmetries you can take advantage of to optimize things even more, but this may be enough. I think this technique will keep the processing and memory usage within your limitations.