Optimizing Conway's Game of Life
To experiment, I've (long ago) implemented Conway's Game of Life (and I'm aware of this related question!).
My implementation worked by keeping 2 arrays of booleans, representing the 'last state', and the 'state being updated' (the 2 arrays being swapped at each iteration). While this is reasonably fast, I've often wondered about how to optimize this.
One idea, for example, would be to precompute at iteration N the zones that could be modified at iteration (N+1) (so that if a cell does not belong to such a zone, it won't even be considered for modification at iteration (N+1)). I'm aware that this is very vague, and I never took time to go into the details...
Do you have any ideas (or experience!) of how to go about optimizing (for speed) Game of Life iterations?
Two ideas:
(1) Many configurations are mostly empty space. Keep a linked list (not necessarily in order, that would take more time) of the live cells, and during an update, only update around the live cells (this is similar to your vague suggestion, OysterD :)
(2) Keep an extra array which stores the # of live cells in each row of 3 positions (left-center-right). Now when you compute the new dead/live value of a cell, you need only 4 read operations (top/bottom rows and the center-side positions), and 4 write operations (update the 3 affected row summary values, and the dead/live value of the new cell). This is a slight improvement from 8 reads and 1 write, assuming writes are no slower than reads. I'm guessing you might be able to be more clever with such configurations and arrive at an even better improvement along these lines.
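A minimal sketch of idea (2) in C, assuming a 0/1 byte grid; the names (rowsum, set_cell, neighbours), the fixed board size and the ignored border cells are my own choices, not part of the answer:

    /* rowsum[y][x] caches grid[y][x-1] + grid[y][x] + grid[y][x+1], so a
       cell's neighbour count needs only 4 reads, as described above */
    #include <stdint.h>

    #define W 512
    #define H 512

    static uint8_t grid[H][W];     /* 1 = alive                               */
    static uint8_t rowsum[H][W];   /* live cells among (x-1, x, x+1) in row y */

    /* neighbour count of an interior cell: 2 rowsum reads + 2 cell reads */
    static inline int neighbours(int x, int y)
    {
        return rowsum[y - 1][x] + rowsum[y + 1][x]
             + grid[y][x - 1] + grid[y][x + 1];
    }

    /* flip cell (x, y) and keep the summary consistent: 1 cell write plus
       3 rowsum writes, matching the 4 writes mentioned above */
    static inline void set_cell(int x, int y, int alive)
    {
        int d = alive - grid[y][x];
        if (d == 0)
            return;
        grid[y][x] = (uint8_t)alive;
        rowsum[y][x - 1] = (uint8_t)(rowsum[y][x - 1] + d);
        rowsum[y][x]     = (uint8_t)(rowsum[y][x]     + d);
        rowsum[y][x + 1] = (uint8_t)(rowsum[y][x + 1] + d);
    }

As in the question's scheme, the new values for a generation should be computed from the old grid/rowsum pair before set_cell is applied, e.g. by double-buffering both arrays.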
Which algorithm is most efficient mainly depends on the initial state.
If the majority of cells are dead, you can save a lot of CPU time by skipping empty parts and not calculating things cell by cell.
In my opinion it can make sense to check for completely dead spaces first, when your initial state is something like "random, but with a chance for life lower than 5%".
I would just divide the matrix up into halves and start checking the bigger ones first.
So if you have a field of 10,000 * 10,000, you'd first accumulate the states of the upper-left quarter of 5,000 * 5,000.
If the sum of states is zero in the first quarter, you can ignore this first quarter completely and check the upper-right 5,000 * 5,000 for life next.
If its sum of states is >0, you then divide the second quarter into 4 pieces again and repeat this check for life for each of these subspaces.
You could now go down to subframes of 8*8 or 10*10 (not sure what makes the most sense here).
Whenever you find life, you mark these subspaces as "has life".
Only spaces which "have life" need to be divided into smaller subspaces; the empty ones can be skipped.
When you are finished assigning the "has life" attribute to all possible subspaces, you end up with a list of subspaces which you simply extend by +1 in each direction (with empty cells) and apply the regular (or modified) Game of Life rules to them.
You might think that dividing a 10,000 * 10,000 space into 8*8 subspaces is a lot of tasks, but accumulating their state values is in fact much, much less computing work than running the GoL algorithm on each cell plus its 8 neighbours, comparing the counts and storing the new state for the next iteration somewhere...
But like I said above, for a random initial state with 30% population this won't make much sense, as there will not be many completely dead 8*8 subspaces to find (let alone dead 256*256 subspaces).
And of course, last but not least, the perfect optimisation will depend on your language.
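A rough sketch of this recursive pruning in C, assuming a power-of-two board; the names (region_sum, collect_live_tiles), the leaf size and the callback are my own choices, not the answer's:

    #define N 1024                       /* board is N x N, N a power of two */
    #define LEAF 8                       /* stop subdividing at 8*8 tiles    */

    extern unsigned char board[N][N];    /* 1 = alive */

    /* number of live cells in the square of side `size` at (x0, y0) */
    static int region_sum(int x0, int y0, int size)
    {
        int sum = 0;
        for (int y = y0; y < y0 + size; y++)
            for (int x = x0; x < x0 + size; x++)
                sum += board[y][x];
        return sum;
    }

    /* mark(x, y) is called once for every LEAF*LEAF tile containing life;
       completely dead quarters are pruned without visiting their leaves */
    static void collect_live_tiles(int x0, int y0, int size,
                                   void (*mark)(int x, int y))
    {
        if (region_sum(x0, y0, size) == 0)
            return;                            /* dead subspace: skip it   */
        if (size == LEAF) {
            mark(x0, y0);                      /* smallest "has life" unit */
            return;
        }
        int h = size / 2;                      /* recurse into 4 quarters  */
        collect_live_tiles(x0,     y0,     h, mark);
        collect_live_tiles(x0 + h, y0,     h, mark);
        collect_live_tiles(x0,     y0 + h, h, mark);
        collect_live_tiles(x0 + h, y0 + h, h, mark);
    }

The tiles collected this way, extended by one cell in each direction as described, are then the only ones that need the regular rules applied.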
I am going to quote my answer from the other question, because the chapters I mention have some very interesting and fine-tuned solutions. Some of the implementation details are in C and/or assembly, yes, but for the most part the algorithms can work in any language:
There are some super-fast implementations that (from memory) represent cells of 8 or more adjacent squares as bit patterns and use that as an index into a large array of precalculated values to determine in a single machine instruction if a cell is live or dead.
Check out here:
http://dotat.at/prog/life/life.html
Also XLife:
http://linux.maruhn.com/sec/xlife.html
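As a toy illustration of that table-driven idea (the linked implementations are far more refined), one can pack a cell's 3x3 neighbourhood into a 9-bit pattern and look the result up; the names and bit layout below are my own, not taken from those projects:

    #include <stdint.h>

    static uint8_t next_state[512];   /* next state, indexed by a 9-bit 3x3 pattern */

    static void build_table(void)     /* call once at startup */
    {
        for (int pat = 0; pat < 512; pat++) {
            int self = (pat >> 4) & 1;                    /* bit 4 = centre cell */
            int neighbours = __builtin_popcount(pat) - self;
            next_state[pat] = (neighbours == 3) || (self && neighbours == 2);
        }
    }

    /* next state of cell x; a/b/c are 0/1 byte rows above/at/below it,
       and x is assumed to be an interior column */
    static inline int step_cell(const uint8_t *a, const uint8_t *b,
                                const uint8_t *c, int x)
    {
        int pat = (a[x-1] << 8) | (a[x] << 7) | (a[x+1] << 6)
                | (b[x-1] << 5) | (b[x] << 4) | (b[x+1] << 3)
                | (c[x-1] << 2) | (c[x] << 1) |  c[x+1];
        return next_state[pat];
    }

Real table-driven versions index the table with a wider strip of packed bits, so that one lookup resolves several cells at once.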
As mentioned in Abrash's Black Book, one of the simplest and most straightforward ways to get a huge speedup is to keep a change list.
Instead of iterating through the entire cell grid each time, keep a copy of all the cells that you change.
This will narrow down the work you have to do on each iteration.
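A minimal sketch of such a change list in C, assuming a wrap-around board; the names (queued, step_changelist, ...) are placeholders of mine, not from the book:

    #define W 512
    #define H 512

    typedef struct { int x, y; } Cell;

    static unsigned char cur[H][W];      /* current state, 1 = alive        */
    static unsigned char queued[H][W];   /* already queued as a candidate?  */

    static int neighbours(int x, int y)
    {
        int n = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                if (dx || dy)
                    n += cur[(y + dy + H) % H][(x + dx + W) % W];
        return n;
    }

    /* `changed` holds the cells that flipped last generation; cells that flip
       this generation are written to `out` (sized W*H) and their count returned */
    static int step_changelist(const Cell *changed, int nchanged, Cell *out)
    {
        static Cell cand[W * H];
        static unsigned char newval[W * H];
        int ncand = 0;

        /* candidates = every changed cell and its 8 neighbours, deduplicated;
           nothing else can possibly change state this generation */
        for (int i = 0; i < nchanged; i++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int x = (changed[i].x + dx + W) % W;
                    int y = (changed[i].y + dy + H) % H;
                    if (!queued[y][x]) {
                        queued[y][x] = 1;
                        cand[ncand++] = (Cell){ x, y };
                    }
                }

        /* decide all candidates from the old state first, then write back */
        for (int i = 0; i < ncand; i++) {
            int n = neighbours(cand[i].x, cand[i].y);
            newval[i] = (n == 3) || (cur[cand[i].y][cand[i].x] && n == 2);
        }
        int nout = 0;
        for (int i = 0; i < ncand; i++) {
            int x = cand[i].x, y = cand[i].y;
            queued[y][x] = 0;                  /* reset for the next step */
            if (newval[i] != cur[y][x]) {
                cur[y][x] = newval[i];
                out[nout++] = cand[i];
            }
        }
        return nout;
    }

For the first generation the change list is simply seeded with every live cell; after that only the flipped cells and their neighbourhoods are ever visited.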
You should look into Hashlife, the ultimate optimization. It uses the quadtree approach that skinp mentioned.
I implemented this in C#: all cells have a location, a neighbor count, a state, and access to the rule.
There are table-driven solutions for this that resolve multiple cells in each table lookup. A google query should give you some examples.
It's a two-dimensional automaton, so you can probably look up optimization techniques. Your notion seems to be about reducing the number of cells you need to check at each step. Since you only ever need to check cells that are occupied or adjacent to an occupied cell, perhaps you could keep a buffer of all such cells, updating it at each step as you process each cell.
If your field is initially empty, this will be much faster. You probably can find some balance point at which maintaining the buffer is more costly than processing all the cells.
Don't know exactly how this can be done, but I remember some of my friends had to represent this game's grid with a Quadtree for an assignment. I guess it's really good for optimizing the space of the grid, since you basically only represent the occupied cells. I don't know about execution speed, though.
If you don't want anything too complex, then you can use a grid to slice it up, and if that part of the grid is empty, don't try to simulate it (please view Tyler's answer). However, you could do a few optimizations:
I wrote the following code while testing some ideas for a high-performance Game of Life.
Your idea of using arrays of booleans for storing the state is a step in the right direction, but the real optimization is realizing that 64-bit words already are boolean arrays, and you can do highly parallel operations on them if you know how to, potentially speeding up computation up to 64x.
Something like this should be the starting point for any high-performance finite automata code, as there is little point in introducing more advanced optimizations like quadtrees or tile hashing when the data structures used for state representation and the iteration code itself are far from optimized.
It is not optimal by a long shot, but it illustrates a few core ideas, like exploiting bit-level instruction parallelism for big performance gains and memory savings.
This ran at ~16 million steps per second on a single core of a Ryzen 5700G CPU. That is, an average performance of ~64 billion cell iterations per core per second.
On modern CPUs, the same idea can be extended/adapted to 256-bit AVX or 512-bit AVX512 SIMD registers for a roughly 4x/8x speedup.
PS: Note that the code in its current state doesn't handle multi-tile boundary conditions, which shouldn't be too difficult to introduce. In any case, the board's maximum size is limited to 64x64 cells.
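A rough sketch of the 64-bit-word idea described here (not the author's original code; the adder layout, names and dead, non-wrapping borders are my own simplifications): each uint64_t holds one 64-cell row, and the eight neighbour bit-planes are summed with a small carry-save adder tree.

    #include <stdint.h>

    #define ROWS 64   /* one 64x64 board, matching the answer's size limit */

    /* full/half adders on whole bit-planes: s = sum bit, c = carry bit */
    #define FA(x, y, z, s, c) do { uint64_t _t = (x) ^ (y); (s) = _t ^ (z); \
                                   (c) = ((x) & (y)) | ((z) & _t); } while (0)
    #define HA(x, y, s, c)    do { (s) = (x) ^ (y); (c) = (x) & (y); } while (0)

    /* next state of one 64-cell row, given the rows above/at/below it */
    static uint64_t life_row(uint64_t a, uint64_t b, uint64_t c)
    {
        uint64_t sa, ca, sc, cc, sb, cb;   /* partial sums per source row   */
        uint64_t n0, n1, n2, k1, k2, p, q;

        FA(a << 1, a, a >> 1, sa, ca);     /* 3 neighbours in the row above */
        FA(c << 1, c, c >> 1, sc, cc);     /* 3 neighbours in the row below */
        HA(b << 1, b >> 1,    sb, cb);     /* 2 neighbours in the own row   */

        FA(sa, sc, sb, n0, k1);            /* n0 = bit 0 of the count       */
        FA(ca, cc, cb, p,  q);             /* add the weight-2 partials     */
        HA(p,  k1,     n1, k2);            /* n1 = bit 1 of the count       */
        n2 = q | k2;                       /* set wherever the count >= 4   */

        /* alive next if count == 3, or alive now and count == 2 */
        return (b | n0) & n1 & ~n2;
    }

    /* one generation of the whole board; cells outside it count as dead */
    static void life_step(const uint64_t cur[ROWS], uint64_t nxt[ROWS])
    {
        for (int y = 0; y < ROWS; y++) {
            uint64_t above = (y > 0)        ? cur[y - 1] : 0;
            uint64_t below = (y < ROWS - 1) ? cur[y + 1] : 0;
            nxt[y] = life_row(above, cur[y], below);
        }
    }

Each call to life_row produces 64 next states from a couple of dozen bitwise operations, which is where the "up to 64x" figure comes from; the AVX/AVX512 variants mentioned above widen the same trick to 256 or 512 cells per operation.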
The algorithm itself is inherently parallelizable. Using the same double-buffered method in an unoptimized CUDA kernel, I'm getting around 25ms per generation in a 4096x4096 wrapped world.