Particle pool performance optimization in C++
I have a particle pool which stores particle pointers.
It holds a large number of particles.
Every particle has an IsActive variable.
When I partition the vector every frame according to the particles' active state, CPU usage is high.
So would it instead be a good idea to partition the vector every few seconds on a worker thread?
Since the order of the active particles doesn't matter, and they don't become active or inactive very often, you can use a trick to keep the active particles at the front.

Suppose that nActive is the number of active particles.

When an inactive particle becomes active, do std::swap(vector[nParticle], vector[nActive]); nActive++; i.e. make it the last active particle. Whatever inactive particle was already in that slot goes to where the particle used to be, and we don't care, because we don't care about the order.

When an active particle becomes inactive, do nActive--; std::swap(vector[nParticle], vector[nActive]); i.e. make it the first inactive particle. Again, whatever active particle was already in that slot goes to where the particle used to be, and we don't care about the order.

You can remember this trick by remembering that it's easy to make some particle active or inactive, if we don't care which one it is, because we just add or subtract 1 from nActive. But then we don't know which particle we just changed. So we swap that particle with the one we meant to change, and we don't care where that particle goes, as long as it stays inactive or stays active (doesn't change). Like this:

make particle 3 active: swap particle 3 and 2, then increment nActive
make particle 0 inactive: decrement nActive, then swap particle 3 (now in slot 2) and 0

The particle order is all jumbled up now, but you don't care about that, right?
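In code, the trick might look like this minimal sketch (the Particle type, the pool wrapper, and the index-based interface here are assumptions, not taken from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal particle type; only the active flag matters here.
struct Particle {
    bool isActive = false;
};

// Pool that keeps the active particles packed in indices [0, nActive).
struct ParticlePool {
    std::vector<Particle*> particles;
    std::size_t nActive = 0;

    // Activate the inactive particle at index i: swap it into the first
    // inactive slot (index nActive), then grow the active range by one.
    void activate(std::size_t i) {
        assert(i >= nActive);  // must currently be inactive
        particles[i]->isActive = true;
        std::swap(particles[i], particles[nActive]);
        ++nActive;
    }

    // Deactivate the active particle at index i: shrink the active
    // range by one, then swap it into the freed slot (new nActive).
    void deactivate(std::size_t i) {
        assert(i < nActive);  // must currently be active
        particles[i]->isActive = false;
        --nActive;
        std::swap(particles[i], particles[nActive]);
    }
};
```

Both operations are O(1), so the vector never needs a full per-frame partition pass; the invariant "active particles first" is maintained incrementally.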
While user253751's answer is absolutely correct for optimizing your partitioning scheme, it sounds like you're going to have a larger issue relating to the number of particles and cache coherency.
I'm guessing that the data structure being used is something along the lines of
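For example, an array-of-structures (AoS) layout such as this sketch (the field names are illustrative assumptions, not the question's actual code):

```cpp
#include <vector>

// Illustrative AoS layout: all of a particle's components live
// side by side in one struct.
struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity
    bool isActive;
};

// Pool of heap-allocated particles, as described in the question.
std::vector<Particle*> particles;
```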
This results in each particle element being packed one after another. It will have worse performance in a multithreaded environment, since synchronization may have to occur between cores sharing a single cache line (assuming you process each component separately rather than in chunks; either way, being able to vectorize will still be better).
This has poor cache coherency and isn't conducive to vectorization.
In this case you can likely substantially improve performance by switching to a Data Oriented Programming approach similar to
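A structure-of-arrays (SoA) sketch might look like this (the field names are illustrative assumptions):

```cpp
#include <cstdint>
#include <vector>

// Illustrative SoA layout: one contiguous array per component, so
// identical components sit next to each other in memory.
struct ParticlePool {
    std::vector<float> x, y;            // positions
    std::vector<float> vx, vy;          // velocities
    std::vector<std::uint8_t> isActive; // uint8_t avoids the
                                        // std::vector<bool> bitset quirks
};
```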
This results in identical components being packed together, allowing you to operate on several of them at once.
This memory layout is much more conducive to vectorization and can drastically improve performance due to cache coherency. It also gives you the option to manually vectorize your code, in something similar to
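A sketch of such manual vectorization using SSE intrinsics, assuming the pool stores positions and velocities in flat float arrays and that count is a multiple of 4 (the function and parameter names are illustrative):

```cpp
#include <cstddef>
#include <immintrin.h>  // SSE intrinsics

// Advance 4 particle positions at a time using 128-bit SSE registers.
// Assumes count is a multiple of 4; uses unaligned loads/stores.
void integrate(float* x, const float* vx, float dt, std::size_t count) {
    const __m128 vdt = _mm_set1_ps(dt);  // broadcast dt into all 4 lanes
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 pos = _mm_loadu_ps(x + i);   // load 4 positions
        __m128 vel = _mm_loadu_ps(vx + i);  // load 4 velocities
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));  // pos += vel * dt
        _mm_storeu_ps(x + i, pos);          // store 4 results back
    }
}
```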
This code obviously requires some sanity checks, such as ensuring that each array size is always a multiple of the vector width. Additionally, if you can ensure that the memory starts on a 16-byte boundary, you can use the aligned load and store functions. You can also use the 256-bit AVX and AVX2 registers if those are available to you.
Memory access is the slowest part of modern computers, and doing this ensures that as much data as possible is operated on at a given time. It also makes it easier for the CPU to prefetch memory lines eagerly. You can now also use this layout in a multicore environment with substantially better performance and without clobbering cache lines; this can be done with multiple threads on a CPU, or performed easily on a GPGPU using something like CUDA, OpenCL, or compute shaders.