Particle pool performance optimization in C++

Published 2025-01-16 14:12:56


I have a particle pool which stores particle pointers.

It has a large number of particles.

Every particle has an IsActive variable.

Partitioning the vector every frame according to each particle's active state causes high CPU usage.

So would it instead be a good idea to partition the vector every few seconds on a worker thread?
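For reference, the per-frame partition described above presumably looks something like this minimal sketch (Particle and IsActive are taken from the question; everything else is illustrative):

#include <algorithm>
#include <vector>

struct Particle { bool IsActive = false; /* position, velocity, ... */ };

std::vector<Particle*> pool;

// Runs every frame: moves all active particles to the front of the pool.
// This walks the entire pool each frame, which is the cost in question.
void partitionPool()
{
    std::partition(pool.begin(), pool.end(),
                   [](const Particle* p) { return p->IsActive; });
}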


Comments (2)

南冥有猫 2025-01-23 14:12:56


Since the order of the active particles doesn't matter, and they don't become active or inactive very often, you can use a trick to keep the active particles at the front:

Suppose that nActive is the number of active particles.

When an inactive particle becomes active, do std::swap(vector[nParticle], vector[nActive]); nActive++; i.e. make it the last active particle. Whatever inactive particle was already in that slot goes to where the particle used to be, and we don't care, because we don't care about the order.

When an active particle becomes inactive, do nActive--; std::swap(vector[nParticle], vector[nActive]); i.e. make it the first inactive particle. Whatever active particle was already in that slot goes to where the particle used to be, and we don't care, because we don't care about the order.

You can remember this trick by noting that it's easy to make some particle active or inactive, if we don't care which one it is: we just add or subtract 1 from nActive. But then we don't know which particle we just changed, so we swap that particle with the one we actually meant to change, and we don't care where the other particle goes, as long as it stays inactive or stays active (doesn't change).

Like this:

+-----------+-----------+-----------+-----------+
| Particle0 | Particle1 | Particle2 | Particle3 |
| active    | active    | inactive  | inactive  |
+-----------+-----------+-----------+-----------+
                        ^
                        |
                    nActive==2

make particle 3 active: swap particle 3 and 2, then increment nActive:

+-----------+-----------+-----------+-----------+
| Particle0 | Particle1 | Particle3 | Particle2 |
| active    | active    | active    | inactive  |
+-----------+-----------+-----------+-----------+
                                    ^
                                    |
                                nActive==3

make particle 0 inactive: decrement nActive, then swap particle 3 (in slot 2) and 0.

+-----------+-----------+-----------+-----------+
| Particle3 | Particle1 | Particle0 | Particle2 |
| active    | active    | inactive  | inactive  |
+-----------+-----------+-----------+-----------+
                        ^
                        |
                    nActive==2

The particle order is all jumbled up now, but you don't care about that, right?
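A minimal sketch of the trick in code (the function names are illustrative, and the pool of Particle* is borrowed from the question, not from this answer):

#include <utility>
#include <vector>

struct Particle { /* position, velocity, ... */ };

std::vector<Particle*> particles;
size_t nActive = 0;  // particles[0 .. nActive-1] are the active ones

// Slot nParticle holds an inactive particle that just became active.
void activate(size_t nParticle)
{
    std::swap(particles[nParticle], particles[nActive]);
    ++nActive;  // it is now the last active particle
}

// Slot nParticle holds an active particle that just became inactive.
void deactivate(size_t nParticle)
{
    --nActive;
    std::swap(particles[nParticle], particles[nActive]);  // now the first inactive one
}

Note that each swap moves two particles between slots, so anything that refers to a particle by its index has to be updated after the swap.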

行至春深 2025-01-23 14:12:56


While user253751's answer is absolutely correct for optimizing your partitioning scheme, it sounds like you're going to have a larger issue relating to the number of particles and cache coherency.

I'm guessing that the data structure being used is something along the lines of

#include <vector>

// vec2 and vec4 are assumed to be simple packed-float vector types
// (e.g. from GLM) with the usual arithmetic operators.
struct Particle
{
    vec2 position;
    vec2 velocity;
    float scale;
    float scaleVelocity;
    float lifeRemaining;
    vec4 color;
    // Other attributes
};

std::vector<Particle> particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles[i].position += particles[i].velocity;
        particles[i].scale += particles[i].scaleVelocity;
        particles[i].lifeRemaining -= delta;
    }
}

This results in each particle's fields being packed one after another. This will have worse performance in a multithreaded environment, since writes from different cores can land in the same cache line and force synchronization between them (false sharing), assuming you are processing each component separately instead of in chunks; either way, being able to vectorize will still be better.

Position   | Velocity   | Scale   | Scale Velocity  | Life Remaining  | Color
position_0 | velocity_0 | scale_0 | scaleVelocity_0 | lifeRemaining_0 | color_0
position_1 | velocity_1 | scale_1 | scaleVelocity_1 | lifeRemaining_1 | color_1
position_2 | velocity_2 | scale_2 | scaleVelocity_2 | lifeRemaining_2 | color_2
position_3 | velocity_3 | scale_3 | scaleVelocity_3 | lifeRemaining_3 | color_3

This has poor cache coherency and isn't conducive to vectorization.

In this case you can likely substantially improve performance by switching to a Data Oriented Programming approach similar to

struct ParticleContainer
{
    std::vector<vec2> position;
    std::vector<vec2> velocity;
    std::vector<float> scale;
    std::vector<float> scaleVelocity;
    std::vector<float> lifeRemaining;
    std::vector<vec4> color;
};

ParticleContainer particles;

void updateParticles(const float delta)
{
    // Each loop streams through one or two contiguous arrays.
    for(size_t i = 0; i < particles.position.size(); ++i)
    {
        particles.position[i] += particles.velocity[i];
    }

    for(size_t i = 0; i < particles.scale.size(); ++i)
    {
        particles.scale[i] += particles.scaleVelocity[i];
    }

    for(size_t i = 0; i < particles.lifeRemaining.size(); ++i)
    {
        particles.lifeRemaining[i] -= delta;
    }
}

This results in identical elements being packed together, allowing you to operate on multiple of them at a time

Particle 0      | Particle 1      | Particle 2      | Particle 3
position_0      | position_1      | position_2      | position_3
velocity_0      | velocity_1      | velocity_2      | velocity_3
scale_0         | scale_1         | scale_2         | scale_3
scaleVelocity_0 | scaleVelocity_1 | scaleVelocity_2 | scaleVelocity_3
lifeRemaining_0 | lifeRemaining_1 | lifeRemaining_2 | lifeRemaining_3
color_0         | color_1         | color_2         | color_3

This memory layout is much more conducive to vectorization and can drastically improve performance due to cache coherency. It also gives you the option to manually vectorize your code in something similar to

#include <immintrin.h>

void updateParticles(const float delta)
{
    // A vec2 of floats is 64 bits, so 2 of them fit in a 128-bit vector.
    // This assumes vec2 is exactly two tightly packed floats.
    for(size_t i = 0; i + 1 < particles.position.size(); i += 2)
    {
        const __m128 positionVector = _mm_loadu_ps(reinterpret_cast<const float*>(&particles.position[i]));
        const __m128 velocityVector = _mm_loadu_ps(reinterpret_cast<const float*>(&particles.velocity[i]));
        const __m128 newPosition = _mm_add_ps(positionVector, velocityVector);
        _mm_storeu_ps(reinterpret_cast<float*>(&particles.position[i]), newPosition);
    }

    // A float is 32 bits, so 4 of them fit in a 128-bit vector.
    for(size_t i = 0; i + 3 < particles.scale.size(); i += 4)
    {
        const __m128 scaleVector = _mm_loadu_ps(&particles.scale[i]);
        const __m128 scaleVelocityVector = _mm_loadu_ps(&particles.scaleVelocity[i]);
        const __m128 newScale = _mm_add_ps(scaleVector, scaleVelocityVector);
        _mm_storeu_ps(&particles.scale[i], newScale);
    }

    const __m128 deltaVector = _mm_set_ps1(delta);

    // A float is 32 bits, so 4 of them fit in a 128-bit vector.
    for(size_t i = 0; i + 3 < particles.lifeRemaining.size(); i += 4)
    {
        const __m128 lifeRemainingVector = _mm_loadu_ps(&particles.lifeRemaining[i]);
        const __m128 newLifeRemaining = _mm_sub_ps(lifeRemainingVector, deltaVector);
        _mm_storeu_ps(&particles.lifeRemaining[i], newLifeRemaining);
    }
}

This code obviously requires some sanity checks, such as ensuring that each array size is a multiple of the vector width, or handling the leftover tail elements with a scalar loop. Additionally, if you can ensure that the memory starts on a 16-byte boundary, you can use the aligned load and store functions. You can also use AVX and AVX2 256-bit registers if those are available to you.
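For example, a minimal sketch of handling a size that isn't a multiple of the SIMD width, using the lifeRemaining array from above (the scalar tail loop is illustrative, not part of the original answer):

// SIMD over the bulk of the array, scalar over the remaining tail.
void updateLifeRemaining(const float delta)
{
    const size_t n = particles.lifeRemaining.size();
    const size_t simdEnd = n - (n % 4);  // largest multiple of 4 <= n
    const __m128 deltaVector = _mm_set_ps1(delta);

    for(size_t i = 0; i < simdEnd; i += 4)
    {
        const __m128 v = _mm_loadu_ps(&particles.lifeRemaining[i]);
        _mm_storeu_ps(&particles.lifeRemaining[i], _mm_sub_ps(v, deltaVector));
    }

    for(size_t i = simdEnd; i < n; ++i)  // scalar tail, n % 4 elements
    {
        particles.lifeRemaining[i] -= delta;
    }
}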

Memory access is the slowest part of a modern computer, and laying data out this way ensures that as much data as possible is operated on at a time. It also makes it easier for the CPU to prefetch cache lines eagerly. You can now use this in a multicore environment as well, with substantially better performance and without clobbering cache lines; that can be done with multiple threads on a CPU, or performed easily on a GPGPU using something like CUDA, OpenCL, or compute shaders.
