Optimizing compute shaders
I have been doing a lot of different computations in compute shaders in OpenGL for the last couple of months. Some work fine, others are slow, some I could optimize somewhat, others again I could not optimize whatsoever.
I have been playing around with the simple code below (gravitational forces between n
particles), just to find some strategies on how to increase performance in general, but absolutely nothing works:
#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    int n;
    double dist3, dist2;
    dvec3 a, diff, r = dvec3(rIn[gl_GlobalInvocationID.x * 3 + 0], rIn[gl_GlobalInvocationID.x * 3 + 1], rIn[gl_GlobalInvocationID.x * 3 + 2]);

    a.x = a.y = a.z = 0;
    for (n = 0; n < NumParticles; n++)
    {
        if (n != gl_GlobalInvocationID.x)
        {
            diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            dist2 = dot(diff, diff);
            dist3 = 1.0 / (sqrt(dist2) * dist2);
            a += diff * dist3;
        }
    }
    aOut[gl_GlobalInvocationID.x * 3 + 0] = a.x;
    aOut[gl_GlobalInvocationID.x * 3 + 1] = a.y;
    aOut[gl_GlobalInvocationID.x * 3 + 2] = a.z;
}
I have the strong suspicion that it is the sheer amount of memory access that slows this code down. So one thing I tried was using a shared variable as a "buffer": letting the first thread (gl_LocalInvocationID.x == 0) read the first (for example) 1024 particles, letting all threads do their calculations, then loading the next 1024, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles, and I use a lot more, so this was just to see if it made a difference), which changed absolutely nothing.
I can provide some code for the above examples, but I don't think it would be helpful.
I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, or not computing particle n <-> particle m when m <-> n has already been computed...), but I would be interested in a general approach for compute shaders.
So can anybody give me any hints for how I could improve performance for this code? I couldn't really find anything online on how to improve performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.
1 Answer
This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.
If you want to keep the algorithm as is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle all at once. That's a huge number of memory operations happening all at once.
Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read the test data quickly. A single work group therefore only does a portion of the tests for each block of source particles.

The big difficulty now is writing the data. Since multiple work groups are potentially writing the accumulated forces to the same source particles, you need some mechanism to either atomically accumulate into the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffer and combine the data in a reduction.
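A hypothetical sketch of that blocked scheme (uniform names, the binding-2 buffer, and the 2D dispatch are all made up for illustration; it assumes NumParticles is a multiple of the work-group size):

```glsl
#version 450 core
layout (local_size_x = 128) in;

uniform uint NumParticles;
uniform uint NumBlocks;   // NumParticles / 128; dispatched as (NumBlocks, NumBlocks, 1)
layout (std430, binding = 0) readonly  buffer Positions { double rIn[]; };
// NumBlocks partial dvec3 results per particle, combined by a later reduction pass.
layout (std430, binding = 2) writeonly buffer Partials  { double aPartial[]; };

shared dvec3 testTile[gl_WorkGroupSize.x];

void main()
{
    // 2D dispatch: x picks the source block, y picks the test block.
    uint srcId    = gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x;
    uint testBase = gl_WorkGroupID.y * gl_WorkGroupSize.x;

    // Cooperative load of the test block into shared memory.
    uint t = testBase + gl_LocalInvocationID.x;
    testTile[gl_LocalInvocationID.x] =
        dvec3(rIn[t * 3 + 0], rIn[t * 3 + 1], rIn[t * 3 + 2]);
    barrier();

    dvec3 r = dvec3(rIn[srcId * 3 + 0], rIn[srcId * 3 + 1], rIn[srcId * 3 + 2]);
    dvec3 a = dvec3(0.0);
    for (uint i = 0; i < gl_WorkGroupSize.x; i++)
    {
        if (testBase + i != srcId)
        {
            dvec3 diff = testTile[i] - r;
            double dist2 = dot(diff, diff);
            a += diff * (inversesqrt(dist2) / dist2);
        }
    }

    // One partial result per (source particle, test block): no atomics needed.
    uint slot = (srcId * NumBlocks + gl_WorkGroupID.y) * 3;
    aPartial[slot + 0] = a.x;
    aPartial[slot + 1] = a.y;
    aPartial[slot + 2] = a.z;
}
```

The reduction pass then sums the NumBlocks partial vectors for each particle into the final acceleration buffer.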