Optimizing compute shaders
I have been doing a lot of different computations in compute shaders in OpenGL for the last couple of months. Some work fine, others are slow, some I could optimize somewhat, others again I could not optimize whatsoever.
I have been playing around with the simple code below (gravitational forces between n
particles), just to find some strategies on how to increase performance in general, but absolutely nothing works:
#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    int n;
    double dist3, dist2;
    dvec3 a, diff, r = dvec3(rIn[gl_GlobalInvocationID.x * 3 + 0], rIn[gl_GlobalInvocationID.x * 3 + 1], rIn[gl_GlobalInvocationID.x * 3 + 2]);

    a.x = a.y = a.z = 0;
    for (n = 0; n < NumParticles; n++)
    {
        if (n != gl_GlobalInvocationID.x)
        {
            diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            dist2 = dot(diff, diff);
            dist3 = 1.0 / (sqrt(dist2) * dist2);
            a += diff * dist3;
        }
    }
    aOut[gl_GlobalInvocationID.x * 3 + 0] = a.x;
    aOut[gl_GlobalInvocationID.x * 3 + 1] = a.y;
    aOut[gl_GlobalInvocationID.x * 3 + 2] = a.z;
}
I have the strong suspicion that it is the sheer amount of memory access that slows this code down. So one thing I tried was using a shared variable as a "buffer": letting the first thread (gl_LocalInvocationID.x == 0) read the first (for example) 1024 particles, letting all threads do their calculations, then loading the next 1024, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles, and I use a lot more, so this was just to see if it made a difference), which changed absolutely nothing.
I can provide some code for the above examples, but I don't think it would be helpful.
I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, or not computing particle n <-> particle m when m <-> n has already been computed...), but I would be interested in a general approach for compute shaders.
So can anybody give me any hints for how I could improve performance for this code? I couldn't really find anything online on how to improve performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.
1 Answer
This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.
If you want to keep the algorithm as is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle all at once. That's a huge number of memory operations happening all at once.
Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read the test data quickly. A single work group therefore only does a portion of the tests for each block of source particles.

The big difficulty now is writing the data. Since multiple work groups are potentially writing the accumulated forces to the same source particles, you need some mechanism to either atomically accumulate into the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffer and combine the data in a reduction.
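A hypothetical sketch of that blocked scheme (uniform names, the binding-2 buffer, and the 2D dispatch are all made up for illustration; it assumes NumParticles is a multiple of the work-group size):

```glsl
#version 450 core
layout (local_size_x = 128) in;

uniform uint NumParticles;
uniform uint NumBlocks;   // NumParticles / 128; dispatched as (NumBlocks, NumBlocks, 1)
layout (std430, binding = 0) readonly  buffer Positions { double rIn[]; };
// NumBlocks partial dvec3 results per particle, combined by a later reduction pass.
layout (std430, binding = 2) writeonly buffer Partials  { double aPartial[]; };

shared dvec3 testTile[gl_WorkGroupSize.x];

void main()
{
    // 2D dispatch: x picks the source block, y picks the test block.
    uint srcId    = gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x;
    uint testBase = gl_WorkGroupID.y * gl_WorkGroupSize.x;

    // Cooperative load of the test block into shared memory.
    uint t = testBase + gl_LocalInvocationID.x;
    testTile[gl_LocalInvocationID.x] =
        dvec3(rIn[t * 3 + 0], rIn[t * 3 + 1], rIn[t * 3 + 2]);
    barrier();

    dvec3 r = dvec3(rIn[srcId * 3 + 0], rIn[srcId * 3 + 1], rIn[srcId * 3 + 2]);
    dvec3 a = dvec3(0.0);
    for (uint i = 0; i < gl_WorkGroupSize.x; i++)
    {
        if (testBase + i != srcId)
        {
            dvec3 diff = testTile[i] - r;
            double dist2 = dot(diff, diff);
            a += diff * (inversesqrt(dist2) / dist2);
        }
    }

    // One partial result per (source particle, test block): no atomics needed.
    uint slot = (srcId * NumBlocks + gl_WorkGroupID.y) * 3;
    aPartial[slot + 0] = a.x;
    aPartial[slot + 1] = a.y;
    aPartial[slot + 2] = a.z;
}
```

The reduction pass then sums the NumBlocks partial vectors for each particle into the final acceleration buffer.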