Efficiency of branching in shaders

Posted 2024-10-02 07:25:27

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical / has practical experience on this topic, it would be great if you share it.

I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.

I have diffuse, normal and specular maps for each of three possible mapping planes, and for faces close to the viewer I also have to apply mapping techniques that bring a lot of extra texture lookups (such as parallax occlusion mapping).

Profiling showed that texture lookups are the bottleneck of the shader, and I would like to remove some of them. For some values of the input parameters I already know that part of the texture lookups is unnecessary, and the obvious solution is to do something like this (pseudocode):

if (part_actually_needed) {
   perform lookups;
   perform other steps specific for THIS PART;
}

// All other parts.

Now - here comes the question.

I do not remember exactly (that's why I said the question might be ungrounded), but a paper I recently read (unfortunately, I can't remember its name) stated something similar to the following:

The performance of the presented technique depends on how efficiently the HARDWARE-BASED CONDITIONAL BRANCHING is implemented.

I remembered this statement just as I was about to start refactoring a large number of shaders and implementing the if-based optimization I described above.

So - right before I start doing that - does someone know something about the efficiency of the branching in shaders? Why could branching give a severe performance penalty in shaders?

And is it even possible that I could only worsen the actual performance with the if-based branching?

You might say: try it and see. Yes, that's what I'm going to do if nobody here helps me :)

Still, an if that is efficient on new GPUs could be a nightmare on slightly older ones, and that kind of issue is very hard to forecast unless you have a lot of different GPUs (which I don't).

So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.

The few remaining brain cells that are actually working keep telling me that branching on GPUs might be far less efficient than branching on the CPU (which usually has extremely efficient branch prediction and ways of avoiding cache misses), simply because it's a GPU (or because that could be hard or impossible to implement on a GPU).

Unfortunately I am not sure if this statement has anything in common with the real situation...

微暖i 2024-10-09 07:25:28

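In many cases, both branches can be computed and then blended, using the condition as the interpolation factor.
This approach can be much faster than a branch. It can be used on the CPU too.
For example:

...

vec3 c = vec3(1.0, 0.0, 0.0);
if (a == b) c = vec3(0.0, 1.0, 0.0);

can be replaced with:

vec3 c = mix(vec3(1.0, 0.0, 0.0), vec3(0.0, 1.0, 0.0), float(a == b));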

...

沧桑㈠ 2024-10-09 07:25:28

Here's a real-world performance benchmark on a Kindle Fire:

In the fragment shader...

This runs at 20fps:

lowp vec4 a = vec4(0.0, 0.0, 0.0, 0.0);
if (a.r == 0.0)
    gl_FragColor = texture2D ( texture1, TextureCoordOut );

This runs at 60fps:

gl_FragColor = texture2D ( texture1, TextureCoordOut );

桃扇骨 2024-10-09 07:25:28

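I don't know about the if-based optimizations, but how about creating all the permutations of texture lookups that you think you'll need, each as its own shader, and then using the right shader for the right situation (depending on which texture lookups a particular model, or part of a model, needs)? I think we did something like this on Bully for the Xbox 360.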
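
A minimal sketch of the permutation idea, assuming GLSL 3.30 and two hypothetical switches, USE_NORMAL_MAP and USE_SPECULAR_MAP, that the application would inject right after the #version line before compiling each variant. The sampler names, inputs and lighting math are placeholders and are not taken from the question's shader:

```glsl
#version 330 core

// The application is assumed to insert lines such as
//   #define USE_NORMAL_MAP
//   #define USE_SPECULAR_MAP
// here, directly after #version, once per compiled variant.

uniform sampler2D diffuseMap;
#ifdef USE_NORMAL_MAP
uniform sampler2D normalMap;
#endif
#ifdef USE_SPECULAR_MAP
uniform sampler2D specularMap;
#endif

in vec2 uv;
in vec3 lightDirTS;   // light direction in tangent space (hypothetical input)
out vec4 fragColor;

void main() {
    vec3 albedo = texture(diffuseMap, uv).rgb;

#ifdef USE_NORMAL_MAP
    // This lookup only exists in variants that need it.
    vec3 n = normalize(texture(normalMap, uv).xyz * 2.0 - 1.0);
#else
    vec3 n = vec3(0.0, 0.0, 1.0);
#endif

    float ndotl = max(dot(n, normalize(lightDirTS)), 0.0);
    vec3 color = albedo * ndotl;

#ifdef USE_SPECULAR_MAP
    // Placeholder specular term, only to show a second optional lookup.
    color += texture(specularMap, uv).rgb * pow(ndotl, 16.0);
#endif

    fragColor = vec4(color, 1.0);
}
```

Each variant contains no run-time condition at all; the price is a larger number of compiled programs and switching between them per draw call.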

笑着哭最痛 2024-10-09 07:25:27

Unfortunately, I think the real answer here is to test your specific case with a performance analyser on your target hardware, particularly given that it sounds like you're at the project optimisation stage. This is the only way to take into account both the fact that hardware changes frequently and the nature of the specific shader.

On a CPU, a mispredicted branch causes a pipeline flush, and since CPU pipelines are so deep, you'll effectively lose something on the order of 20 or more cycles. On the GPU things are a little different: the pipelines are likely to be far shallower, but there's no branch prediction, and all of the shader code will be in fast memory. But that's not the real difference.

It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is again designed to run multiple threads. My understanding is that each core expects to run the same instruction on all its threads on any given cycle (nVidia calls this collection of threads a "warp").

In this case, a thread might represent a vertex, a geometry element or a pixel/fragment, and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is that if, within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so. It's also likely to be handled slightly differently for each successive generation of cards. The newest, most general CUDA/compute-shader friendly nVidias might have the best implementation; older cards might have a poorer implementation. The worst case is that you may find many threads executing both sides of if/else statements.
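
To illustrate the difference between coherent and divergent conditions, here is a sketch assuming GLSL 3.30; all uniform and input names are hypothetical. A condition on a uniform takes the same path for every fragment of a draw call, so a group of threads never splits on it, while a condition on a per-fragment value can split a group wherever the value crosses the threshold:

```glsl
#version 330 core

uniform sampler2D baseMap;
uniform sampler2D detailMap;
uniform sampler2D nearMap;
uniform bool useDetail;       // same value for every fragment in the draw call
uniform float nearThreshold;  // hypothetical distance cut-off

in vec2 uv;
in float viewDistance;        // varies from fragment to fragment
out vec4 fragColor;

void main() {
    vec4 color = texture(baseMap, uv);

    // Coherent branch: every thread in a group makes the same decision,
    // so only one side is ever executed.
    if (useDetail) {
        color.rgb *= texture(detailMap, uv * 8.0).rgb;
    }

    // Potentially divergent branch: neighbouring fragments may disagree.
    // If they do, the group may end up stepping through the body while
    // the threads that did not take the branch sit idle.
    if (viewDistance < nearThreshold) {
        // An explicit LOD avoids relying on implicit derivatives inside
        // non-uniform control flow.
        color.rgb += textureLod(nearMap, uv, 0.0).rgb;
    }

    fragColor = color;
}
```

If the near region covers large contiguous areas of the screen, most groups are likely to take one side only, and divergence is limited to the groups that straddle the boundary.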

One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.

Also note that you can explicitly mark if statements in DirectX (HLSL) shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes the instructions on both sides. If you don't explicitly choose one, the compiler can choose one for you, and it may pick [flatten], which is no good for your example.

One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You may get compiler errors, and it's best not to do so; otherwise you might miss out on some of the better texturing support.
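
One possible way around the derivative problem, sketched under the assumption of GLSL 3.30 (where dFdx, dFdy and textureGrad are available) and with hypothetical names: compute the derivatives before the branch, where every fragment executes them, and pass them to the lookup explicitly:

```glsl
#version 330 core

uniform sampler2D detailMap;
uniform float nearThreshold;  // hypothetical distance cut-off

in vec2 uv;
in float viewDistance;        // varies from fragment to fragment
out vec4 fragColor;

void main() {
    // Derivatives are taken outside the branch, in uniform control flow,
    // so they are well defined for every fragment.
    vec2 duvdx = dFdx(uv);
    vec2 duvdy = dFdy(uv);

    vec4 color = vec4(0.5, 0.5, 0.5, 1.0);  // fallback when the lookup is skipped

    if (viewDistance < nearThreshold) {
        // textureGrad uses the supplied derivatives for mip selection
        // instead of computing them implicitly inside the branch.
        color = textureGrad(detailMap, uv, duvdx, duvdy);
    }

    fragColor = color;
}
```

This keeps mip-mapped filtering for the conditional lookup, at the cost of computing two derivatives that are unused whenever the branch is skipped.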
