How can I improve the performance of my custom OpenGL ES 2.0 depth texture generation?
I have an open source iOS application that uses custom OpenGL ES 2.0 shaders to display 3-D representations of molecular structures. It does this by using procedurally generated sphere and cylinder impostors drawn over rectangles, instead of these same shapes built using lots of vertices. The downside to this approach is that the depth values for each fragment of these impostor objects need to be calculated in a fragment shader, to be used when objects overlap.
Unfortunately, OpenGL ES 2.0 does not let you write to gl_FragDepth, so I've needed to output these values to a custom depth texture. I do a pass over my scene using a framebuffer object (FBO), only rendering out a color that corresponds to a depth value, with the results being stored into a texture. This texture is then loaded into the second half of my rendering process, where the actual screen image is generated. If a fragment at that stage is at the depth level stored in the depth texture for that point on the screen, it is displayed. If not, it is tossed. More about the process, including diagrams, can be found in my post here.
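For illustration, the second-pass test might look roughly like this in a fragment shader (a sketch only; the names are hypothetical, and the decode assumes the three-channel encoding used in the sphere depth shader shown later):

precision mediump float;

// Hypothetical varyings: where this fragment lands in the depth texture, and
// a stand-in for its own depth (the real app computes this per-fragment).
varying mediump vec2 screenTextureCoordinate;
varying mediump float currentFragmentDepth;

uniform sampler2D sphereDepthTexture; // hypothetical: the texture rendered in the first pass

void main()
{
    // The three encoded channels sum to 3x the stored depth value
    float storedDepth = dot(texture2D(sphereDepthTexture, screenTextureCoordinate).rgb, vec3(1.0)) / 3.0;

    // Small bias so fragments at the stored depth survive the comparison
    if (currentFragmentDepth > storedDepth + 0.004)
    {
        discard;
    }

    gl_FragColor = vec4(0.5, 0.5, 0.5, 1.0); // placeholder shading
}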
The generation of this depth texture is a bottleneck in my rendering process and I'm looking for a way to make it faster. It seems slower than it should be, but I can't figure out why. In order to achieve the proper generation of this depth texture, GL_DEPTH_TEST is disabled, GL_BLEND is enabled with glBlendFunc(GL_ONE, GL_ONE), and glBlendEquation() is set to GL_MIN_EXT. I know that a scene output in this manner isn't the fastest on a tile-based deferred renderer like the PowerVR series in iOS devices, but I can't think of a better way to do this.
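For reference, a minimal sketch of that state setup (the FBO and texture creation are assumed to happen elsewhere; GL_MIN_EXT comes from the GL_EXT_blend_minmax extension):

// Depth-encoding pass: render min-blended depth colors into the FBO's texture.
glBindFramebuffer(GL_FRAMEBUFFER, depthPassFramebuffer); // hypothetical handle

glDisable(GL_DEPTH_TEST);      // no hardware depth buffer is used in this pass
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);
glBlendEquation(GL_MIN_EXT);   // keep the nearest (smallest) encoded depth

// ... draw the sphere and cylinder impostors with the depth-encoding shader ...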
My depth fragment shader for spheres (the most common display element) looks to be at the heart of this bottleneck (Renderer Utilization in Instruments is pegged at 99%, indicating that I'm limited by fragment processing). It currently looks like the following:
precision mediump float;
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;
void main()
{
float distanceFromCenter = length(impostorSpaceCoordinate);
if (distanceFromCenter > 1.0)
{
gl_FragColor = vec4(1.0);
}
else
{
float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter);
mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;
// Inlined color encoding for the depth values
float ceiledValue = ceil(currentDepthValue * 765.0);
vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - stepValues;
gl_FragColor = vec4(intDepthValue, 1.0);
}
}
On an iPad 1, this takes 35 - 68 ms to render a frame of a DNA spacefilling model using a passthrough shader for display (18 to 35 ms on iPhone 4). According to the PowerVR PVRUniSCo compiler (part of their SDK), this shader uses 11 GPU cycles at best, 16 cycles at worst. I'm aware that you're advised not to use branching in a shader, but in this case that led to better performance than otherwise.
When I simplify it to
precision mediump float;
varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
void main()
{
gl_FragColor = vec4(adjustedSphereRadius * normalizedDepth * (impostorSpaceCoordinate + 1.0) / 2.0, normalizedDepth, 1.0);
}
it takes 18 - 35 ms on iPad 1, but only 1.7 - 2.4 ms on iPhone 4. The estimated GPU cycle count for this shader is 8 cycles. The render time clearly doesn't scale linearly with cycle count.
Finally, if I just output a constant color:
precision mediump float;
void main()
{
gl_FragColor = vec4(0.5, 0.5, 0.5, 1.0);
}
the rendering time drops to 1.1 - 2.3 ms on iPad 1 (1.3 ms on iPhone 4).
The nonlinear scaling in rendering time and sudden change between iPad and iPhone 4 for the second shader makes me think that there's something I'm missing here. A full source project containing these three shader variants (look in the SphereDepth.fsh file and comment out the appropriate sections) and a test model can be downloaded from here, if you wish to try this out yourself.
If you've read this far, my question is: based on this profiling information, how can I improve the rendering performance of my custom depth shader on iOS devices?
4 Answers
Based on the recommendations by Tommy, Pivot, and rotoglup, I've implemented some optimizations which have led to a doubling of the rendering speed for both the depth texture generation and the overall rendering pipeline in the application.
First, I re-enabled the precalculated sphere depth and lighting texture that I'd used before with little effect, only now I use proper lowp precision values when handling the colors and other values from that texture. This combination, along with proper mipmapping for the texture, seems to yield a ~10% performance boost.

More importantly, I now do a pass before rendering both my depth texture and the final raytraced impostors where I lay down some opaque geometry to block pixels that would never be rendered. To do this, I enable depth testing and then draw out the squares that make up the objects in my scene, shrunken by sqrt(2) / 2, with a simple opaque shader. This creates inset squares covering the area known to be opaque in a represented sphere.
I then disable depth writes using glDepthMask(GL_FALSE) and render the square sphere impostors at a location closer to the user by one radius. This allows the tile-based deferred rendering hardware in iOS devices to efficiently strip out fragments that would never appear onscreen under any conditions, yet still gives smooth intersections between the visible sphere impostors based on per-pixel depth values. This is depicted in my crude illustration (not reproduced here): in this example, the opaque blocking squares for the top two impostors do not prevent any of the fragments from those visible objects from being rendered, yet they block a chunk of the fragments from the lowest impostor. The frontmost impostors can then use per-pixel tests to generate a smooth intersection, while many of the pixels from the rear impostor don't waste GPU cycles by being rendered.
I hadn't thought to disable depth writes, yet leave on depth testing when doing the last rendering stage. This is the key to preventing the impostors from simply stacking on one another, yet still using some of the hardware optimizations within the PowerVR GPUs.
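A rough sketch of that draw sequence, with hypothetical helper names (the min-blend state from the question is assumed for the depth-encoding draw):

// Pass 1: opaque inset squares (each impostor shrunk by sqrt(2) / 2),
// drawn with a simple opaque shader and full depth writes.
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
drawInsetBlockingSquares();   // hypothetical helper

// Pass 2: the full-size impostors, offset one radius toward the viewer,
// depth-tested against pass 1 but not writing depth themselves.
glDepthMask(GL_FALSE);
drawSphereImpostors();        // hypothetical helper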
In my benchmarks, rendering the test model I used above yields times of 18 - 35 ms per frame, as compared to the 35 - 68 ms I was getting previously, a near doubling in rendering speed. Applying this same opaque geometry pre-rendering to the raytracing pass yields a doubling in overall rendering performance.
Oddly, when I tried to refine this further by using inset and circumscribed octagons, which should cover ~17% fewer pixels when drawn and be more efficient at blocking fragments, performance was actually worse than when using simple squares for this. Tiler utilization was still less than 60% in the worst case, so maybe the larger geometry was resulting in more cache misses.
EDIT (5/31/2011):
Based on Pivot's suggestion, I created inscribed and circumscribed octagons to use instead of my rectangles, only I followed the recommendations here for optimizing triangles for rasterization. In previous testing, octagons yielded worse performance than squares, despite removing many unnecessary fragments and letting you block covered fragments more efficiently. By adjusting the triangle drawing as recommended there, I was able to reduce overall rendering time by an average of 14% on top of the above-described optimizations by switching to octagons from squares. The depth texture is now generated in 19 ms, with occasional dips to 2 ms and spikes to 35 ms.
EDIT 2 (5/31/2011):
I've revisited Tommy's idea of using the step function, now that I have fewer fragments to discard due to the octagons. This, combined with a depth lookup texture for the sphere, now leads to a 2 ms average rendering time on the iPad 1 for the depth texture generation for my test model. I consider that to be about as good as I could hope for in this rendering case, and a giant improvement from where I started. For posterity, here is the depth shader I'm now using:
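(Reconstructed sketch: step()-based rejection combined with the precalculated sphere depth lookup texture; hypothetical names, and possibly not the exact original listing.)

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
// Hypothetical varying: lookup coordinate computed in the vertex shader
// to avoid a dependent texture read on the PowerVR GPUs.
varying mediump vec2 depthLookupCoordinate;

uniform lowp sampler2D sphereDepthMap; // hypothetical: precalculated sphere depth texture

const lowp vec3 stepValues = vec3(2.0, 1.0, 0.0);

void main()
{
    // .x holds the precalculated sqrt(1.0 - d * d) term, .y the circle coverage
    lowp vec2 precalculatedDepthAndAlpha = texture2D(sphereDepthMap, depthLookupCoordinate).ra;

    // step() instead of a branch: 1.0 inside the circle, 0.0 outside
    float inCircleMultiplier = step(0.5, precalculatedDepthAndAlpha.y);

    float currentDepthValue = normalizedDepth - adjustedSphereRadius * precalculatedDepthAndAlpha.x;

    // Inlined color encoding of the depth across three channels (765.0 / 255.0 = 3.0),
    // zeroed outside the circle so the result below saturates to white there.
    mediump vec3 intDepthValue = (vec3(currentDepthValue * 3.0) - stepValues) * inCircleMultiplier;

    // White (the farthest value under min blending) outside, encoded depth inside.
    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}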
I've updated the testing sample here, if you wish to see this new approach in action as compared to what I was doing initially.
I'm still open to other suggestions, but this is a huge step forward for this application.
On the desktop, it was the case on many early programmable devices that while they could process 8 or 16 or whatever fragments simultaneously, they effectively had only one program counter for the lot of them (since that also implies only one fetch/decode unit and one of everything else, as long as they work in units of 8 or 16 pixels). Hence the initial prohibition on conditionals and, for a while after that, the situation where if the conditional evaluations for pixels that would be processed together returned different values, those pixels would be processed in smaller groups in some arrangement.
Although PowerVR aren't explicit, their application development recommendations have a section on flow control and make a lot of recommendations about dynamic branches usually being a good idea only where the result is reasonably predictable, which makes me think they're getting at the same sort of thing. I'd therefore suggest that the speed disparity may be because you've included a conditional.
As a first test, what happens if you try the following?
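For example, a branch-free variant of the original sphere depth shader along these lines (a sketch of the suggestion, using step() and mix() rather than the answer's exact code):

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;

void main()
{
    // Squared distance avoids the sqrt hidden inside length()
    float distanceSquared = dot(impostorSpaceCoordinate, impostorSpaceCoordinate);

    // step() instead of a branch: 1.0 inside the unit circle, 0.0 outside
    float inCircle = step(distanceSquared, 1.0);

    // max() keeps the sqrt argument non-negative outside the circle
    float calculatedDepth = sqrt(max(1.0 - distanceSquared, 0.0));
    float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;

    // Inlined color encoding for the depth values, as in the original shader
    float ceiledValue = ceil(currentDepthValue * 765.0);
    vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - stepValues;

    // Select encoded depth inside the circle, white outside, without branching
    gl_FragColor = mix(vec4(1.0), vec4(intDepthValue, 1.0), inCircle);
}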
Many of these points have been covered by others who have posted answers, but the overarching theme here is that your rendering does a lot of work that will be thrown away:
1. The shader itself does some potentially redundant work. The length of a vector is likely to be calculated as sqrt(dot(vector, vector)). You don’t need the sqrt to reject fragments outside of the circle, and you’re squaring the length to calculate the depth, anyway. Additionally, have you looked at whether or not explicit quantization of the depth values is actually necessary, or can you get away with just using the hardware’s conversion from floating-point to integer for the framebuffer (potentially with an additional bias to make sure your quasi-depth tests come out right later)?

2. Many fragments are trivially outside the circle. Only π/4 of the area of the quads you’re drawing produces useful depth values. At this point, I imagine your app is heavily skewed towards fragment processing, so you may want to consider increasing the number of vertices you draw in exchange for a reduction in the area that you have to shade. Since you’re drawing spheres through an orthographic projection, any circumscribing regular polygon will do, although you may need a little extra size depending on zoom level to make sure you rasterize enough pixels (see the sketch after this list).

3. Many fragments are trivially occluded by other fragments. As others have pointed out, you’re not using the hardware depth test, and therefore not taking full advantage of a TBDR’s ability to kill shading work early. If you’ve already implemented something for 2), all you need to do is draw an inscribed regular polygon at the maximum depth that you can generate (a plane through the middle of the sphere), and draw your real polygon at the minimum depth (the front of the sphere). Both Tommy’s and rotoglup’s posts already contain the state vector specifics.
Note that 2) and 3) apply to your raytracing shaders as well.
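To put a number on point 2: a regular n-gon circumscribing a circle of radius r needs vertex radius r / cos(pi/n), and for n = 8 its area is 8 * r^2 * tan(pi/8), about 3.31 r^2 versus 4 r^2 for a square, roughly 17% fewer fragments to shade. A minimal sketch of building such an octagon (hypothetical helper, plain C):

#include <math.h>

// Fill vertices[16] with the x,y corners of a regular octagon that
// circumscribes a circle of the given radius (impostor space).
void buildCircumscribedOctagon(float radius, float vertices[16])
{
    float vertexRadius = radius / cosf((float)M_PI / 8.0f);
    for (int i = 0; i < 8; i++)
    {
        // Offset by half a step so edge midpoints, not vertices, touch the axes.
        float angle = ((float)i + 0.5f) * (2.0f * (float)M_PI / 8.0f);
        vertices[2 * i]     = vertexRadius * cosf(angle);
        vertices[2 * i + 1] = vertexRadius * sinf(angle);
    }
}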
I'm no mobile platform expert at all, but I think that what bites you is that your depth shader is quite expensive, and that you see massive overdraw in your depth pass because the GL_DEPTH test is disabled. Wouldn't an additional pass, drawn before the depth computation pass, be helpful?
This pass could do a GL_DEPTH prefill, for example by drawing each sphere represented as a camera-facing quad (or a cube, which may be easier to set up), contained within the associated sphere. This pass could be drawn without a color mask or fragment shader, just with GL_DEPTH_TEST and glDepthMask enabled. On desktop platforms, these kinds of passes get drawn faster than color + depth passes.

Then in your depth computation pass, you could enable GL_DEPTH_TEST and disable glDepthMask; this way your shader would not be executed on pixels that are hidden by nearer geometry.

This solution would involve issuing another set of draw calls, so it may not be beneficial.
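A minimal sketch of the suggested state changes, with hypothetical draw helpers (note that OpenGL ES 2.0 still requires some fragment shader to be bound, even for a depth-only pass):

// Prefill pass: depth writes only, no color writes, trivial shading.
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawInscribedSphereProxies();   // hypothetical: quads or cubes inside each sphere

// Depth computation pass: test against the prefilled depth, don't write it.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
drawDepthImpostors();           // hypothetical: the expensive depth-encoding shader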