OpenGL low-level performance questions
This subject, as with any optimisation problem, gets hit on a lot, but I just couldn't find what I (think) I want.
A lot of tutorials, and even SO questions have similar tips; generally covering:
- Use GL face culling (the OpenGL function, not the scene logic)
- Only send 1 matrix to the GPU (projectionModelView combination), therefore decreasing the MVP calculations from per vertex to once per model (as it should be).
- Use interleaved Vertices
- Minimize as many GL calls as possible, batch where appropriate
And possibly a few/many others. I am (out of curiosity) rendering 28 million triangles in my application using several vertex buffers. I have tried all of the above techniques (to the best of my knowledge) and saw almost no change in performance.
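For reference, a minimal sketch of how I understand the first two tips, assuming a core-profile context with a loader (glad/GLEW) already in place and GLM for the CPU-side matrix math; the uniform name `uMVP` is purely illustrative:

```cpp
#include <glm/glm.hpp>              // assumption: GLM handles the matrix math
#include <glm/gtc/type_ptr.hpp>

// Called once per model, per frame.
void applyBasicTips(GLuint program,
                    const glm::mat4& projection,
                    const glm::mat4& view,
                    const glm::mat4& model)
{
    // Tip 1: let GL discard back faces instead of handling it in scene logic.
    glEnable(GL_CULL_FACE);
    glCullFace(GL_BACK);

    // Tip 2: combine projection * view * model once on the CPU and upload a
    // single uniform, rather than multiplying three matrices per vertex.
    const glm::mat4 mvp = projection * view * model;
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"),
                       1, GL_FALSE, glm::value_ptr(mvp));
}
```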
Whilst I am getting around 40 FPS in my implementation, which is by no means problematic, I am still curious where these optimisation 'tips' actually come into play.
My CPU idles at around 20-50% during rendering, so I assume the GPU is where any further performance gains would have to come from.
Note: I am looking into gDEBugger at the moment
Cross posted at Game Development
4 Answers
Point 1 is obvious, as it saves fill rate. If the primitives of an object's back side get processed first, culling will omit those faces. However, modern GPUs tolerate overdraw quite well. I once measured (on a GeForce 8800 GTX) up to 20% overdraw before a significant performance hit. But it's better to save this reserve for things like occlusion culling, rendering of blended geometry and the like.
Point 2 is, well, pointless. The matrices have never been calculated on the GPU – well, unless you count the SGI Onyx. Matrices were always just a kind of global rendering parameter, calculated on the CPU and then pushed into global registers on the GPU, nowadays called uniforms, so joining them has only very little benefit. In the shader it saves just one additional vector-matrix multiplication (which boils down to 4 MAD instructions), at the expense of less algorithmic flexibility.
Point 3 is all about cache efficiency. Data belonging together should fit into a cache line.
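As an illustration (a sketch only; the attribute layout and locations are assumed, and a GL loader header is presumed to be included), an interleaved layout keeps every attribute of a vertex in adjacent bytes, so fetching one vertex touches a single region of memory rather than three separate arrays:

```cpp
#include <cstddef>   // offsetof

// 32 bytes per vertex; all attributes of a vertex sit side by side.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};

void setupInterleavedAttributes(GLuint vbo)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // One stride for everything; the offsets select each attribute within the struct.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void*)offsetof(Vertex, position));
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void*)offsetof(Vertex, normal));
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void*)offsetof(Vertex, uv));
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}
```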
Point 4 is about preventing state changes from trashing the caches. But it strongly depends on which GL calls they mean. Changing uniforms is cheap. Switching a texture is expensive. The reason is that a uniform sits in a register, not in some piece of memory that gets cached. Switching a shader is expensive, because different shaders exhibit different runtime behaviour, thus trashing the pipeline execution prediction, altering memory (and thus cache) access patterns, and so on.
But those are all micro-optimizations (some of them with huge impact). However, I recommend looking into large-impact optimizations, like implementing an early Z pass and using occlusion queries within the early Z pass for quick rejection of whole geometry batches. One large-impact optimization, which essentially sums up a lot of Point-4-style micro-optimizations, is to sort render batches by expensive GL state: group everything with common shaders, and within those groups sort by texture, and so on. This state grouping only affects the visible render passes. In the early Z pass you're only testing outcomes against the Z buffer, so there's only geometry transformation, and the fragment shaders will just pass through the Z value.
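To make the early Z idea concrete, here is a rough two-pass sketch; the two draw helpers and the depth-only program are placeholders for whatever the engine does, and occlusion queries and error handling are left out:

```cpp
// Assumed helpers, standing in for however the engine submits its batches.
void drawOpaqueGeometryDepthOnly();        // positions only, no material binds
void drawOpaqueGeometrySortedByState();    // binds shaders/textures per batch, grouped by state

void renderFrameWithEarlyZ(GLuint depthOnlyProgram)
{
    // Pass 1: depth only. No colour writes, cheapest possible shader
    // (vertex transform only, trivial fragment stage).
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    glUseProgram(depthOnlyProgram);
    drawOpaqueGeometryDepthOnly();

    // Pass 2: full shading. The depth buffer is already final, so only the
    // front-most fragment of each pixel survives; batches are grouped by
    // shader, then texture, to keep expensive state changes to a minimum.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);                 // depth was written in pass 1
    glDepthFunc(GL_EQUAL);
    drawOpaqueGeometrySortedByState();
}
```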
The first thing you need to know is where exactly your bottleneck is. "The GPU" is not an answer, because it's a complex system; the actual problem could sit in any one of several of its stages.
You need to perform a series of tests to pin the problem down. For example, draw everything into a bigger FBO to see whether it's a fill-rate problem (or increase the MSAA amount). Or draw everything twice to check for draw-call overhead issues.
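For instance, the bigger-FBO probe can look roughly like the sketch below (core-profile GL assumed; the formats and the function name are illustrative, and error checking is omitted). Render the same scene into a target with, say, four times the pixels and compare frame times: if the frame time scales with the pixel count, you are fill-rate bound.

```cpp
// Builds an off-screen colour+depth target of the given size.
GLuint createProbeFramebuffer(int width, int height)
{
    GLuint colorTex = 0, depthRb = 0, fbo = 0;

    glGenTextures(1, &colorTex);
    glBindTexture(GL_TEXTURE_2D, colorTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    glGenRenderbuffers(1, &depthRb);
    glBindRenderbuffer(GL_RENDERBUFFER, depthRb);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, colorTex, 0);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, depthRb);
    return fbo;   // bind it, set glViewport(0, 0, width, height), then draw as usual
}
```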
Just to add my 2 cents to @kvark's and @datenwolf's answers, I'd like to say that, while the points you mention are 'basic' GPU performance tips, more involved optimization is very application-dependent.
In your geometry-heavy test case, you're already pushing 28 million triangles * 40 FPS = 1120 million triangles per second - which is already quite a lot: most (though not all, Fermi being the exception) GPUs out there have a triangle setup rate of 1 triangle per GPU clock cycle. That means a GPU running at, say, 800 MHz cannot process more than 800 million triangles per second, and that is without drawing a single pixel. NVidia's Fermi can process 4 triangles per clock cycle.
If you're hitting this limit (you don't mention your hardware platform), there's not much you can do at the OpenGL/GPU level. All you can do is send less geometry, via more efficient culling (frustum or occlusion), or via a LOD scheme.
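As one concrete example of "send less geometry", a distance-based LOD pick can be as simple as the sketch below; the LodMesh layout, the thresholds and the pre-built coarser meshes are assumptions for illustration, not something from the question:

```cpp
#include <vector>

// One entry per detail level, sorted from most to least detailed.
struct LodMesh {
    GLuint  vao;
    GLsizei indexCount;
    float   maxDistance;   // use this level while the object is closer than this
};

// Pick the finest level whose distance threshold still covers the object.
const LodMesh& selectLod(const std::vector<LodMesh>& levels, float distanceToCamera)
{
    for (const LodMesh& level : levels)
        if (distanceToCamera <= level.maxDistance)
            return level;
    return levels.back();  // beyond all thresholds: fall back to the coarsest mesh
}
```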
Another thing is that tiny triangles hurt fill rate, as rasterizers do parallel processing on square blocks of pixels; see http://www.geeks3d.com/20101201/amd-graphics-blog-tessellation-for-all/.
This very much depends on what particular hardware you are running and what the usage scenarios are. OpenGL performance tips make sense for the general case - the library is, after all, an abstraction over many different driver implementations.
The driver makers are free to optimize however they want under the hood, so on one device they may remove redundant state changes or perform other optimizations without your knowledge, while on another device they may not. It is best to stick to the recommended practices to have a better chance of good performance across a range of devices.