Is batched drawing or a static VBO better in OpenGL?
Which is preferable from an efficiency point of view (or from another point of view, if it matters)?
Situation
An OpenGL application draws many lines at different positions every frame (60 fps). Let's say there are 10 lines. Or 100 000 lines. Would the answer be different?
- #1 Have a static VBO that never changes, containing 2 vertices of a line
Every frame would have one glDrawArrays call per line to draw, and in between there would be matrix transformations to position our one line
- #2 Update the VBO with the data for all the lines every frame
Every frame would have a single draw call
The second is far more efficient.
Changing states, particularly transformation and matrices, tends to cause recalculation of other states and generally more math.
Updating geometry, however, simply involves overwriting a buffer.
With modern video hardware sitting on very high-bandwidth buses, sending a few floats across is trivial. The hardware is designed to move large amounts of data quickly; that comes with the job. Updating vertex buffers is exactly what it does often and fast. If we assume points of 32 bytes each (float4 position and color), 100000 line segments come to 200000 vertices, or 6.4 MB per frame. That is about 384 MB/s at 60 fps, while PCIe 2.0 x16 is about 8 GB/s, I believe.
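The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 32-byte vertex and the 8 GB/s bus figure come from the text, while the function names are made up for illustration.

```c
#include <stdio.h>

/* Assumed vertex layout from the text: float4 position + float4 color. */
#define BYTES_PER_VERTEX 32u
#define VERTICES_PER_LINE 2u

/* Bytes that must be written to the VBO each frame for `lines` segments. */
unsigned long long bytes_per_frame(unsigned long long lines)
{
    return lines * VERTICES_PER_LINE * BYTES_PER_VERTEX;
}

/* Bytes per second at a given frame rate. */
unsigned long long bytes_per_second(unsigned long long lines, unsigned fps)
{
    return bytes_per_frame(lines) * fps;
}

/* Print the estimate next to a nominal bus bandwidth (an assumed figure). */
void print_upload_estimate(unsigned long long lines, unsigned fps,
                           double bus_bytes_per_second)
{
    double per_second = (double)bytes_per_second(lines, fps);
    printf("%llu lines: %.1f MB per frame, %.1f MB/s, %.1f%% of the bus\n",
           lines, bytes_per_frame(lines) / 1e6, per_second / 1e6,
           100.0 * per_second / bus_bytes_per_second);
}
```

Calling `print_upload_estimate(100000, 60, 8e9)` puts the upload at a few percent of the nominal bus bandwidth, so the raw copy is unlikely to be the bottleneck. Real throughput depends on the driver and on how the buffer is updated.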
In some cases, depending on how the driver or card handles transforms, changing one may trigger matrix multiplications and the recalculation of other values, including transforms, culling and clipping planes, etc. This isn't a problem if you change the state, draw a few thousand polys, and repeat, but when state changes are frequent, they carry a significant cost.
Batching is the established answer to this problem: minimize state changes so that more geometry can be drawn between them. It is used to draw large amounts of geometry more efficiently.
As a very clear example, consider the best case for #1: setting the transform triggers no additional calculation and the driver buffers zealously and perfectly. To draw 100000 lines, you still need 100000 matrix updates and 100000 draw calls, 200000 function calls in total.
The function call overhead alone is going to kill performance.
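On the other hand, batching involves computing the vertices for all 100000 lines on the CPU, followed by one buffer update and one draw call.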
You do copy more data, but there's a good chance the VBO contents still aren't as expensive as copying the matrix data. Plus, you save a huge amount of CPU time in function calls (200000 down to 2). This simplifies life for you, the driver (which has to buffer everything and check for redundant calls and optimize and handle downloading) and probably the video card as well (which may have had to recalculate). To make it really clear, visualize simple code for it:
1:
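A minimal sketch of what approach #1 boils down to, not a definitive implementation. `set_matrix` and `draw_one_line` are hypothetical stand-ins that only count calls so the code compiles without a GL context; in a real program they would be something like `glUniformMatrix4fv` (or `glLoadMatrixf` in legacy GL) and `glDrawArrays(GL_LINES, 0, 2)`. The translation-only transform is also an assumption made to keep the example short.

```c
#include <stddef.h>

/* Hypothetical stand-ins that only count calls (not real GL functions). */
static size_t api_calls_per_line = 0;

static void set_matrix(const float m[16]) { (void)m; ++api_calls_per_line; }
static void draw_one_line(void) { ++api_calls_per_line; }

/* Build a column-major translation matrix. A real transform would also
   rotate and scale the static two-vertex segment onto the target line. */
static void make_translation(float m[16], float x, float y)
{
    for (int i = 0; i < 16; ++i)
        m[i] = (i % 5 == 0) ? 1.0f : 0.0f;  /* identity diagonal */
    m[12] = x;
    m[13] = y;
}

/* Approach #1: one matrix update and one draw call per line.
   xy holds 2 floats (x, y) per line.
   Returns the number of API calls issued for this frame. */
size_t draw_frame_per_line(const float *xy, size_t line_count)
{
    api_calls_per_line = 0;
    for (size_t i = 0; i < line_count; ++i) {
        float m[16];
        make_translation(m, xy[2 * i], xy[2 * i + 1]);
        set_matrix(m);    /* state change */
        draw_one_line();  /* draw call */
    }
    return api_calls_per_line;
}
```

The call count grows linearly with the number of lines: two calls per line, every frame.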
(now unwrap that)
2:
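A comparable sketch of approach #2, under the same caveats. `upload_vertices` and `draw_lines` are hypothetical call-counting stand-ins for `glBufferSubData` (or `glBufferData`) and `glDrawArrays(GL_LINES, 0, 2 * line_count)`. The `Vertex` layout follows the 32-byte point assumed earlier, and the white color is arbitrary.

```c
#include <stddef.h>

/* One vertex as assumed in the text: float4 position + float4 color. */
typedef struct {
    float position[4];
    float color[4];
} Vertex;

/* Hypothetical stand-ins that only count calls (not real GL functions). */
static size_t api_calls_batched = 0;

static void upload_vertices(const Vertex *v, size_t count)
{
    (void)v; (void)count;
    ++api_calls_batched;
}

static void draw_lines(size_t vertex_count)
{
    (void)vertex_count;
    ++api_calls_batched;
}

/* Write both endpoints of one line directly in world space. */
static void write_line(Vertex *out, float x0, float y0, float x1, float y1)
{
    const Vertex a = { { x0, y0, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f, 1.0f } };
    const Vertex b = { { x1, y1, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f, 1.0f } };
    out[0] = a;
    out[1] = b;
}

/* Approach #2: the per-line work happens on the CPU, then one upload
   and one draw. endpoints holds 4 floats per line (x0, y0, x1, y1);
   vertices must have room for 2 * line_count entries.
   Returns the number of API calls issued for this frame. */
size_t draw_frame_batched(const float *endpoints, size_t line_count,
                          Vertex *vertices)
{
    api_calls_batched = 0;
    for (size_t i = 0; i < line_count; ++i) {
        const float *e = &endpoints[4 * i];
        write_line(&vertices[2 * i], e[0], e[1], e[2], e[3]);
    }
    upload_vertices(vertices, 2 * line_count);  /* one buffer update */
    draw_lines(2 * line_count);                 /* one draw call */
    return api_calls_batched;
}
```

The loop is still there, but it only writes memory; the API call count stays at two regardless of how many lines are drawn.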