OpenGL: is it better to batch draws or to use a static VBO?

Posted 12-06 09:41

Which is preferable from an efficiency point of view (or from another point of view, if that matters)?

Situation
An OpenGL application draws many lines at different positions every frame (60 fps). Let's say there are 10 lines, or 100 000 lines. Would the answer be different?

  • #1 Have a static VBO that never changes, containing 2 vertices of a line

Every frame, there would be one glDrawArrays call per line, with matrix transformations in between to position that one line

  • #2 Update the VBO with the data for all the lines every frame

Every frame would have a single draw call

Comments (1)

放低过去2024-12-13 09:41:41

The second is far more efficient.

Changing state, particularly transformations and matrices, tends to cause recalculation of other state and generally more math.

Updating geometry, however, simply involves overwriting a buffer.

Modern video hardware sits on a very high-bandwidth bus, so sending a few floats across is trivial. These buses are designed to move large amounts of data quickly, and updating vertex buffers is exactly the kind of thing they do often and fast. If we assume 32 bytes per point (float4 position and float4 color), 100000 line segments (200000 points) come to about 6.4 MB per frame, or roughly 384 MB/s at 60 fps, while PCIe 2.0 x16 offers about 8 GB/s, I believe.
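
The arithmetic behind that estimate can be checked with a short C sketch. The 32-byte vertex, the 60 fps rate and the 8 GB/s bus figure are the assumptions stated in the paragraph; the function names are made up for illustration and nothing here touches OpenGL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed vertex layout from the text: float4 position + float4 color. */
enum { BYTES_PER_VERTEX = 32, VERTICES_PER_LINE = 2 };

/* Bytes that must be uploaded each frame to redraw `lines` line segments. */
static size_t bytes_per_frame(size_t lines)
{
    return lines * VERTICES_PER_LINE * BYTES_PER_VERTEX;
}

/* Fraction of the bus bandwidth used when the whole buffer is re-uploaded
   every frame. bus_bytes_per_sec is an assumption (about 8e9 for
   PCIe 2.0 x16). */
static double bus_fraction(size_t lines, double fps, double bus_bytes_per_sec)
{
    assert(fps > 0.0 && bus_bytes_per_sec > 0.0);
    return (double)bytes_per_frame(lines) * fps / bus_bytes_per_sec;
}
```

For 100000 lines this gives 6,400,000 bytes per frame, which at 60 fps is about 384 MB/s, or roughly 5% of the assumed 8 GB/s.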

In some cases, depending on how the driver or card handles transforms, changing one may trigger matrix multiplications and the recalculation of other values, including derived transforms and culling and clipping planes. This isn't a problem if you change the state, draw a few thousand polys, and repeat, but when state changes are frequent they carry a significant cost.

Batching is the established answer to this problem: minimize state changes so that more geometry can be drawn between them. It is the standard way to draw large amounts of geometry efficiently.

As a very clear example, consider the best case for #1: setting the transform triggers no additional calculation, and the driver buffers zealously and perfectly. To draw 100000 lines, you need:

  • 100000 matrix sets (in system RAM)
  • 100000 matrix set calls with function call overhead (to video driver, copying the matrix to the buffer there)
  • 100000 matrices copied to video RAM, performed in a single lump
  • 100000 line draw calls

The function call overhead alone is going to kill performance.

On the other hand, batching involves:

  • 100000 point calculations and sets, in system RAM
  • 1 VBO copy to video RAM. This will be a large chunk, but it is a single contiguous chunk and both sides know what to expect, so it can be handled well.
  • 1 matrix set call
  • 1 matrix copy to video RAM
  • 1 draw call
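
The two lists can be reduced to a count of driver-facing calls per frame. The sketch below is only a tally in C, with a hypothetical `frame_cost` struct and function names invented for this example; it measures nothing and makes no OpenGL calls.

```c
#include <assert.h>

/* Driver-facing work per frame, counted from the two lists above. */
struct frame_cost {
    long matrix_sets;    /* calls that set a transform */
    long buffer_uploads; /* calls that upload vertex data */
    long draw_calls;     /* calls that issue a draw */
};

/* Approach #1: static two-vertex VBO, one transform and one draw per line. */
static struct frame_cost cost_per_line_draws(long lines)
{
    assert(lines >= 0);
    struct frame_cost c = { lines, 0, lines };
    return c;
}

/* Approach #2: one transform, one buffer upload and one draw for the
   whole batch, regardless of how many lines it holds. */
static struct frame_cost cost_batched(long lines)
{
    assert(lines >= 0);
    (void)lines; /* the call count does not depend on the line count */
    struct frame_cost c = { 1, 1, 1 };
    return c;
}

/* Total number of calls made to the driver in one frame. */
static long total_calls(struct frame_cost c)
{
    return c.matrix_sets + c.buffer_uploads + c.draw_calls;
}
```

With 100000 lines, approach #1 comes to 200000 calls per frame and approach #2 to 3. The tally for #2 includes the buffer upload as a call of its own, so it is one higher than a count of just the matrix set and the draw.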

You may copy more data, but there's a good chance that uploading the VBO contents is still no more expensive than copying all the matrix data. You also save a huge amount of CPU time in function calls (200000 down to 2). This simplifies life for you, for the driver (which has to buffer everything, check for redundant calls, optimize, and handle the transfers) and probably for the video card as well (which may otherwise have had to recalculate state). To make it really clear, visualize simple code for it:

1:

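for (i = 0; i < 100000; ++i)
{
    matrix = calcMatrix(i);   // per-line transform, computed on the CPU
    setMatrix(matrix);        // one state change per line
    drawLines(1, vbo);        // one draw call per line, same static VBO
}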

(now imagine that loop unrolled)

2:

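matrix = calcMatrix();        // one shared transform for the whole batch
setMatrix(matrix);            // one state change per frame
for (i = 0; i < 100000; ++i)
{
    localVBO[i] = point[i];   // plain writes into system RAM
}
setVBO(localVBO);             // one upload of the whole buffer
drawLines(100000, vbo);       // one draw call for every line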
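
One way to read the second loop is that the per-line transform moves from the driver to the CPU: each line's endpoints are computed in their final position and written straight into one contiguous array. Below is a minimal sketch in C, assuming 2D lines that differ only by a translation. The `vertex` and `line_instance` types and `fill_line_batch` are hypothetical names, not part of OpenGL, and the actual upload and draw (for example `glBufferSubData` followed by a single `glDrawArrays` with `GL_LINES`) are left as a comment because they need a live GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout: 2D position only, to keep the sketch short. */
typedef struct { float x, y; } vertex;

/* Hypothetical per-line data: where this frame's copy of the line goes. */
typedef struct { float dx, dy; } line_instance;

/* Writes 2 vertices per instance into `out` (capacity counted in vertices)
   by translating the template segment a-b on the CPU.
   Returns the number of vertices written. */
static size_t fill_line_batch(vertex a, vertex b,
                              const line_instance *inst, size_t count,
                              vertex *out, size_t capacity)
{
    assert(out != NULL && (inst != NULL || count == 0));
    assert(capacity >= 2 * count);
    for (size_t i = 0; i < count; ++i) {
        out[2 * i].x     = a.x + inst[i].dx;
        out[2 * i].y     = a.y + inst[i].dy;
        out[2 * i + 1].x = b.x + inst[i].dx;
        out[2 * i + 1].y = b.y + inst[i].dy;
    }
    /* With a GL context, the array would now be uploaded once and drawn
       with a single GL_LINES draw call covering 2 * count vertices. */
    return 2 * count;
}
```

The loop body is a handful of float additions and stores per line, which is the kind of work the answer argues is cheap compared with 100000 separate state changes and draw calls.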