When are VBOs faster than "simple" OpenGL primitives (glBegin())?

Posted 2024-07-10 18:39:11

After many years of hearing about Vertex Buffer Objects (VBOs), I finally decided to experiment with them (my stuff isn't normally performance critical, obviously...)

I'll describe my experiment below, but to make a long story short, I'm seeing indistinguishable performance between "simple" immediate ("direct") mode (glBegin()/glEnd()), vertex array (CPU side) and VBO (GPU side) rendering modes. I'm trying to understand why this is, and under what conditions I can expect to see VBOs significantly outshine their primitive (pun intended) ancestors.

Experiment Details

For the experiment, I generated a (static) 3D Gaussian cloud of a large number of points. Each point has vertex & color information associated with it. Then I rotated the camera around the cloud in successive frames in sort of an "orbiting" behavior. Again, the points are static, only the eye moves (via gluLookAt()). The data are generated once prior to any rendering & stored in two arrays for use in the rendering loop.

For direct rendering, the entire data set is rendered in a single glBegin()/glEnd() block with a loop containing a single call each to glColor3fv() and glVertex3fv().

For vertex array and VBO rendering, the entire data set is rendered with a single glDrawArrays() call.
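
To put the difference between the two submission paths in numbers, here is a small sketch in C that only counts API calls per frame; it does not touch OpenGL at all. The function names are hypothetical, and the count for the array/VBO path assumes that only the single glDrawArrays() call matters, ignoring the handful of bind and pointer-setup calls around it.

```c
#include <stdio.h>

/* Hypothetical helper: GL calls per frame for the immediate-mode loop
   described above: glBegin() + N * (glColor3fv() + glVertex3fv()) + glEnd(). */
unsigned long immediate_mode_calls(unsigned long num_points)
{
    return 2UL * num_points + 2UL;
}

/* Hypothetical helper: draw calls per frame for vertex arrays or a VBO.
   Assumption: only the single glDrawArrays() call is counted; bind and
   pointer-setup calls are ignored. */
unsigned long draw_arrays_calls(unsigned long num_points)
{
    (void)num_points; /* the count does not depend on the data size */
    return 1UL;
}

/* Print the per-frame call counts for the data sizes used in the experiment. */
void print_call_counts(void)
{
    const unsigned long sizes[] = { 1000UL, 10000UL, 100000UL, 1000000UL };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i) {
        printf("%8lu points: %8lu calls (immediate) vs %lu call (glDrawArrays)\n",
               sizes[i],
               immediate_mode_calls(sizes[i]),
               draw_arrays_calls(sizes[i]));
    }
}
```

Calling print_call_counts() from main() lists the counts for 1K to 1M points. The sketch only shows how much CPU-side call overhead the array paths remove; whether that overhead is the bottleneck is a separate question.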

Then, I simply run it for a minute or so in a tight loop and measure average FPS with the high performance timer.

Performance Results

As mentioned above, performance was indistinguishable on both my desktop machine (XP x64, 8GB RAM, 512 MB Quadro 1700) and my laptop (XP32, 4GB RAM, 256 MB Quadro NVS 110). It did scale as expected with the number of points, however. Obviously, I also disabled vsync.

Specific results from laptop runs (rendering w/GL_POINTS):

glBegin()/glEnd():

1K pts --> 603 FPS
10K pts --> 401 FPS
100K pts --> 97 FPS
1M pts --> 14 FPS

Vertex Arrays (CPU side):

1K pts --> 603 FPS
10K pts --> 402 FPS
100K pts --> 97 FPS
1M pts --> 14 FPS

Vertex Buffer Objects (GPU side):

1K pts --> 604 FPS
10K pts --> 399 FPS
100K pts --> 95 FPS
1M pts --> 14 FPS

I rendered the same data with GL_TRIANGLE_STRIP and got similarly indistinguishable results (though slower, as expected, due to the extra rasterization). I can post those numbers too if anybody wants them.

Question(s)

What gives?
What do I have to do to realize the promised performance gain of VBOs?
What am I missing?

紫罗兰の梦幻 2024-07-17 18:39:11

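There are a lot of factors in optimizing 3D rendering.
Usually there are 4 bottlenecks:

CPU (creating vertices, API calls, everything else)
Bus (CPU<->GPU transfer)
Vertex (vertex shader or fixed-function pipeline execution)
Pixel (fill, fragment shader execution and ROPs)

Your test gives skewed results because you have plenty of CPU (and bus) headroom while maxing out vertex or pixel throughput. VBOs are used to lower CPU load (fewer API calls, DMA transfers running in parallel with the CPU). Since you are not CPU bound, they don't give you any gain. This is optimization 101. In a game, for example, the CPU becomes precious because it is needed for other things like AI and physics, not just for issuing tons of API calls. It is easy to see that writing vertex data (3 floats, for example) directly to a memory pointer is much faster than calling a function that writes 3 floats to memory; at the very least you save the cycles for the call.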
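
The last point, one function call per vertex versus writing through a pointer, can be sketched in plain C. This is only an illustration under stated assumptions: submit_vertex3fv() is a hypothetical stand-in for a per-vertex entry point such as glVertex3fv(), not the real driver code, and the block does not measure timing. It just shows that both paths produce the same bytes, while the first needs one call per vertex.

```c
#include <stddef.h>
#include <string.h>

/* Write position inside the destination buffer (hypothetical driver state). */
static float *g_cursor;

/* Hypothetical stand-in for a per-vertex API call such as glVertex3fv():
   each call copies exactly 3 floats and advances the cursor. */
void submit_vertex3fv(const float *v)
{
    memcpy(g_cursor, v, 3 * sizeof(float));
    g_cursor += 3;
}

/* Path A: one function call per vertex, as in a glBegin()/glEnd() loop. */
void fill_per_vertex(float *dst, const float *src, size_t num_vertices)
{
    g_cursor = dst;
    for (size_t i = 0; i < num_vertices; ++i)
        submit_vertex3fv(src + 3 * i);
}

/* Path B: write the whole array through a pointer in one go, as one would
   with a vertex array or a buffer handed to the driver in a single call. */
void fill_direct(float *dst, const float *src, size_t num_vertices)
{
    memcpy(dst, src, num_vertices * 3 * sizeof(float));
}
```

In a real program the per-vertex call crosses into the driver and cannot be inlined by the compiler, so the gap is larger than this toy version suggests; the gap only shows up in the frame rate when the CPU is the bottleneck.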

卷耳 2024-07-17 18:39:11

There might be a few things missing:

It's a wild guess, but your laptop's card might not support this kind of operation in hardware at all (i.e. it is emulating it).
Are you copying the data to the GPU's memory (via glBufferData(GL_ARRAY_BUFFER, ...) with either the GL_STATIC_DRAW or the GL_DYNAMIC_DRAW usage parameter), or are you using a pointer to an array in main (non-GPU) memory? (The latter requires copying the data every frame, and therefore performance is slow.)
Are you passing indices as another buffer sent via glBufferData with the GL_ELEMENT_ARRAY_BUFFER target?

If those three things are done, the performance gain is big.
For Python (via PyOpenGL) it's about 1000 times faster on arrays bigger than a couple hundred elements,
for C++ up to 5 times faster, but only on arrays of 50k-10m vertices.

Here are my test results for C++ (Core2Duo/8600GTS), in FPS; "glb/e" is glBegin()/glEnd() and "ratio" is vbo divided by glb/e:

 pts   vbo glb/e  ratio
 100  3900  3900   1.00
  1k  3800  3200   1.18
 10k  3600  2700   1.33
100k  1500   400   3.75
  1m   213    49   4.34
 10m    24     5   4.80

So even with 10m vertices the frame rate with VBOs was normal, while with glBegin()/glEnd() it was sluggish.

浅浅淡淡 2024-07-17 18:39:11

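From reading the Red Book, I remember a passage stating that VBOs may be faster depending on the hardware. Some hardware optimizes them, while other hardware does not. It is possible that yours does not.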

娜些时光，永不杰束 2024-07-17 18:39:11

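14M points/s is not a whole lot. It's suspect. Can we see the complete code that does the drawing, as well as the initialization? (Compare that 14M/s to the 240M/s (!) that Slava Vishnyakov gets.) It's even more suspicious that it drops to 640K/s at 1K points (compared to his 3.8M/s, which looks capped by the ~3800 SwapBuffers calls anyway).

I'd wager the test does not measure what you think it measures.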
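
For reference, the throughput figures above are just points per frame multiplied by frames per second. A minimal sketch in C, using only numbers from the tables in this thread (the function names are hypothetical):

```c
#include <stdio.h>

/* Points drawn per second = points per frame * frames per second. */
double points_per_second(double points_per_frame, double fps)
{
    return points_per_frame * fps;
}

/* Print the throughput for the rows discussed in this answer. */
void print_throughput(void)
{
    /* Question, laptop, glBegin()/glEnd() rows. */
    printf("question,   1M pts @   14 FPS: %.0f pts/s\n",
           points_per_second(1e6, 14.0));
    printf("question,   1K pts @  603 FPS: %.0f pts/s\n",
           points_per_second(1e3, 603.0));

    /* C++ table from the earlier answer (Core2Duo/8600GTS), VBO column. */
    printf("VBO table, 10m pts @   24 FPS: %.0f pts/s\n",
           points_per_second(1e7, 24.0));
    printf("VBO table,  1k pts @ 3800 FPS: %.0f pts/s\n",
           points_per_second(1e3, 3800.0));
}
```

The 1M-point row of the question gives 14M points/s and the 10m-point VBO row of the other table gives 240M points/s, matching the figures quoted here. The question's 1K-point row works out to about 603K points/s, slightly below the 640K quoted above.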
