Suggestions for optimizing a Z-buffer implementation?

Posted 2024-12-18 07:04:22

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.

In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this . . .)

I ran gprof on my executable, and I end up with the following interesting lines:

  %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
43.51      9.50     9.50                             vSwap
34.86     17.11     7.61   179944     0.04     0.04  grInterpolateHLine
13.99     20.17     3.06                             grClearDepthBuffer
<snip>
0.76      21.78     0.17      624     0.27    12.46  grScanlineFill

The function vSwap is my double-buffer swapping function, and it also waits for vsync, so it makes sense to me that the test program spends much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.

My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsync overhead, then it turns out that the test program is spending something like 85% of its remaining execution time either clearing the depth buffer, or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}

I really don't see how I can improve that, especially considering that vSetPixel is a macro.

My entire stock of ideas for optimization has been whittled down to precisely one:

Use an integer/fixed-point depth buffer.

The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.

緦唸λ蓇 2024-12-25 07:04:22

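You should take a look at the source code of something like Quake, considering what it could achieve on a Pentium 15 years ago. Its z-buffer implementation uses spans rather than per-pixel (or per-fragment) depth. Otherwise, you could look at the rasterization code in Mesa.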

陌路黄昏 2024-12-25 07:04:22

It's hard to really tell what higher-order optimizations can be done without seeing the rest of the code. I have a couple of minor observations, though.

There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    for(; x1 <= x2; x1 ++, z += zstep, offset++) {
        if(z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}

Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
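As a rough sketch of where that leads, the loop can walk pointers instead of recomputing an index. The original post does not show how vSetPixel stores a pixel, so this assumes a linear frame buffer with one int per pixel; grFrameBuffer is a hypothetical name, as are the buffer dimensions.

```c
#include <assert.h>

#define VIDEO_WIDTH  800
#define VIDEO_HEIGHT 600

/* Assumption: vSetPixel writes into a linear, one-int-per-pixel buffer.
   grFrameBuffer is a hypothetical stand-in for whatever it really uses. */
static float grDepthBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];
static int   grFrameBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    if (x1 > x2) return;                      /* empty span, as in the original loop */
    assert(0 <= x1 && x2 < VIDEO_WIDTH && 0 <= y && y < VIDEO_HEIGHT);

    /* The row offset is computed once; the loop only increments pointers. */
    float *depth = grDepthBuffer + y * VIDEO_WIDTH + x1;
    float *last  = grDepthBuffer + y * VIDEO_WIDTH + x2;   /* inclusive end */
    int   *pixel = grFrameBuffer + y * VIDEO_WIDTH + x1;

    for (; depth <= last; depth++, pixel++, z += zstep) {
        if (z < *depth) {
            *pixel = colour;
            *depth = z;
        }
    }
}
```

A caller that steps down the triangle one scanline at a time could go further and keep a running row pointer (adding VIDEO_WIDTH per line), which removes the multiplication from this function entirely.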

There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
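One way to picture this in plain C (standing in for the assembly block) is a read-only pass that counts how many pixels pass the depth test, followed by either an early exit, an unconditional fill, or the usual per-pixel loop. This is only a sketch under assumptions: grFrameBuffer, grInterpolateHLineFast and SpanResult are hypothetical names, and whether the extra pass pays off depends on how often spans are fully visible or fully hidden on the target machine.

```c
#include <assert.h>

#define VIDEO_WIDTH  800
#define VIDEO_HEIGHT 600

static float grDepthBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];
static int   grFrameBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];  /* hypothetical pixel store */

enum SpanResult { SPAN_REJECTED, SPAN_FILLED, SPAN_MIXED };

enum SpanResult grInterpolateHLineFast(int x1, int x2, int y,
                                       float z, float zstep, int colour) {
    assert(0 <= x1 && x1 <= x2 && x2 < VIDEO_WIDTH && 0 <= y && y < VIDEO_HEIGHT);
    float *depth = grDepthBuffer + y * VIDEO_WIDTH + x1;
    int   *pixel = grFrameBuffer + y * VIDEO_WIDTH + x1;
    int    count = x2 - x1 + 1;

    /* Pass 1: read-only depth test over the whole span, no stores. */
    int passed = 0;
    float zi = z;
    for (int i = 0; i < count; i++, zi += zstep)
        passed += (zi < depth[i]);

    if (passed == 0)
        return SPAN_REJECTED;                 /* fully hidden: nothing to write */

    zi = z;
    if (passed == count) {
        /* Fully visible: unconditional stores, no test inside the loop. */
        for (int i = 0; i < count; i++, zi += zstep) {
            pixel[i] = colour;
            depth[i] = zi;
        }
        return SPAN_FILLED;
    }

    /* Partially occluded: fall back to the per-pixel test. */
    for (int i = 0; i < count; i++, zi += zstep) {
        if (zi < depth[i]) {
            pixel[i] = colour;
            depth[i] = zi;
        }
    }
    return SPAN_MIXED;
}
```

The fully visible branch is where a block copy or wide store would go; it is written as a simple loop here because the colour is an int and the depth values differ per pixel.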

Also, not sure what you mean by older computer, but if your target computer is multi-core then you can break it up among multiple cores. You can do this for the buffer clearing function as well. It can help quite a bit.
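For the buffer clear, a minimal sketch of the multi-core idea is to split the rows across threads. This version assumes a compiler with OpenMP support (for example GCC with -fopenmp); without that flag the pragma is ignored and the loop simply runs on one core. grClearDepthBufferRows is a hypothetical name, not the function from the question.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical helper: fills a width x height depth buffer with `value`.
   Rows do not overlap, so each one can be cleared by a different thread. */
void grClearDepthBufferRows(float *buffer, int width, int height, float value) {
    assert(buffer != NULL && width > 0 && height > 0);

    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        float *row = buffer + (size_t)y * width;
        for (int x = 0; x < width; x++)
            row[x] = value;
    }
}
```

Calling it as grClearDepthBufferRows(grDepthBuffer, VIDEO_WIDTH, VIDEO_HEIGHT, FLT_MAX) would match the clear described in the question. A clear like this is often limited by memory bandwidth rather than by the CPU, so the gain from extra cores may be smaller than the core count suggests.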
