Which OpenGL functions are not GPU accelerated?
I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
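To make the quoted advice concrete, here is a minimal sketch of the approach it describes: build the matrix yourself on the CPU and hand it to a shader as a uniform. It assumes a GL 2.0+ context with the needed entry points already loaded; the program handle and the uniform name uMVP are made-up names for illustration.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */
    #include <string.h>

    /* Build a 4x4 translation matrix in column-major order (what OpenGL expects). */
    static void make_translation(float m[16], float x, float y, float z)
    {
        static const float identity[16] = {
            1,0,0,0,  0,1,0,0,  0,0,1,0,  0,0,0,1
        };
        memcpy(m, identity, sizeof identity);
        m[12] = x;  /* the fourth column holds the translation */
        m[13] = y;
        m[14] = z;
    }

    void upload_matrix(GLuint program)   /* 'program' is a hypothetical linked shader program */
    {
        float mvp[16];
        make_translation(mvp, 1.0f, 2.0f, 3.0f);  /* your own math library would multiply in view/projection here */

        glUseProgram(program);
        GLint loc = glGetUniformLocation(program, "uMVP");   /* "uMVP" is an assumed uniform name */
        glUniformMatrix4fv(loc, 1, GL_FALSE, mvp);           /* the only GPU-visible step: copy the matrix over */
    }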
For a very, very long time I thought most OpenGL functions used the GPU to do their computation. I'm not sure if this is a common misconception, but after thinking about it for a while, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation-level discussion. That's fine, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Answers (5)
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
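A hedged sketch of what that trade-off looks like from the application side: immediate mode hands the driver one tiny piece of work per call, while a buffered draw lets the driver batch everything into one submission. The buffer handle and vertex data below are hypothetical, and an existing GL context is assumed.

    #include <GL/gl.h>   /* assumes an existing GL context; buffer setup omitted */

    /* Immediate mode (GL 2.x and older): every call below is CPU work in the driver,
       which has to accumulate all of it before anything reaches the GPU. */
    void draw_triangle_immediate(void)
    {
        glBegin(GL_TRIANGLES);
        glVertex3f(0.0f, 0.0f, 0.0f);
        glVertex3f(1.0f, 0.0f, 0.0f);
        glVertex3f(0.0f, 1.0f, 0.0f);
        glEnd();
    }

    /* Buffered drawing: the vertex data already lives in a VBO (uploaded once),
       so a single call hands the GPU a whole batch of vertices to crunch. */
    void draw_triangles_batched(GLuint vbo, GLsizei vertex_count)   /* 'vbo' is a hypothetical, pre-filled buffer */
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexPointer(3, GL_FLOAT, 0, (void*)0);
        glEnableClientState(GL_VERTEX_ARRAY);
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
        glDisableClientState(GL_VERTEX_ARRAY);
    }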
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer, but Danvil touched on it.
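A small illustrative sketch of that point, with a note on where the cost actually lands. It assumes some geometry is already bound; the vertex count is a placeholder.

    #include <GL/gl.h>   /* assumes a context with geometry already set up */

    void draw_transparent_pass(GLsizei vertex_count)
    {
        /* These two calls just record state in the driver; essentially free,
           and nothing is "accelerated" here because nothing executes yet. */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        /* Only now does the recorded state matter: the GPU performs the blending
           per fragment while executing this draw. */
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
    }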
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were, which meant that if the h/w was not handling a specific GL feature, it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
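For instance, a minimal sketch of compiling a shader: everything below, including the parsing and code generation triggered by glCompileShader, happens in the driver on the CPU. The one-line shader source is made up purely for illustration.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */
    #include <stdio.h>

    GLuint compile_fragment_shader(void)
    {
        /* A trivial, made-up shader just to have something to compile. */
        const char *src =
            "void main() { gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); }";

        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &src, NULL);
        glCompileShader(shader);          /* parser + compiler: pure CPU work inside the driver */

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (ok != GL_TRUE) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof log, NULL, log);
            fprintf(stderr, "compile failed: %s\n", log);
        }
        return shader;
    }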
Potentially "dangerous" function calls are for example glReadPixels, because data is copied from device (=GPU) memory back to host (=CPU) memory over the limited bus. Also in this category are functions like glTexImage_D or glBufferData, which copy data the other way, from host to device.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions that copy data from host to device and back!
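A hedged sketch of those transfer-heavy calls, with the direction of each copy noted. The buffer, texture, and sizes are hypothetical; an existing context is assumed.

    #include <GL/gl.h>   /* assumes an existing GL context and loaded entry points */
    #include <stdlib.h>

    void transfer_examples(GLuint vbo, GLuint tex, int width, int height)
    {
        /* Host -> device: upload vertex data into a buffer object. */
        float vertices[9] = { 0,0,0,  1,0,0,  0,1,0 };
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof vertices, vertices, GL_STATIC_DRAW);

        /* Host -> device: upload texel data into a texture. */
        unsigned char *texels = calloc((size_t)width * height * 4, 1);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, texels);

        /* Device -> host: read the framebuffer back. On top of the bus transfer,
           this often forces the CPU to wait until the GPU has finished rendering. */
        unsigned char *pixels = malloc((size_t)width * height * 4);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

        free(texels);
        free(pixels);
    }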
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
@sending data to GPU: As far as I know (I've only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The modelview and projection matrices just describe how the GPU should transform vertices when a rendering command is issued.
So, e.g., by calling glTranslate nothing is translated at all yet. Before rendering, the current projection and modelview matrices are multiplied (MVP = projection * modelview), then this single matrix is copied to the GPU, and the GPU then does the matrix * vertex multiplication ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
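A minimal sketch of that split: the CPU composes MVP once per draw, and the per-vertex multiply runs on the GPU inside the vertex shader. The shader is an assumed, illustrative one (uMVP and aPosition are made-up names), kept as a C string for upload with glShaderSource.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */

    /* GPU side: executed once per vertex. */
    static const char *vertex_shader_src =
        "uniform mat4 uMVP;\n"                               /* one matrix, computed on the CPU per draw */
        "attribute vec3 aPosition;\n"
        "void main() {\n"
        "    gl_Position = uMVP * vec4(aPosition, 1.0);\n"   /* the per-vertex 'T&L' multiply, on the GPU */
        "}\n";

    /* CPU side: done once per draw, not per vertex. Column-major 4x4 multiply: out = a * b. */
    static void multiply4x4(float out[16], const float a[16], const float b[16])
    {
        for (int c = 0; c < 4; ++c)
            for (int r = 0; r < 4; ++r)
                out[c*4 + r] = a[0*4 + r] * b[c*4 + 0]
                             + a[1*4 + r] * b[c*4 + 1]
                             + a[2*4 + r] * b[c*4 + 2]
                             + a[3*4 + r] * b[c*4 + 3];
    }

    void set_mvp(GLuint program, const float projection[16], const float modelview[16])
    {
        float mvp[16];
        multiply4x4(mvp, projection, modelview);   /* MVP = projection * modelview */
        glUseProgram(program);
        glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE, mvp);
    }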
Also, you really should not be worried about performance if you don't use these functions in an inner loop somewhere. glTranslate amounts to a handful of additions and multiplications on the current matrix; glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software-rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU at all. There's also hardware that doesn't support certain render states in hardware, so if you set such a state, you fall back to software rendering, and again nothing will run on the GPU (even though there is one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU-accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.
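One way to at least see which implementation you ended up on is to query the implementation strings; a purely software renderer (for example Mesa's llvmpipe, or Microsoft's "GDI Generic") shows up here. A small sketch, assuming a context is already current:

    #include <GL/gl.h>
    #include <stdio.h>

    void print_gl_implementation(void)
    {
        /* Plain string queries, answered by the driver on the CPU. */
        printf("GL_VENDOR:   %s\n", (const char *)glGetString(GL_VENDOR));
        printf("GL_RENDERER: %s\n", (const char *)glGetString(GL_RENDERER));   /* e.g. "llvmpipe" or "GDI Generic" on software paths */
        printf("GL_VERSION:  %s\n", (const char *)glGetString(GL_VERSION));
    }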