Which OpenGL functions are not GPU accelerated?
I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
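To make the quoted advice concrete, here is a minimal sketch of the approach it describes: build the matrix yourself on the CPU and hand it to a shader as a uniform. It assumes a GL 2.0+ context with the needed entry points already loaded; the program handle and the uniform name uMVP are made-up names for illustration.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */
    #include <string.h>

    /* Build a 4x4 translation matrix in column-major order (what OpenGL expects). */
    static void make_translation(float m[16], float x, float y, float z)
    {
        static const float identity[16] = {
            1,0,0,0,  0,1,0,0,  0,0,1,0,  0,0,0,1
        };
        memcpy(m, identity, sizeof identity);
        m[12] = x;  /* the fourth column holds the translation */
        m[13] = y;
        m[14] = z;
    }

    void upload_matrix(GLuint program)   /* 'program' is a hypothetical linked shader program */
    {
        float mvp[16];
        make_translation(mvp, 1.0f, 2.0f, 3.0f);  /* your own math library would multiply in view/projection here */

        glUseProgram(program);
        GLint loc = glGetUniformLocation(program, "uMVP");   /* "uMVP" is an assumed uniform name */
        glUniformMatrix4fv(loc, 1, GL_FALSE, mvp);           /* the only GPU-visible step: copy the matrix over */
    }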
For a very, very long time I thought most OpenGL functions used the GPU to do their computation. I'm not sure if this is a common misconception, but after thinking about it for a while, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation-level discussion. That's fine, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Answers (5)
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
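A hedged sketch of what that trade-off looks like from the application side: immediate mode hands the driver one tiny piece of work per call, while a buffered draw lets the driver batch everything into one submission. The buffer handle and vertex data below are hypothetical, and an existing GL context is assumed.

    #include <GL/gl.h>   /* assumes an existing GL context; buffer setup omitted */

    /* Immediate mode (GL 2.x and older): every call below is CPU work in the driver,
       which has to accumulate all of it before anything reaches the GPU. */
    void draw_triangle_immediate(void)
    {
        glBegin(GL_TRIANGLES);
        glVertex3f(0.0f, 0.0f, 0.0f);
        glVertex3f(1.0f, 0.0f, 0.0f);
        glVertex3f(0.0f, 1.0f, 0.0f);
        glEnd();
    }

    /* Buffered drawing: the vertex data already lives in a VBO (uploaded once),
       so a single call hands the GPU a whole batch of vertices to crunch. */
    void draw_triangles_batched(GLuint vbo, GLsizei vertex_count)   /* 'vbo' is a hypothetical, pre-filled buffer */
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexPointer(3, GL_FLOAT, 0, (void*)0);
        glEnableClientState(GL_VERTEX_ARRAY);
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
        glDisableClientState(GL_VERTEX_ARRAY);
    }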
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer, but Danvil touched on it.
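A small illustrative sketch of that point, with a note on where the cost actually lands. It assumes some geometry is already bound; the vertex count is a placeholder.

    #include <GL/gl.h>   /* assumes a context with geometry already set up */

    void draw_transparent_pass(GLsizei vertex_count)
    {
        /* These two calls just record state in the driver; essentially free,
           and nothing is "accelerated" here because nothing executes yet. */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        /* Only now does the recorded state matter: the GPU performs the blending
           per fragment while executing this draw. */
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
    }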
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were, which meant that if the h/w was not handling a specific GL feature, it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
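For instance, a minimal sketch of compiling a shader: everything below, including the parsing and code generation triggered by glCompileShader, happens in the driver on the CPU. The one-line shader source is made up purely for illustration.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */
    #include <stdio.h>

    GLuint compile_fragment_shader(void)
    {
        /* A trivial, made-up shader just to have something to compile. */
        const char *src =
            "void main() { gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); }";

        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &src, NULL);
        glCompileShader(shader);          /* parser + compiler: pure CPU work inside the driver */

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (ok != GL_TRUE) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof log, NULL, log);
            fprintf(stderr, "compile failed: %s\n", log);
        }
        return shader;
    }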
Potentially "dangerous" function calls are for example glReadPixels, because data is copied from device (=GPU) memory back to host (=CPU) memory over the limited bus. Also in this category are functions like glTexImage_D or glBufferData, which copy data the other way, from host to device.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions that copy data from host to device and back!
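A hedged sketch of those transfer-heavy calls, with the direction of each copy noted. The buffer, texture, and sizes are hypothetical; an existing context is assumed.

    #include <GL/gl.h>   /* assumes an existing GL context and loaded entry points */
    #include <stdlib.h>

    void transfer_examples(GLuint vbo, GLuint tex, int width, int height)
    {
        /* Host -> device: upload vertex data into a buffer object. */
        float vertices[9] = { 0,0,0,  1,0,0,  0,1,0 };
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof vertices, vertices, GL_STATIC_DRAW);

        /* Host -> device: upload texel data into a texture. */
        unsigned char *texels = calloc((size_t)width * height * 4, 1);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, texels);

        /* Device -> host: read the framebuffer back. On top of the bus transfer,
           this often forces the CPU to wait until the GPU has finished rendering. */
        unsigned char *pixels = malloc((size_t)width * height * 4);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

        free(texels);
        free(pixels);
    }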
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
@sending data to GPU: As far as I know (I've only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The modelview and projection matrices just describe how the GPU should transform vertices when a rendering command is issued.
So, e.g., by calling glTranslate nothing is translated at all yet. Before rendering, the current projection and modelview matrices are multiplied (MVP = projection * modelview), then this single matrix is copied to the GPU, and the GPU then does the matrix * vertex multiplication ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
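A minimal sketch of that split: the CPU composes MVP once per draw, and the per-vertex multiply runs on the GPU inside the vertex shader. The shader is an assumed, illustrative one (uMVP and aPosition are made-up names), kept as a C string for upload with glShaderSource.

    #include <GL/gl.h>   /* assumes a GL 2.0+ context and loaded entry points */

    /* GPU side: executed once per vertex. */
    static const char *vertex_shader_src =
        "uniform mat4 uMVP;\n"                               /* one matrix, computed on the CPU per draw */
        "attribute vec3 aPosition;\n"
        "void main() {\n"
        "    gl_Position = uMVP * vec4(aPosition, 1.0);\n"   /* the per-vertex 'T&L' multiply, on the GPU */
        "}\n";

    /* CPU side: done once per draw, not per vertex. Column-major 4x4 multiply: out = a * b. */
    static void multiply4x4(float out[16], const float a[16], const float b[16])
    {
        for (int c = 0; c < 4; ++c)
            for (int r = 0; r < 4; ++r)
                out[c*4 + r] = a[0*4 + r] * b[c*4 + 0]
                             + a[1*4 + r] * b[c*4 + 1]
                             + a[2*4 + r] * b[c*4 + 2]
                             + a[3*4 + r] * b[c*4 + 3];
    }

    void set_mvp(GLuint program, const float projection[16], const float modelview[16])
    {
        float mvp[16];
        multiply4x4(mvp, projection, modelview);   /* MVP = projection * modelview */
        glUseProgram(program);
        glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE, mvp);
    }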
Also, you really should not be worried about performance if you don't use these functions in an inner loop somewhere. glTranslate amounts to a handful of additions and multiplications on the current matrix; glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software-rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU at all. There's also hardware that doesn't support certain render states in hardware, so if you set such a state, you fall back to software rendering, and again nothing will run on the GPU (even though there is one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU-accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.
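One way to at least see which implementation you ended up on is to query the implementation strings; a purely software renderer (for example Mesa's llvmpipe, or Microsoft's "GDI Generic") shows up here. A small sketch, assuming a context is already current:

    #include <GL/gl.h>
    #include <stdio.h>

    void print_gl_implementation(void)
    {
        /* Plain string queries, answered by the driver on the CPU. */
        printf("GL_VENDOR:   %s\n", (const char *)glGetString(GL_VENDOR));
        printf("GL_RENDERER: %s\n", (const char *)glGetString(GL_RENDERER));   /* e.g. "llvmpipe" or "GDI Generic" on software paths */
        printf("GL_VERSION:  %s\n", (const char *)glGetString(GL_VERSION));
    }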