Why are draw calls expensive?
Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. There are a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex ID and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
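To make the "single mesh containing a bunch of small models" option concrete, here is a minimal Python sketch of CPU-side batching: each copy of a small model is transformed by its own 4x4 matrix and appended to one combined vertex list, which would then be submitted with a single draw call. The names `translation`, `transform_point`, and `build_batch` are hypothetical helpers written for this illustration; they do not belong to any graphics API, and the actual upload and draw are left out.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = List[List[float]]  # row-major 4x4 matrix


def translation(tx: float, ty: float, tz: float) -> Mat4:
    """Build a 4x4 translation matrix (hypothetical helper)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(m: Mat4, p: Vec3) -> Vec3:
    """Apply an affine 4x4 matrix to a point, treating w as 1."""
    x, y, z = p
    return (
        m[0][0] * x + m[0][1] * y + m[0][2] * z + m[0][3],
        m[1][0] * x + m[1][1] * y + m[1][2] * z + m[1][3],
        m[2][0] * x + m[2][1] * y + m[2][2] * z + m[2][3],
    )


def build_batch(model: List[Vec3], transforms: List[Mat4]) -> List[Vec3]:
    """Pre-transform every instance on the CPU into one vertex list."""
    batch: List[Vec3] = []
    for m in transforms:
        batch.extend(transform_point(m, v) for v in model)
    return batch


# One small model (a single triangle) placed at four positions.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
transforms = [translation(float(i), 0.0, 0.0) for i in range(4)]

# Four instances become one 12-vertex mesh: one draw call instead of four.
batch = build_batch(triangle, transforms)
```

The batch holds more bytes than four matrices would, which is the point of the question: the trade is more data per call in exchange for fewer calls.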
Comments (3)
First of all, I'm assuming that by "draw calls" you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls. Each one requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only applies if each call submits too little data, since this will cause you to be CPU-bound and stop you from utilizing the hardware fully.
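The CPU-bound argument can be put into rough numbers. The sketch below is a back-of-the-envelope model only: the per-call CPU cost (30 µs) and per-triangle GPU cost (10 ns) are made-up values chosen for illustration, not measurements of any driver or card, and `frame_cost` and `bottleneck` are hypothetical helper names.

```python
def frame_cost(num_triangles, tris_per_call, cpu_ns_per_call, gpu_ns_per_tri):
    """Return (draw calls, CPU ms spent submitting, GPU ms spent rendering)."""
    calls = -(-num_triangles // tris_per_call)  # ceiling division
    cpu_ms = calls * cpu_ns_per_call / 1_000_000
    gpu_ms = num_triangles * gpu_ns_per_tri / 1_000_000
    return calls, cpu_ms, gpu_ms


def bottleneck(cpu_ms, gpu_ms):
    """Whichever side takes longer limits the frame rate."""
    return "CPU" if cpu_ms > gpu_ms else "GPU"


# Assumed, purely illustrative costs.
CPU_NS_PER_CALL = 30_000  # 30 microseconds of driver/API work per draw call
GPU_NS_PER_TRI = 10       # 10 nanoseconds of GPU work per triangle
TRIANGLES = 100_000

# Same geometry, submitted in tiny batches versus large batches.
small_calls, small_cpu, small_gpu = frame_cost(
    TRIANGLES, 2, CPU_NS_PER_CALL, GPU_NS_PER_TRI)
large_calls, large_cpu, large_gpu = frame_cost(
    TRIANGLES, 10_000, CPU_NS_PER_CALL, GPU_NS_PER_TRI)

print(small_calls, bottleneck(small_cpu, small_gpu))
print(large_calls, bottleneck(large_cpu, large_gpu))
```

Under these assumed costs the GPU time is identical in both cases; only the number of calls changes, and with it which processor is the limit.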
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation "Batch Batch Batch!". It's fairly old, but it covers exactly this topic.
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and it may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provides a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
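The deferral described above can be pictured with a toy model. The following Python sketch is not how Direct3D or any real driver is implemented; `ToyCommandBuffer` and its methods are hypothetical names. It only illustrates the shape of the idea: API calls are cheap appends to a queue, and the work of resolving state for each draw happens when the queue is flushed.

```python
class ToyCommandBuffer:
    """Toy model of a command queue; not a real graphics API."""

    def __init__(self):
        self.pending = []    # commands recorded but not yet processed
        self.state = {}      # state as seen by already-flushed commands
        self.executed = []   # (state snapshot, vertex count) per processed draw
        self.flushes = 0

    def set_state(self, name, value):
        # Recording is cheap: nothing is validated or applied yet.
        self.pending.append(("set_state", name, value))

    def draw(self, vertex_count):
        # The draw is also only recorded at this point.
        self.pending.append(("draw", vertex_count))

    def flush(self):
        # All deferred work happens here, in submission order.
        self.flushes += 1
        for cmd in self.pending:
            if cmd[0] == "set_state":
                self.state[cmd[1]] = cmd[2]
            else:
                self.executed.append((dict(self.state), cmd[1]))
        self.pending.clear()


cb = ToyCommandBuffer()
cb.set_state("shader", "lit")
cb.draw(6)
cb.set_state("shader", "unlit")
cb.draw(3)

executed_before_flush = len(cb.executed)  # still 0: nothing has run yet
cb.flush()
executed_after_flush = len(cb.executed)   # both draws are processed now
```

In this model the four API calls do almost nothing, and the cost shows up at `flush()`, which mirrors the claim that the expense is the deferred work rather than the bytes sent.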
Short answer: The driver buffers some or all of the actual work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
Alternate answer(s):