Image processing on the GPU using OpenGL, GLSL and Framebuffer Objects - questions about performance
I am involved in a project that does image processing on the CPU and is currently being extended to use the GPU as well; the hope is to use mainly the GPU, if that proves to be faster, and keep the CPU processing path as a fall-back. I am new to GPU programming and have a few questions; I have seen aspects of them discussed in other threads, but haven't been able to find the answers I need.
1) If we were starting from scratch, what technology would you recommend for image processing on the GPU, in order to achieve the optimum combination of coverage (as in support on client machines) and speed? We've gone down the OpenGL + GLSL route as a way of covering as many graphics cards as possible, and I am curious whether this is the optimal choice. What would you say about OpenCL, for example?

Given we have already started implementing the GPU module with OpenGL and shaders, I would like to get an idea of whether we are doing it in the most efficient way.
2) We use Framebuffer Objects to read from and to render to textures. In most cases the area being read and the area being written to are the same size, but the textures we read from and write to can be an arbitrary size. In other words, we ask the FBO to read a sub-area of what is considered to be its input texture and to write to a sub-area of what is considered to be its output texture. For that purpose the output texture is "attached" to the Framebuffer Object (with glFramebufferTexture2DEXT()), but the input one is not. This requires textures to be "attached" and "detached" as they change their roles (i.e. a texture could initially be used for writing to, but in the next pass it could be used as an input to read from).
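To make that concrete, one pass currently looks roughly like the sketch below (simplified and with illustrative names only; the handles, the destination rectangle and the drawQuad() helper are placeholders, not our actual code):

    /* One filter pass, roughly as described above.  All handles and the
       destination rectangle are assumed to be set up elsewhere; sampler
       uniforms and error checking are omitted. */
    void runFilterPass(GLuint fbo, GLuint inputTex, GLuint outputTex,
                       GLuint filterProgram,
                       int dstX, int dstY, int dstW, int dstH)
    {
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);

        /* Attach the texture we render into; the input texture is never
           attached, it is only bound as a sampler source. */
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, outputTex, 0);

        glBindTexture(GL_TEXTURE_2D, inputTex);
        glUseProgram(filterProgram);

        /* Restrict rendering to the sub-area of the output texture. */
        glViewport(dstX, dstY, dstW, dstH);

        drawQuad();  /* placeholder: draws a quad whose texture coordinates
                        cover the sub-area of the input texture */

        /* On a later pass the roles can swap: outputTex is then bound as
           the sampler source and another texture is attached instead. */
    }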
Instead of that, would forcing the inputs and outputs to be the same size and always keeping them attached to the FBO make more sense, in terms of using the FBO efficiently and achieving better performance? Or does what we already do sound good enough?
3) The project was initially designed to render on the CPU, so care was taken to request the rendering of as few pixels as possible at a time. So, whenever a mouse move happens, for example, only a very small area around the cursor is re-rendered. Or, when rendering a whole image that covers the screen, it might be chopped into strips to be rendered and displayed one after the other.
Does such fragmentation make sense when rendering on the GPU? What would be the best way to determine the optimum size for a render request (i.e. an output texture), so that the GPU is fully utilised?

4) What considerations would there be when profiling code (for performance) that runs on the GPU? (To compare it with rendering on the CPU.)
Does measuring how long calls take to return (and calling glFinish() to ensure commands have completed on the GPU) sound useful or is there anything else to keep in mind?
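To be concrete, this is the kind of crude timing I had in mind (getTimeMs() and renderFilterPass() are just placeholders for our timer and for one render pass):

    /* Crude CPU-side timing of one pass: glFinish() blocks until the GPU
       has actually executed the queued commands, so the interval covers
       GPU work and not just command submission. */
    double t0 = getTimeMs();     /* placeholder high-resolution timer   */
    renderFilterPass();          /* placeholder: issues the GL calls    */
    glFinish();                  /* wait for the GPU to finish them     */
    double t1 = getTimeMs();
    printf("pass took %.3f ms\n", t1 - t0);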
Thank you very much!
I think I need to add a couple of details to clarify my questions:
2) We aren't actually using the same texture as a rendering target and reading source at the same time.
It's only when rendering has finished that an "output" texture becomes "input" - i.e. when the result of a render job needs to be read for another pass or as an input for another filter.
What I was concerned with was whether attached textures are treated differently, as in whether the FBO or shader would have faster access to them, compared with when they aren't attached.
My initial (though probably not totally accurate) profiling didn't show dramatic differences, so I guess we aren't committing that much of a performance crime. I'll do more tests with the timing functions you suggested - these look useful.
3) I was wondering whether chopping a picture into tiny pieces (say, as small as 100 x 100 pixels for a mouse move) and requesting them to be rendered one by one would be slower or faster (or whether it wouldn't matter) on a GPU, which could potentially parallelise a lot of the work. My gut feeling is that this might be overzealous optimisation that, in the best case, won't buy us much and, in the worst, might hurt performance, so I was wondering whether there is a formal way to tell for a particular implementation. In the end, I guess we'd go with what seems reasonable across various graphics cards.
I don't have too much insight into your project, but I'll try to provide some simple answers; perhaps others can be more detailed:
As long as you do the usual modify-output-pixels-using-some-input-pixels tasks from image processing without much synchronization, you should be fine with the usual screen-sized-quad-with-fragment-shader approach (sorry for these strange phrases). And you get image filtering (like bilinear interpolation) for free (I don't know if CUDA or OpenCL support image filtering, although they should, as the hardware is there anyway).
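Just to illustrate what I mean by that approach, a typical filter-pass fragment shader would look something like the following (written here as a C string ready for glShaderSource(); the uniform names and the brightness filter are only an example):

    /* A typical image-processing fragment shader for such a pass, given as
       a C string the way it would be passed to glShaderSource(). */
    static const char *filterFragSrc =
        "uniform sampler2D inputImage;\n"
        "uniform float     brightness;\n"
        "void main()\n"
        "{\n"
        "    vec4 c = texture2D(inputImage, gl_TexCoord[0].st);\n"
        "    gl_FragColor = vec4(c.rgb * brightness, c.a);\n"
        "}\n";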
You cannot read from a texture that is used as a render target anyway (although it may still be attached, I think), so your current approach should be fine. Requiring the textures to be the same size just so they can stay attached to the FBO would limit flexibility very much for practically nothing (I think the attach cost is negligible).
The optimal size is really implementation dependent, but limiting the rendered range, and therefore the number of fragment shader invocations, should always be a good idea, as long as computing these limits doesn't take too long (simple bounding boxes with glScissor are your friend, I think, or just using a quad smaller than the screen).
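As a rough illustration of the scissor idea (the box coordinates and the drawQuad() helper are placeholders):

    /* Limit fragment work to a bounding box around the region that needs
       updating; pixels outside the box stay untouched.  The box
       (x, y, w, h) is assumed to be computed by the application. */
    glEnable(GL_SCISSOR_TEST);
    glScissor(x, y, w, h);
    drawQuad();                  /* placeholder for the usual quad draw */
    glDisable(GL_SCISSOR_TEST);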
There are other, perhaps much more accurate, methods for timing the GPU (look at the GL_ARB_timer_query extension, for example).
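A minimal sketch of how such a timer query could be used (renderFilterPass() stands for one of your render passes):

    /* Timing one pass with GL_ARB_timer_query: the elapsed time is measured
       on the GPU itself and returned in nanoseconds, so no glFinish() is
       needed around the pass being measured. */
    GLuint   query;
    GLuint64 elapsedNs;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    renderFilterPass();          /* placeholder: one of your render passes */
    glEndQuery(GL_TIME_ELAPSED);

    /* Fetching the result waits until the GPU has produced it. */
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
    printf("pass took %.3f ms\n", elapsedNs / 1.0e6);
    glDeleteQueries(1, &query);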
For profiling and debugging you can use general GPU profilers and debuggers, such as gDEBugger and the like, I think, although I don't have much experience with such tools.

EDIT: To your edited questions:

I really doubt that an attached texture is read faster than a non-attached one. The only thing you would gain is that you don't need to re-attach it when you want to write into it, but, as I said, that cost should be negligible, if any.
I would not over-optimize by tiling the image into pieces that are too small. Like I said, when working with GL you can use the scissor test and the stencil test for such things. But it all has to be tested, I think, to be sure of any performance gain. I don't know what you mean by the mouse move, though: when you just move the mouse over your window, the window system usually takes care of rendering the cursor as an overlay, so you need not redraw the underlying image again, as it is buffered by the window system, I think.