Drawing Triangles with CUDA
I'm writing my own graphics library (yes, it's homework :) and using CUDA to do all the rendering and calculations fast.
I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are lots of small triangles in the scene, but it completely kills performance when the triangles are big.
My idea is to do two passes. In the first, compute only a table with information about the scanlines (draw from here to there). That would be a per-triangle computation, one process per triangle, like in the current algorithm. In the second pass, actually draw the scanlines, with more than one process per triangle.
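To make that concrete, here is a minimal sketch of what I mean, assuming the vertices are already in screen space; all the names are my own and clipping to the framebuffer is left out:

```cuda
// Pass 1: one thread per triangle writes (y, xStart, xEnd) spans.
// Pass 2: one *block* per span fills it, so big triangles get many threads.

struct Span { int y, xStart, xEnd; };

// Record where edge p->q crosses the horizontal line y = yc.
__device__ void clipEdge(float2 p, float2 q, float yc, float* xs, float* xe)
{
    if ((p.y <= yc && q.y > yc) || (q.y <= yc && p.y > yc)) {
        float x = p.x + (yc - p.y) / (q.y - p.y) * (q.x - p.x);
        *xs = fminf(*xs, x);
        *xe = fmaxf(*xe, x);
    }
}

__global__ void computeSpans(const float2* verts, int numTris,
                             Span* spans, int* spanCount)  // spanCount zeroed per frame
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    float2 a = verts[3*t], b = verts[3*t+1], c = verts[3*t+2];
    int y0 = (int)ceilf (fminf(a.y, fminf(b.y, c.y)) - 0.5f);
    int y1 = (int)floorf(fmaxf(a.y, fmaxf(b.y, c.y)) - 0.5f);

    for (int y = y0; y <= y1; ++y) {
        float yc = y + 0.5f;                 // sample at pixel centres
        float xs = 1e30f, xe = -1e30f;
        clipEdge(a, b, yc, &xs, &xe);
        clipEdge(b, c, yc, &xs, &xe);
        clipEdge(c, a, yc, &xs, &xe);
        if (xs > xe) continue;               // degenerate / missed scanline

        int x0 = (int)ceilf(xs - 0.5f), x1 = (int)floorf(xe - 0.5f);
        if (x0 <= x1) {
            int i = atomicAdd(spanCount, 1); // append to the global span list
            spans[i] = Span{ y, x0, x1 };
        }
    }
}

__global__ void fillSpans(const Span* spans, int spanCount,
                          uchar4* fb, int width, uchar4 color)
{
    int s = blockIdx.x;                      // one block per span
    if (s >= spanCount) return;
    Span sp = spans[s];
    for (int x = sp.xStart + threadIdx.x; x <= sp.xEnd; x += blockDim.x)
        fb[sp.y * width + x] = color;        // adjacent threads, adjacent pixels
}
```

After copying spanCount back to the host, fillSpans would be launched as fillSpans<<<spanCount, 128>>>(...), so the widest spans no longer serialize onto a single thread.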
But will it be fast enough? Or is there a better solution?
3 Answers
You can check out this blog post: A Software Rendering Pipeline in CUDA. I don't think it's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers on the topic, and it's also CUDA-based.
If I had to do this, I would go with a data-parallel rasterization pipeline like in Larrabee (which is a tile-based renderer) or even REYES, and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
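To give a rough idea of how the binning step of such a tile-based pipeline could look in CUDA (my own simplification, not code from the links above): one thread per triangle drops its index into every 16x16 screen tile that its bounding box touches; a second kernel, not shown, would then rasterize each tile with a whole thread block, so large triangles get many threads.

```cuda
#define TILE 16
#define MAX_TRIS_PER_TILE 256               // fixed-size bins keep the sketch simple

__global__ void binTriangles(const float2* verts, int numTris, int* tileTris,
                             int* tileCounts, int tilesX, int tilesY)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    float2 a = verts[3*t], b = verts[3*t+1], c = verts[3*t+2];

    // Clamp the triangle's bounding box to the tile grid.
    int tx0 = max(0,          (int)fminf(a.x, fminf(b.x, c.x)) / TILE);
    int ty0 = max(0,          (int)fminf(a.y, fminf(b.y, c.y)) / TILE);
    int tx1 = min(tilesX - 1, (int)fmaxf(a.x, fmaxf(b.x, c.x)) / TILE);
    int ty1 = min(tilesY - 1, (int)fmaxf(a.y, fmaxf(b.y, c.y)) / TILE);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx) {
            int tile = ty * tilesX + tx;
            int slot = atomicAdd(&tileCounts[tile], 1);  // reserve a slot in the bin
            if (slot < MAX_TRIS_PER_TILE)
                tileTris[tile * MAX_TRIS_PER_TILE + slot] = t;
        }
}
```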
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, two things are important for getting good performance: optimizing memory access, and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound like they matter for your application.
To optimize your memory access, you want to make sure that your reads from and writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means that adjacent threads in a half-warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8, or 16 bytes at a time.
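As a toy illustration of the difference (my own example, not your code):

```cuda
// Coalesced: thread i touches element i, so a warp's 32 accesses land in
// a few contiguous memory transactions.
__global__ void copyCoalesced(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];              // adjacent threads, adjacent 16-byte loads
}

// Uncoalesced: thread i touches element i*stride, scattering each warp's
// accesses across memory; for stride > 1 this can be many times slower.
__global__ void copyStrided(const float4* in, float4* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```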
If your memory access pattern is random, you might need to consider using texture memory. When you need to refer to memory that has already been read by other threads in a block, you should make use of shared memory.
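For instance, a classic 1D-blur pattern (sketched with assumed names and a fixed block size of 256) stages each block's slice of the input in shared memory once, because neighbouring threads need overlapping elements:

```cuda
__global__ void blur1D(const float* in, float* out, int n)
{
    __shared__ float s[256 + 2];             // block's slice plus two halo cells
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;            // one element each
    if (threadIdx.x == 0)
        s[0] = (i > 0) ? in[i - 1] : 0.0f;                  // left halo
    if (threadIdx.x == blockDim.x - 1)
        s[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f; // right halo
    __syncthreads();                         // whole slice now visible to the block

    if (i < n)                               // three reads hit fast shared memory
        out[i] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}
```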
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest a non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel of your output image. With this strategy, you should watch out for loops in your kernels that execute for a longer or shorter time depending on per-thread data. Each thread in a warp should perform the same number of steps in the same order. The only exception is that there is no real performance penalty when some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
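A minimal sketch of that per-pixel strategy, with assumed names and counter-clockwise screen-space triangles:

```cuda
// Positive for points to the left of the directed edge a->b.
__device__ float edgeFn(float2 a, float2 b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

__global__ void shadeTriangle(float2 a, float2 b, float2 c, uchar4 color,
                              uchar4* fb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float px = x + 0.5f, py = y + 0.5f;      // sample at the pixel centre

    // Every thread in the warp evaluates the same three tests in the same
    // order; threads outside the triangle just skip the store, which is cheap.
    if (edgeFn(a, b, px, py) >= 0.0f &&
        edgeFn(b, c, px, py) >= 0.0f &&
        edgeFn(c, a, px, py) >= 0.0f)
        fb[y * width + x] = color;
}
```

In practice you would launch this over the triangle's screen-space bounding box rather than the full image, so you don't spend a thread on every framebuffer pixel for every triangle.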
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use those APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you want to do additional operations they don't support, you can use CUDA to apply them on top. Or perhaps implement them as shaders.
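As a hedged sketch of that hybrid route (the interop calls below are the real CUDA runtime API; the kernel and names are my own): let OpenGL render into a pixel buffer object, then map it into CUDA and post-process the pixels in place.

```cuda
#include <GL/gl.h>
#include <cuda_gl_interop.h>

__global__ void invertColors(uchar4* px, int n)    // stand-in post-process effect
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        px[i] = make_uchar4(255 - px[i].x, 255 - px[i].y, 255 - px[i].z, px[i].w);
}

void postProcess(GLuint pbo, int numPixels)
{
    cudaGraphicsResource* res;
    // In real code, register once at startup rather than every frame.
    cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsNone);

    cudaGraphicsMapResources(1, &res, 0);          // hand the buffer to CUDA
    uchar4* pixels; size_t bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&pixels, &bytes, res);

    invertColors<<<(numPixels + 255) / 256, 256>>>(pixels, numPixels);

    cudaGraphicsUnmapResources(1, &res, 0);        // give it back to OpenGL
    cudaGraphicsUnregisterResource(res);
}
```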