It can certainly be done, it has been done, and it is currently a hot topic among the raytracing and Cuda gurus. I'd start by perusing http://www.nvidia.com/object/cuda_home.html

But it's basically a research problem. People who are doing it well are getting peer-reviewed research papers out of it. But "well" at this point still means that the best GPU/Cuda results are approximately competitive with best-of-class solutions on CPU/multi-core/SSE. So I think it's a little early to assume that using Cuda is going to accelerate a ray tracer. The problem is that although ray tracing is "embarrassingly parallel" (as they say), it is not the kind of "fixed input and output size" problem that maps straightforwardly to GPUs -- you want trees, stacks, dynamic data structures, etc. It can be done with Cuda/GPU, but it's tricky.
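To make the contrast concrete, here's a minimal sketch of the part that *does* map cleanly onto the GPU: one thread per pixel, casting a primary ray against a flat list of spheres. The types and helper names (Ray, Sphere, make_primary_ray, intersect_sphere) are made up for illustration, not from any library; the point is that this stage has fixed-size input and output and no dynamic data structures, unlike the acceleration-structure traversal and secondary bounces a full ray tracer needs.

```cuda
// Minimal sketch, assuming a trivial scene: one thread per pixel casting a
// primary ray against a flat list of spheres. All names are illustrative.
#include <cuda_runtime.h>

struct Ray    { float3 origin, dir; };
struct Sphere { float3 center; float radius; };

__device__ Ray make_primary_ray(int x, int y, int width, int height)
{
    // Simple pinhole camera at the origin looking down -z.
    Ray r;
    r.origin = make_float3(0.0f, 0.0f, 0.0f);
    float3 d = make_float3((x + 0.5f) / width  - 0.5f,
                           (y + 0.5f) / height - 0.5f,
                           -1.0f);
    float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    r.dir = make_float3(d.x / len, d.y / len, d.z / len);
    return r;
}

__device__ bool intersect_sphere(const Ray& r, const Sphere& s, float& t)
{
    // Standard quadratic ray/sphere test; assumes r.dir is normalized.
    float3 oc = make_float3(r.origin.x - s.center.x,
                            r.origin.y - s.center.y,
                            r.origin.z - s.center.z);
    float b    = oc.x * r.dir.x + oc.y * r.dir.y + oc.z * r.dir.z;
    float c    = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - s.radius * s.radius;
    float disc = b * b - c;
    if (disc < 0.0f) return false;
    t = -b - sqrtf(disc);
    return t > 0.0f;
}

__global__ void primary_ray_kernel(uchar4* framebuffer,
                                   const Sphere* spheres, int num_spheres,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    Ray ray = make_primary_ray(x, y, width, height);

    // Brute-force loop over every sphere: fixed-size input, fixed-size
    // output, no trees or stacks. BVH traversal and secondary bounces are
    // where the dynamic data structures (and the trouble) come in.
    float best_t = 1e30f;
    bool  hit    = false;
    for (int i = 0; i < num_spheres; ++i) {
        float t;
        if (intersect_sphere(ray, spheres[i], t) && t < best_t) {
            best_t = t;
            hit    = true;
        }
    }

    framebuffer[y * width + x] = hit ? make_uchar4(255, 255, 255, 255)
                                     : make_uchar4(0, 0, 32, 255);
}
```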
Your question wasn't clear about your experience level or the goals of your project. If this is your first ray tracer and you're just trying to learn, I'd avoid Cuda -- it'll take you 10x longer to develop, and you probably won't get good speed. If you're a moderately experienced Cuda programmer looking for a challenging project, and ray tracing is just a fun thing to learn, then by all means try to do it in Cuda. If you're making a commercial app and you're looking to get a competitive speed edge -- well, it's probably a crap shoot at this point... you might get a performance edge, but at the expense of more difficult development and a dependence on particular hardware.

Check back in a year; the answer may be different after another generation or two of GPU speed, Cuda compiler development, and research community experience.
One thing to be very wary of in CUDA is that divergent control flow in your kernel code absolutely KILLS performance, due to the structure of the underlying GPU hardware. GPUs are built for massively data-parallel workloads with highly coherent control flow (i.e. you have a couple million pixels, each of which, or at least large swaths of which, will be operated on by the exact same shader program, even taking the same direction through all the branches). This enables some hardware optimizations, like having only a single instruction cache, fetch unit, and set of decode logic for each group of 32 threads. In the ideal case, which is common in graphics, they can broadcast the same instruction to all 32 sets of execution units in the same cycle (this is known as SIMD, or Single-Instruction, Multiple-Data).

They can emulate MIMD (Multiple-Instruction, Multiple-Data) and SPMD (Single-Program, Multiple-Data), but when threads within a Streaming Multiprocessor (SM) diverge (take different code paths out of a branch), the issue logic actually switches between each code path on a cycle-by-cycle basis. You can imagine that, in the worst case, where all threads are on separate paths, your hardware utilization just went down by a factor of 32, effectively killing any benefit you would've had by running on a GPU over a CPU, particularly considering the overhead of marshalling the dataset from the CPU, over PCIe, to the GPU.
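Here is a small, hypothetical sketch of what that divergence looks like in kernel code. The kernels and the per-lane operations are made up for illustration; what matters is the shape of the control flow, not the arithmetic.

```cuda
// Hypothetical sketch of warp divergence versus coherent control flow.
#include <cuda_runtime.h>

__global__ void divergent_kernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int lane = threadIdx.x & 31;  // position within the 32-wide warp

    // Neighbouring lanes take different branches, so the warp cannot issue
    // one instruction for all 32 lanes; it walks through the paths one
    // after another, idling the lanes that did not take the current path.
    switch (lane % 4) {           // 4 paths shown; 32 would be the worst case
        case 0:  out[i] = in[i] * 2.0f;  break;
        case 1:  out[i] = sqrtf(in[i]);  break;
        case 2:  out[i] = __expf(in[i]); break;
        default: out[i] = in[i] + 1.0f;  break;
    }
}

__global__ void coherent_kernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Same amount of work per thread, but every lane in the warp follows
    // the same instruction stream -- the case the hardware is built for.
    out[i] = in[i] * 2.0f;
}
```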
That said, ray-tracing, while data-parallel in some sense, has widely diverging control flow for even modestly complex scenes. Even if you manage to map a bunch of tightly spaced rays, cast out right next to each other, onto the same SM, the data and instruction locality you have for the initial bounce won't hold for very long. For instance, imagine all 32 highly coherent rays bouncing off a sphere. They will all go in fairly different directions after this bounce, and will probably hit objects made of different materials, with different lighting conditions, and so forth. Every material and set of lighting, occlusion, etc. conditions has its own instruction stream associated with it (to compute refraction, reflection, absorption, etc.), so it becomes quite difficult to run the same instruction stream on even a significant fraction of the threads in an SM. With the current state of the art in ray-tracing code, this problem reduces your GPU utilization by a factor of 16-32, which may make performance unacceptable for your application, especially if it's real-time (e.g. a game). It still might be superior to a CPU for, e.g., a render farm.
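A hypothetical sketch of that post-bounce shading problem, assuming a toy set of materials (the Material enum, Hit struct, and shade_* stubs are invented here; a real renderer would carry far more state per hit):

```cuda
// Hypothetical sketch: per-material shading paths after the first bounce.
#include <cuda_runtime.h>

enum Material { DIFFUSE, MIRROR, GLASS, EMISSIVE };

struct Hit { Material mat; float3 normal; float3 point; };

// Each material has its own shading code path (stubbed out here).
__device__ float3 shade_diffuse (const Hit& h) { return h.normal; } // Lambertian term
__device__ float3 shade_mirror  (const Hit& h) { return h.normal; } // reflection ray
__device__ float3 shade_glass   (const Hit& h) { return h.normal; } // refraction + Fresnel
__device__ float3 shade_emissive(const Hit& h) { return h.point;  } // emitted radiance

__global__ void shade_bounce(float3* radiance, const Hit* hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // For primary rays, neighbouring pixels usually hit the same material,
    // so the warp stays coherent. After one bounce off a curved surface,
    // the 32 rays of a warp can be spread across all of these cases, and
    // the warp has to execute each shading path in turn.
    switch (hits[i].mat) {
        case DIFFUSE:  radiance[i] = shade_diffuse (hits[i]); break;
        case MIRROR:   radiance[i] = shade_mirror  (hits[i]); break;
        case GLASS:    radiance[i] = shade_glass   (hits[i]); break;
        case EMISSIVE: radiance[i] = shade_emissive(hits[i]); break;
    }
}
```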
There is an emerging class of MIMD or SPMD accelerators being looked at now in the research community. I would look at these as the logical platforms for software-based, real-time raytracing.

If you're interested in the algorithms involved and in mapping them to code, check out POVRay. Also look into photon mapping; it's an interesting technique that goes one step closer to representing physical reality than raytracing does.