Improving raytracer performance
I'm writing a comparatively straightforward raytracer/path tracer in D (http://dsource.org/projects/stacy), but even with full optimization it still needs several thousand processor cycles per ray. Is there anything else I can do to speed it up? More generally, do you know of good optimizations / faster approaches for ray tracing?
Edit: this is what I'm already doing.
- Code is already running highly parallel
- temporary data is structured in a cache-efficient fashion as well as aligned to 16b
- Screen divided into 32x32-tiles
- Destination array is arranged in such a way that all subsequent pixels in a tile are sequential in memory
- Basic scene graph optimizations are performed
- Common combinations of objects (plane-plane CSG as in boxes) are replaced with preoptimized objects
- Vector struct capable of taking advantage of GDC's automatic vectorization support
- Subsequent hits on a ray are found via lazy evaluation; this prevents needless calculations for CSG
- Triangles are neither supported nor a priority. Plain primitives only, as well as CSG operations and basic material properties
- Bounding is supported
The typical first order improvement of raytracer speed is some sort of spatial partitioning scheme. Based only on your project outline page, it seems you haven't done this.
Probably the most common approach is an octree, but the best approach may well be a combination of methods (e.g. spatial partitioning trees and things like mailboxing). Bounding box/sphere tests are a quick, cheap and nasty approach, but you should note two things: 1) they don't help much in many situations, and 2) if your objects are already simple primitives, you aren't going to gain much (and might even lose). A regular grid is easier to implement for spatial partitioning than an octree, but it only works really well for scenes whose surfaces are somewhat uniformly distributed.
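To make the regular-grid idea a bit more concrete, here is a minimal sketch in Python (not D, and not taken from the project). It only shows the build step: each object's bounding box is binned into the grid cells it overlaps. The names `cells_for_box` and `build_grid`, the box representation and the cell size are all assumptions made for the example.

```python
from collections import defaultdict
from math import floor

def cells_for_box(box_min, box_max, cell_size):
    """Yield the integer (i, j, k) coordinates of every grid cell a box overlaps."""
    lo = [floor(c / cell_size) for c in box_min]
    hi = [floor(c / cell_size) for c in box_max]
    for i in range(lo[0], hi[0] + 1):
        for j in range(lo[1], hi[1] + 1):
            for k in range(lo[2], hi[2] + 1):
                yield (i, j, k)

def build_grid(boxes, cell_size):
    """Map each occupied cell to the indices of the objects whose boxes touch it."""
    grid = defaultdict(list)
    for index, (box_min, box_max) in enumerate(boxes):
        for cell in cells_for_box(box_min, box_max, cell_size):
            grid[cell].append(index)
    return grid

# Two small objects far apart and one that straddles a cell boundary.
boxes = [
    ((0.1, 0.1, 0.1), (0.4, 0.4, 0.4)),
    ((5.2, 5.2, 5.2), (5.6, 5.6, 5.6)),
    ((0.8, 0.1, 0.1), (1.2, 0.4, 0.4)),
]
grid = build_grid(boxes, cell_size=1.0)
```

A ray would then only be tested against the objects stored in the cells it passes through (typically found with a 3D DDA walk, which is not shown here). If most objects end up in a handful of cells, the grid stops helping, which is the uniform-distribution caveat mentioned above.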
A lot depends on the complexity of the objects you represent, your internal design (i.e. do you allow local transforms, referenced copies of objects, implicit surfaces, etc), as well as how accurate you're trying to be. If you are writing a global illumination algorithm with implicit surfaces the tradeoffs may be a bit different than if you are writing a basic raytracer for mesh objects or whatever. I haven't looked at your design in detail so I'm not sure what, if any, of the above you've already thought about.
Like any performance optimization process, you're going to have to measure first to find where you're actually spending the time, then improve things (algorithmically by preference, then by code bumming if necessary).
One thing I learned with my ray tracer is that a lot of the old rules don't apply anymore. For example, many ray tracing algorithms do a lot of testing to get an "early out" of a computationally expensive calculation. In some cases, I found it was much better to eliminate the extra tests and always run the calculation to completion. Arithmetic is fast on a modern machine, but a missed branch prediction is expensive. I got something like a 30% speed-up on my ray-polygon intersection test by rewriting it with minimal conditional branches.
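As a rough illustration of "always run the calculation to completion", here is a sketch of a ray/box slab test with no early outs. It is not the ray-polygon test described above, just a smaller example of the same structure, and it is written in Python purely for readability: the speed benefit only shows up in compiled code, where the loop unrolls into straight-line min/max arithmetic. The function names are assumptions made for the example.

```python
import math

def inverse_direction(direction):
    """Precompute 1/d per axis once per ray; an axis-parallel ray gets infinity."""
    return tuple(1.0 / d if d != 0.0 else math.inf for d in direction)

def ray_hits_box(origin, inv_dir, box_min, box_max):
    """Slab test with no early outs: all three axes are always evaluated.

    Caveat: an origin lying exactly on a slab plane of an axis the ray is
    parallel to produces 0 * inf = NaN, which this sketch does not handle.
    """
    t_near, t_far = 0.0, math.inf
    for axis in range(3):
        t0 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t1 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    # One comparison at the very end instead of a test after every axis.
    return t_near <= t_far

box = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
straight = inverse_direction((0.0, 0.0, 1.0))
hit = ray_hits_box((0.0, 0.0, -5.0), straight, *box)
miss = ray_hits_box((3.0, 0.0, -5.0), straight, *box)
```

The textbook version returns as soon as one axis rules out a hit; whether dropping those exits actually pays off depends on the hardware and the scene, so it is worth measuring both variants.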
Sometimes the best approach is counter-intuitive. For example, I found that many scenes with a few large objects ran much faster when I broke them down into a large number of smaller objects. Depending on the scene geometry, this can allow your spatial subdivision algorithm to throw out a lot of intersection tests. And let's face it, intersection tests can be made only so fast. You have to eliminate them to get a significant speed-up.
Hierarchical bounding volumes help a lot, but I finally grokked the kd-tree, and got a HUGE increase in speed. Of course, building the tree has a cost that may make it prohibitive for real-time animation.
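For readers who have not built such a hierarchy before, here is a minimal sketch of a bounding volume hierarchy with a median split. Note that this is the simpler of the two structures mentioned above, not a kd-tree: a kd-tree splits space with planes, whereas this splits the list of objects. It is in Python for brevity, and the names (`build_bvh`, `candidates`) and the dictionary node layout are assumptions made for the example.

```python
import math

def union(a, b):
    """Smallest axis-aligned box enclosing boxes a and b, each (min, max)."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))

def build_bvh(boxes, indices=None, leaf_size=2):
    """Recursively split object boxes at the median centre along the widest axis."""
    if indices is None:
        indices = list(range(len(boxes)))
    bounds = boxes[indices[0]]
    for i in indices[1:]:
        bounds = union(bounds, boxes[i])
    if len(indices) <= leaf_size:
        return {"bounds": bounds, "leaf": indices}
    axis = max(range(3), key=lambda a: bounds[1][a] - bounds[0][a])
    indices = sorted(indices, key=lambda i: boxes[i][0][axis] + boxes[i][1][axis])
    mid = len(indices) // 2
    return {
        "bounds": bounds,
        "left": build_bvh(boxes, indices[:mid], leaf_size),
        "right": build_bvh(boxes, indices[mid:], leaf_size),
    }

def hits(origin, inv_dir, box):
    """Ray/box slab test; inv_dir holds 1/d per axis (math.inf where d is 0)."""
    t_near, t_far = 0.0, math.inf
    for a in range(3):
        t0 = (box[0][a] - origin[a]) * inv_dir[a]
        t1 = (box[1][a] - origin[a]) * inv_dir[a]
        t_near, t_far = max(t_near, min(t0, t1)), min(t_far, max(t0, t1))
    return t_near <= t_far

def candidates(node, origin, inv_dir):
    """Collect the objects in every leaf whose bounds the ray passes through."""
    if not hits(origin, inv_dir, node["bounds"]):
        return []
    if "leaf" in node:
        return list(node["leaf"])
    return (candidates(node["left"], origin, inv_dir)
            + candidates(node["right"], origin, inv_dir))

# Sixteen unit boxes in a row along x; the ray travels along +z through x = 10.5.
boxes = [((float(x), 0.0, 0.0), (x + 1.0, 1.0, 1.0)) for x in range(16)]
tree = build_bvh(boxes)
found = candidates(tree, (10.5, 0.5, -5.0), (math.inf, math.inf, 1.0))
```

In this toy scene only two of the sixteen objects survive the traversal and need a real intersection test. A production build would pick split positions with a cost heuristic rather than the median, and would visit the nearer child first so it can stop at the closest hit.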
Watch for synchronization bottlenecks.
You've got to profile to be sure to focus your attention in the right place.
Is there anything else I can do to speed it up?
D, depending on the implementation and compiler, delivers reasonably good performance. As you haven't explained which ray tracing methods and optimizations you're already using, I can't give you much help there.
The next step, then, is to run a timing analysis on the program, and recode in assembly the most frequently used or slowest code, whichever impacts performance the most.
More generally, check out the resources in these questions:
I really like the idea of using a graphics card (a massively parallel computer) to do some of the work.
There are many other raytracing related resources on this site, some of which are listed in the sidebar of this question, most of which can be found in the raytracing tag.
I don't know D at all, so I'm not able to look at the code and find specific optimizations, but I can speak generally.
It really depends on your requirements. One of the simplest optimizations is just to reduce the number of reflections/refractions that any particular ray can follow, but then you start to lose out on the "perfect result".
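A toy sketch of what capping the bounce count buys and costs, in Python. The "scene" here is a stand-in (every hit emits light and reflects half of what it receives), so only the effect of the depth cap is on show; the function name and parameters are assumptions made for the example.

```python
def trace(depth, max_depth, reflectivity=0.5, emitted=1.0):
    """Toy 'hall of mirrors': every hit emits light and spawns one reflection.

    Returns (radiance, rays_traced). Real code would intersect the scene here;
    the constant surface is a stand-in so that only the depth cap is on show.
    """
    if depth >= max_depth:
        return 0.0, 0  # cap reached: stop following this path
    bounced, rays = trace(depth + 1, max_depth, reflectivity, emitted)
    return emitted + reflectivity * bounced, rays + 1

full, full_rays = trace(0, max_depth=16)
capped, capped_rays = trace(0, max_depth=4)
```

With these numbers the capped version traces a quarter of the rays and loses about 6% of the radiance. How visible that loss is depends entirely on the scene, which is the "perfect result" tradeoff described above.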
Raytracing is also an "embarrassingly parallel" problem, so if you have the resources (such as a multi-core processor), you could look into calculating multiple pixels in parallel.
Beyond that, you'll probably just have to profile and figure out what exactly is taking so long, then try to optimize that. Is it the intersection detection? Then work on optimizing the code for that, and so on.
Some suggestions.
Raytrace every other pixel. Get the color in between by interpolation. If the colors vary greatly (you are on an edge of an object), raytrace the pixel in between. It is cheating, but on simple scenes it can almost double the performance while you sacrifice some image quality.
Render the scene on the GPU, then load it back. This will give you the first ray/scene hit at GPU speeds. If you do not have many reflective surfaces in the scene, this would reduce most of your work to plain old rendering. Rendering CSG on the GPU is unfortunately not completely straightforward.
Read the source code of POV-Ray for inspiration. :)
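The every-other-pixel suggestion can be sketched on a single scanline as follows. This is Python for illustration only; `shade` stands in for tracing a full ray through a pixel, and the function name and threshold value are assumptions made for the example.

```python
def render_scanline(shade, width, threshold=0.1):
    """Shade even pixels, then interpolate or re-trace the odd ones.

    `shade(x)` stands in for tracing a full ray through pixel x and returns a
    grey value. Returns (pixels, number_of_shade_calls).
    """
    pixels = [0.0] * width
    calls = 0
    for x in range(0, width, 2):          # pass 1: every other pixel
        pixels[x] = shade(x)
        calls += 1
    for x in range(1, width, 2):          # pass 2: fill the gaps
        if x + 1 < width and abs(pixels[x - 1] - pixels[x + 1]) <= threshold:
            pixels[x] = 0.5 * (pixels[x - 1] + pixels[x + 1])
        else:                             # likely an edge (or the last pixel)
            pixels[x] = shade(x)
            calls += 1
    return pixels, calls

# A hard edge between pixels 8 and 9 on a 16-pixel line.
edge = lambda x: 0.0 if x < 9 else 1.0
pixels, calls = render_scanline(edge, 16)
```

On this scanline 10 rays are traced instead of 16 and the result happens to match a full render. Features thinner than two pixels can slip between the samples, which is where the loss in image quality comes from.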
First you have to make sure that you use very fast algorithms. Implementing them can be a real pain, but it is a tradeoff between what you want to do, how far you want to go and how fast it should be.
Some more hints from me:
- Don't use mailboxing techniques. Papers sometimes point out that they don't scale well on current architectures because of the counting overhead.
- Don't use BSP trees or octrees; they are relatively slow.
- Don't use the GPU for raytracing. It is far too slow for advanced effects like reflections, shadows, refraction, photon mapping and so on (I use it only for shading, but that's my own business).
For a completely static scene kd-trees are unbeatable, and for dynamic scenes there are clever algorithms that scale very well on a quad-core (I am not sure about the performance beyond that).
And of course, for really good performance you need to write a lot of SSE code (without too many jumps, of course), but if you can give up maybe 10-15% of that performance, compiler intrinsics are enough to implement your SSE code.
And some decent papers about the algorithms I was talking about:
"Fast Ray/Axis-Aligned Bounding Box - Overlap Tests using Ray Slopes"
(A very fast ray/AABB hit test that parallelizes well with SSE. Note that the code in the paper is not complete; just search for the title of the paper and you'll find the rest.)
http://graphics.tu-bs.de/publications/Eisemann07RS.pdf
"Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies"
http://www.sci.utah.edu/~wald/Publications/2007///BVH/download//togbvh.pdf
If you know how the above algorithm works, then this is a much better algorithm:
"The Use of Precomputed Triangle Clusters for Accelerated Ray Tracing in Dynamic Scenes"
http://garanzha.com/Documents/UPTC-ART-DS-8-600dpi.pdf
I'm also using the Plücker test to determine quickly (not that accurately, but you can't have everything) whether I hit a polygon. It works very nicely with SSE and above.
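For reference, a minimal sketch of a Plücker-coordinate test against a triangle, in scalar Python rather than SSE. It checks whether the ray's supporting line passes on the same side of all three directed edges. This is a generic textbook formulation and not necessarily the variant the author uses; the function names are assumptions made for the example.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def pluecker(point, direction):
    """Plücker coordinates of a line: (direction, moment = point x direction)."""
    return direction, cross(point, direction)

def side(line_a, line_b):
    """Permuted inner product; its sign says how line_b winds around line_a."""
    return dot(line_a[0], line_b[1]) + dot(line_b[0], line_a[1])

def line_hits_triangle(origin, direction, v0, v1, v2):
    """True if the ray's supporting line passes through the triangle.

    This tests the infinite line: it does not reject hits behind the origin
    and does not compute the hit distance. A line lying in the triangle's
    plane gives all-zero products and is reported as a hit.
    """
    ray = pluecker(origin, direction)
    sides = []
    for a, b in ((v0, v1), (v1, v2), (v2, v0)):
        edge_dir = (b[0] - a[0], b[1] - a[1], b[2] - a[2])
        sides.append(side(ray, pluecker(a, edge_dir)))
    return all(s >= 0.0 for s in sides) or all(s <= 0.0 for s in sides)

triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
inside = line_hits_triangle((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), *triangle)
outside = line_hits_triangle((1.0, 1.0, -1.0), (0.0, 0.0, 1.0), *triangle)
```

The test is nothing but dot and cross products with three sign checks at the end, and the edge coordinates can be precomputed per triangle, which is why it maps well onto SIMD.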
So my conclusion is that there are many great papers out there on many topics related to raytracing (how to build fast, efficient trees, how to shade (BRDF models) and so on). It is a really amazing and interesting field for experimenting, but you also need a lot of spare time, because it is so damn complicated, but fun.
My first question is: are you trying to optimize the tracing of one single still screen, or is this about optimizing the tracing of multiple screens in order to calculate an animation?
Optimizing for a single shot is one thing; if you want to calculate successive frames in an animation, there are lots of new things to think about and optimize.
There is much more you could do, but these are the suggestions I could think of immediately:
You can build an optimized hierarchy based on statistics in order to quickly identify candidate nodes when intersecting geometry. In your case you'll have to combine the automatic hierarchy with the modeling hierarchy, that is, either constrain the build or have it eventually clone modeling information.
"Packet traversal" means you use SIMD instructions to process 4 rays in parallel, one scalar lane per ray, while traversing the hierarchy (which is typically the hot spot), in order to squeeze the most performance out of the hardware.
You can perform some per-ray-statistics in order to control the sampling rate (number of secondary rays shot) based on the contribution to the resulting pixel color.
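One simple way to read the per-ray statistics idea is to track how much of a path's light can still reach the pixel and scale the number of secondary rays by it. The sketch below, in Python, assumes that reading; the function name, the default of 16 samples and the cutoff are hypothetical values chosen for the example.

```python
def secondary_ray_count(throughput, base_samples=16, cutoff=0.01):
    """Choose how many secondary rays to shoot from a hit point.

    `throughput` is the fraction of this path's light that can still reach
    the pixel (1.0 for a primary ray, shrinking with every bounce).
    """
    if throughput < cutoff:
        return 0                       # contribution too small to matter
    return max(1, round(base_samples * throughput))

counts = [secondary_ray_count(t) for t in (1.0, 0.5, 0.05, 0.001)]
```

A hard cutoff like this darkens the image slightly because the dropped paths are simply lost. Russian roulette, where a path is terminated at random and the survivors are reweighted, is the usual unbiased alternative.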
Using a space-filling (area) curve on the tile lets you decrease the average spatial distance between consecutively traced pixels, and thus increase the probability that your performance benefits from cache hits.
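One common curve of this kind is the Z-order (Morton) curve. The suggestion above does not say which curve was meant, so that choice is an assumption, as are the function names. The sketch below (Python, for illustration) generates the pixel visiting order for a 32x32 tile by de-interleaving the bits of a running index.

```python
def morton_decode(code, bits=5):
    """Split an interleaved Z-order index into (x, y); bits=5 covers 32x32."""
    x = y = 0
    for bit in range(bits):
        x |= ((code >> (2 * bit)) & 1) << bit
        y |= ((code >> (2 * bit + 1)) & 1) << bit
    return x, y

def tile_order(size=32):
    """Pixel visiting order for a size x size tile along the Z-order curve."""
    bits = size.bit_length() - 1       # size is assumed to be a power of two
    return [morton_decode(code, bits) for code in range(size * size)]

order = tile_order(32)
```

Consecutive entries stay inside small 2x2, 4x4, 8x8 blocks before moving on, so neighbouring rays tend to touch the same scene data. If the destination array is laid out in the same order, the writes stay sequential as well.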