Pragmatic loop unrolling technique
I'm looking for a pragmatic example of a loop unrolling technique.
I think Duff's device is a nice trick.
But in Duff's device the destination pointer is never incremented. That makes it useful for embedded programmers who copy data to a serial device, but not for general programmers.
Could you give me a nice and useful example?
If you have used it in real code, even better.
Comments (3)
The most pragmatic technique would be to learn and love your compiler's optimization options, and occasionally inspect the generated assembly by hand if you encounter hotspots in profiling.
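As a rough illustration of leaving the unrolling to the compiler, here is a minimal sketch. The function name sum_array is hypothetical, and `#pragma GCC unroll` is GCC-specific (GCC 8 and later); other compilers spell the hint differently or ignore it, so it is guarded here. Whether the hint helps at all is something only profiling can tell you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hot loop: sum n ints. The pragma only asks the compiler
 * to unroll; the source stays a plain, readable loop. */
long sum_array(const int *a, size_t n)
{
    assert(a != NULL || n == 0);

    long total = 0;
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#pragma GCC unroll 4
#endif
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```

Compiling with something like `gcc -O2 -S file.c` writes the generated assembly to `file.s`, which is where you would check whether the loop was actually unrolled. GCC's `-funroll-loops` option requests unrolling more broadly, without a per-loop pragma.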
I'm not sure what you mean by "destination is never increased."
Manual loop unrolling is rather uncommon. Embedded microprocessors today are fast enough that such optimization is unnecessary (and would waste valuable program memory).
I use a variation of Duff's device in a linear solver kernel. There must be one back_step for each fwd_step, and they are performed in groups of four.
Note that the forward and backward-going loops are implemented by gotos. When the if in fwd_step is skipped, execution jumps into the middle of the backward loop. So it's really a kind of double Duff's device.
This isn't any kind of "pragmatic" technique, it's just the best way I could find to express some very convoluted flow control.
(For the benefit of others, background on Duff's device can be found here and here)
I've encountered it in image processing optimizations, especially for handling border conditions where fewer pixels than a complete tile or kernel are left to copy (this can avoid a test at each coordinate).
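To make the border case concrete, here is a minimal sketch of a Duff's-device-style copy in which both pointers advance. The original device wrote to a single memory-mapped register, which is why its destination never moved. The name copy_pixels and the one-byte-per-pixel layout are assumptions for illustration, not code from any real image library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy n one-byte pixels, unrolled by 4.
 * The switch jumps into the middle of the loop body so that the
 * n % 4 leftover pixels are handled on the first pass, with no
 * per-pixel bounds test. */
void copy_pixels(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);
    if (n == 0)
        return;

    size_t passes = (n + 3) / 4;   /* trips through the do-while */

    switch (n % 4) {
    case 0: do { *dst++ = *src++;  /* intentional fallthrough below */
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

For a row whose last tile holds only 3 valid pixels, `copy_pixels(dst, src, 3)` enters at `case 3` and runs a single partial pass. In practice `memcpy` or an optimizing compiler will usually match or beat this, so treat it as an illustration of the control flow rather than a recommended optimization.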