How do you factor branching out of a tight loop?
My question is: how can I add features to my processing loop without the overhead of checking the true/false values of the user settings to navigate the branches? The settings are the same for all iterations of the loop. Do modern processors with branch prediction make this unnecessary?
My programs often follow this pattern:
- User adjusts settings: checkboxes, sliders, numerical entries.
- Data is processed when an update is triggered
- Apply settings to local variables
- Run loop over a large dataset
- Add if statements to bypass code unused under the current user settings
- Return from loop
- Return transformed data
How can you template or inline out all permutations ahead of time?
Example:

bool setting1 = true;
bool setting2 = false;

vector<float> data;
for (int i = 0; i < 100000; i++)
    data.push_back(i);

for (int i = 0; i < 100000; i++) {
    if (setting1)
    {
        doStuff(data[i]);
        // ...
    }
    if (setting2)
    {
        doMoreStuff(data[i]);
        // ...
    }
    // ... etc
}
I know this is a silly example. But I'd like to know what pattern scales when there are lots of branches.
4 Answers
Firstly, unless the operations are incredibly cheap compared to the cost of an iteration of the loop (branches + loop overhead), simply don't worry about it and do whatever is most readable. Premature optimisation is the root of much evil; don't just assume things will be slow, do some profiling so that you know.
If you do find yourself genuinely spending more time iterating than doing useful work - that is, your overhead is too high - you need a sensible way to reduce that overhead: a way to select between different loop bodies/implementations optimised for particular combinations of inputs.
Factoring the conditions out of the loop to make multiple loops might initially seem like a good idea. However, if most of the settings are enabled most of the time, and your actual operations are cheap enough for the overhead to be an issue at all, you may find the performance largely unchanged - each of the new loops has a per-iteration cost!
If that is the case, a way forward might be to use templates or other means to instantiate variants of the loop body for the most common combinations of inputs, select at a high level between those variants when a suitable one is available, and fall back to the generic case when it is not.
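A minimal sketch of that shape, reusing the question's setting1/setting2 and hypothetical doStuff/doMoreStuff stand-ins for the real operations:

#include <vector>

void doStuff(float &x)     { x *= 2.0f; } // hypothetical stand-ins for
void doMoreStuff(float &x) { x += 1.0f; } // the question's operations

// One loop body per settings combination: S1/S2 are compile-time
// constants, so each instantiation compiles to a branch-free loop.
template <bool S1, bool S2>
void processFixed(std::vector<float> &data) {
    for (float &x : data) {
        if (S1) doStuff(x);
        if (S2) doMoreStuff(x);
    }
}

// Generic fallback for combinations that were not specialised.
void processGeneric(std::vector<float> &data, bool s1, bool s2) {
    for (float &x : data) {
        if (s1) doStuff(x);
        if (s2) doMoreStuff(x);
    }
}

// The selection happens once, at a high level, outside the hot loop.
void process(std::vector<float> &data, bool s1, bool s2) {
    if (s1 && s2)       processFixed<true, true>(data);
    else if (s1 && !s2) processFixed<true, false>(data);
    else                processGeneric(data, s1, s2);
}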
If the size of the dataset is known at compile time, the compiler can potentially perform optimizations such as unrolling the loop, and if the body is a pure mathematical operation it may be able to vectorize it as well.
You can also do the logic inside-out, hoisting each settings check outside its own loop:
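A minimal sketch of the inside-out arrangement, again with hypothetical doStuff/doMoreStuff stand-ins:

#include <vector>

void doStuff(float &x)     { x *= 2.0f; } // hypothetical stand-ins
void doMoreStuff(float &x) { x += 1.0f; }

void processInsideOut(std::vector<float> &data, bool setting1, bool setting2) {
    // Each setting is tested once; every enabled operation then makes
    // its own dedicated, branch-free pass over the data.
    if (setting1)
        for (float &x : data)
            doStuff(x);
    if (setting2)
        for (float &x : data)
            doMoreStuff(x);
}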
It is bad for memory locality, though, as one person said: every enabled operation is another full pass over the data.
Branch prediction will help you a good deal, but only if the cost of the occasional mispredicted branch is less than the speedup it buys (for a large dataset, it should help, not hurt).
If your data operation is fully parallel, i.e. you are running SIMD-style element-wise work, you could investigate threading out the operations: e.g. open up 3 threads and have each thread t take the elements whose index i satisfies i % 3 == t, t being the thread index and i being the data index. (You can partition the data in different ways; see the sketch below.) For a large enough data set, presuming you have no synchronization operations, you will see a linear speedup.
If you are writing this for a specialized system, e.g. an industrial computer with a given CPU, and you can assume you will always have that CPU, you can optimize much more heavily for what that CPU can support: things like exact cache size, depth of pipeline, etc. can all be coded in. Unless you can make that assumption, it's sketchy to rely on those details.
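A minimal sketch of that strided partitioning with std::thread; doStuff is a hypothetical stand-in, and note that block partitioning (one contiguous chunk per thread) is often friendlier to the cache:

#include <thread>
#include <vector>

void doStuff(float &x) { x *= 2.0f; } // hypothetical stand-in operation

// Thread t processes every element whose index i satisfies i % T == t.
// Assumes doStuff touches only its own element, so the threads need no
// synchronization.
void processParallel(std::vector<float> &data) {
    const std::size_t T = 3; // thread count from the example above
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < T; ++t)
        threads.emplace_back([&data, t, T] {
            for (std::size_t i = t; i < data.size(); i += T)
                doStuff(data[i]);
        });
    for (auto &th : threads)
        th.join();
}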
You can avoid the overhead by testing each setting once, outside the loop (supposing settingx doesn't affect settingy):
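The code for this answer did not survive; one way to read the description (each setting tested once, up front, with the enabled operations collected into a list) is a sketch like the following, written in C++ to match the question even though this answer's original source was C#:

#include <vector>

void doStuff(float &x)     { x *= 2.0f; } // hypothetical stand-ins for
void doMoreStuff(float &x) { x += 1.0f; } // the question's operations

void process(std::vector<float> &data, bool setting1, bool setting2) {
    // Test each setting exactly once, before the loop, and collect the
    // enabled operations; the hot loop then runs whatever was collected.
    std::vector<void (*)(float &)> ops;
    if (setting1) ops.push_back(&doStuff);
    if (setting2) ops.push_back(&doMoreStuff);

    for (float &x : data)
        for (auto op : ops)
            op(x); // an indirect call instead of a branch per setting
}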
But, in my opinion, the best solution is keeping your code. Today's branch prediction units are very powerful, and considering that you'll loop many thousands of times with every branch having the same result, you can afford a few cycles of warm up ;)
EDIT:
I compared our approaches to the problem with a simple console program (sources; it's C#, though). The loop is executed 1,000,000 times and I used trigonometric functions together with double-precision floating-point operations. Test 2 is the solution I showed above, and the three letters are the values of setting1, setting2 and setting3.
The results: my solution turned out about 1% slower. I also did a test run with all three test functions empty, to prove that the loop and calling overhead is minimal. The second point of my answer, though, is proven to be correct: the loop overhead is completely negligible.
对主循环使用模板。
当您更改设置时:(您可能可以减少此代码)
您希望留在循环()中直到设置更改。
应谨慎使用,因为它可能会导致膨胀。
分析了答案(G++ O2 优化):
Use templates for the main loop.
When you change settings, pick the matching instantiation (you could probably make this less code).
You want to stay in loop() until settings change, as sketched below.
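This answer's code also did not survive; a minimal sketch of the idea under those assumptions, again with hypothetical doStuff/doMoreStuff stand-ins:

#include <vector>

std::vector<float> data;                  // the dataset being processed
void doStuff(float &x)     { x *= 2.0f; } // hypothetical stand-ins
void doMoreStuff(float &x) { x += 1.0f; }

// The main loop, templated on the settings: each instantiation is a
// dedicated, branch-free loop body.
template <bool S1, bool S2>
void loop() {
    for (float &x : data) {
        if (S1) doStuff(x);
        if (S2) doMoreStuff(x);
    }
}

// Selected once per settings change, so processing stays inside the
// chosen loop() instantiation until the user changes something.
void (*currentLoop)() = loop<false, false>;

void settingsChanged(bool s1, bool s2) {
    if      (s1 && s2)   currentLoop = loop<true, true>;
    else if (s1 && !s2)  currentLoop = loop<true, false>;
    else if (!s1 && s2)  currentLoop = loop<false, true>;
    else                 currentLoop = loop<false, false>;
}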
This should be used with care as it can lead to bloat.
Profiled the answers (g++ -O2 optimization).