如何从紧密循环中分解出分支？

发布于 2024-12-07 02:14:59 字数 713 浏览 6 评论 0原文

我的问题是：如何在处理循环中添加功能，而无需检查用户设置的真/假来导航分支？循环中所有迭代的设置都是相同的。具有分支预测功能的现代处理器是否使这变得不必要？

我的程序遵循以下模式：

用户调整设置、复选框、滑块、数字条目。
触发更新时处理数据
1. 将设置应用于局部变量
2. 在大型数据集上运行循环
  - 添加 if 语句以绕过用户设置中未使用的代码。
  - 从循环中返回
3. 返回转换后的数据

如何提前模板化或内联所有排列？

示例：

bool setting1 = true;
bool setting2 = false;
vector<float> data;

for(int i=0;i<100000;i++)
    data.push_back(i);

for(int i=0;i<100000;i++) {
    if (setting1)
    {
       doStuff(data[i]);
       ....
    }
    if (setting2)
    {
       doMoreStuff(data[i]);
       .....
    }

    .... //etc
}

我知道这是一个愚蠢的示例。但我想知道当有很多分支时，什么模式会缩放。

原文

My question is: How can I add features to my processing loop without the overhead of checking the true/falseness of the user settings to navigate the branches? The settings are the same for all iterations on the loop. Do modern processors with branch prediction make this unnecessary?

My programs over follow this pattern:

User adjusts settings, checkboxes, sliders, numerical entries.
Data is processed when an update is triggered
1. Apply settings to local variables
2. Run loop over a large dataset
  - add if statements to bypass unused code from the user settings.
  - return from loop
3. return transformed data

How can you template or inline out all permutations ahead of time?

example:

bool setting1 = true;
bool setting2 = false;
vector<float> data;

for(int i=0;i<100000;i++)
    data.push_back(i);

for(int i=0;i<100000;i++) {
    if (setting1)
    {
       doStuff(data[i]);
       ....
    }
    if (setting2)
    {
       doMoreStuff(data[i]);
       .....
    }

    .... //etc
}

I know this is a silly example. But I'd like to know what pattern scales when there are lots of branches.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

鼻尖触碰 2024-12-14 02:15:00

首先，除非与循环迭代的成本（分支+循环开销）相比，操作非常便宜，否则不要担心它并执行最可读的操作。过早的优化是万恶之源；不要只是假设事情会很慢，进行一些分析以便您知道。

如果您确实发现自己确实花费了更多的时间进行迭代而不是做有用的工作 - 也就是说，您的开销太高 - 您需要找到一种明智的方法来减少开销；因此，可以在针对特定输入组合优化的不同循环体/实现之间进行选择。

将条件从循环中分解出来以形成多个循环，最初似乎是一个好主意，但是，如果大多数设置都已启用，并且您的实际操作足够便宜，足以使开销成为一个问题，您可能会发现性能基本不变 - 每个新循环都有每次迭代成本！

如果是这种情况，一种前进的方法可能是使用模板或其他方法来实例化最常见输入组合的循环体变体，在合适的循环可用时在调用这些循环之间进行高级选择，然后当情况并非如此时，回到一般情况。

回复收藏 0 原文

吖咩 2024-12-14 02:15:00

如果在编译时已知数据集的大小，则编译器可能会执行：

循环展开

如果是数学运算，则

矢量化可以发挥作用

您还可以从内到外执行逻辑：

if (epic-setting)
{
//massive for loop
}

它是正如一个人所说，对内存局部性不利。

当且仅当错过分支的成本小于给定的加速时，分支预测才会对您有很大帮助（对于大型数据集，它应该有所帮助，而不是有害）。

如果您的数据操作是完全并行的，即您正在运行 SIMD，您可以研究线程化操作：例如，打开 3 个线程，并让所有 3 个线程执行 i % t 操作，t 是线程索引，i 是数据索引。（您可以以不同的方式对数据进行分区）。对于足够大的数据集，假设您没有同步操作，您将看到线性加速。

如果您正在为专用系统（例如具有给定 CPU 的工业计算机）编写此代码，并且您可以假设您将始终拥有该 CPU，则可以针对该 CPU 可以支持的内容进行更大幅度的优化。像确切的缓存大小、管道深度等都可以编码。除非您可以假设，否则尝试假设这些计数是粗略的。

If the size of the dataset is known at compile time, then the compiler can potentially perform:

loop unrolling

If it is a mathematical operation

vectorization can come into play

You can also do the logic inside-out:

if (epic-setting)
{
//massive for loop
}

It is bad for memory locality, as one person said.

Branch prediction will help you a good deal, if and only if the cost of the missed branch is less than the speedup given (for a large dataset, it should help, not hurt).

If your data operation is fully parallel, ie, you are running SIMD, you could investigate threading out the operations: e.g., open up 3 threads, and have all 3 take the i % t operation, t being the thread index, i being the data index. (You can partition the data different ways). For a large enough data set, presuming you don't have synch operations, you will see a linear speedup.

If you are writing this for a specialized system, e.g., an industrial computer with a given CPU, and you can assume you will always have that CPU, you can optimize much more heavily for what that CPU can support. Things like exact cache size, depth of pipeline, etc, can all be coded in. Unless you can assume that, it's sketchy to try and assume on those counts.

回复收藏 0 原文

挽心 2024-12-14 02:15:00

您可以通过这种方式避免开销（假设settingx不影响settingy）：

if(setting1) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting3) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting3) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

但是，在我看来，最好的解决方案是保留您的代码。今天的分支预测单元非常强大，考虑到您将循环数千次且每个分支都具有相同的结果，您可以承受几个周期的预热;)

编辑：
我将我们解决问题的方法与一个简单的控制台程序进行了比较（来源，尽管它是 c#）。该循环执行了 1000000 次，我使用了三角函数和双精度浮点运算。测试2就是我上面展示的解决方案，三个字母分别是setting1、setting2、setting3的值。
结果是：

test 1 - fft: 13974 ms
test 2 - fft: 14106 ms

test 1 - tft: 27728 ms
test 2 - tft: 28081 ms

test 1 - ttt: 41833 ms
test 2 - ttt: 41982 ms

我还进行了一次测试运行，所有三个测试函数都为空，以证明循环和调用开销最小：

test 1 - fft: 4 ms
test 2 - fft: 4 ms

test 1 - tft: 8 ms
test 2 - tft: 8 ms

test 1 - ttt: 12 ms
test 2 - ttt: 12 ms

实际上，我的解决方案慢了大约 1%。我的答案的第二点被证明是正确的：循环开销是完全可解释的。

You can avoid the overhead this way (supposing settingx doesn't affect settingy):

if(setting1) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting3) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

if(setting3) {
    for(int i=0;i<100000;i++) {
        // ...
    }
}

But, in my opinion, the best solution is keeping your code. Today's branch prediction units are very powerful, and considering that you'll loop many thousands of times with every branch having the same result, you can afford a few cycles of warm up ;)

EDIT:
I compared our approaches to the problem with a simple console program (sources, it's c# though). The loop is executed 1000000 times and I used trigonometrical functions together with double precision floating point operations. Test 2 is the solution I showed above, and the three letters are the value of setting1, setting2 and setting3.
Results are:

test 1 - fft: 13974 ms
test 2 - fft: 14106 ms

test 1 - tft: 27728 ms
test 2 - tft: 28081 ms

test 1 - ttt: 41833 ms
test 2 - ttt: 41982 ms

I also did a test run with all three test functions empty to prove loop and calling overhead is minimal:

test 1 - fft: 4 ms
test 2 - fft: 4 ms

test 1 - tft: 8 ms
test 2 - tft: 8 ms

test 1 - ttt: 12 ms
test 2 - ttt: 12 ms

Effectively, my solution is about 1% slower. The second point of my answer, though is proven to be correct: loop overhead is completely trascurable.

回复收藏 0 原文

北笙凉宸 2024-12-14 02:14:59

对主循环使用模板。

template <bool A, bool B>
void loop() {
  while (1) {
      if (A) // will get compiled out if A == false
      {
         doStuff(data[i]);
         ....
      }
      if (B)
      {
         doMoreStuff(data[i]);
         .....
      }

      .... //etc
  }
}

当您更改设置时：（您可能可以减少此代码）

if (setting1) {
  if (setting2)
    loop<1,1>;
  else 
    loop<1,0>;
}
else {
  if (setting2)
    loop<0,1>;
  else 
    loop<0,0>;
}

您希望留在循环（）中直到设置更改。

应谨慎使用，因为它可能会导致膨胀。

分析了答案（G++ O2 优化）：

 %Time
 46.15      0.84     0.84                             fb() (blackbear)
 38.37      1.53     0.69                             fa() (OP)
 16.13      1.82     0.29                             fc() (pubby8)

Use templates for the main loop.

template <bool A, bool B>
void loop() {
  while (1) {
      if (A) // will get compiled out if A == false
      {
         doStuff(data[i]);
         ....
      }
      if (B)
      {
         doMoreStuff(data[i]);
         .....
      }

      .... //etc
  }
}

When you change settings: (you could probably make this less code)

if (setting1) {
  if (setting2)
    loop<1,1>;
  else 
    loop<1,0>;
}
else {
  if (setting2)
    loop<0,1>;
  else 
    loop<0,0>;
}

You want to stay in loop() until settings change.

This should be used with care as it can lead to bloat.

Profiled the answers (G++ O2 optimization):

 %Time
 46.15      0.84     0.84                             fb() (blackbear)
 38.37      1.53     0.69                             fa() (OP)
 16.13      1.82     0.29                             fc() (pubby8)

回复收藏 0 原文

~没有更多了~