Optimize this function (in C++)

Posted on 2024-12-15 07:34:54

I have CPU-consuming code in which a function with a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: how would you optimize this loop (there is not much more to optimize, though...)?

void theloop(int64_t in[], int64_t out[], size_t N)
{
    // 'max' is a member of the enclosing class (see the Edit below), carried across calls
    for(uint32_t i = 0; i < N; i++) {
        int64_t v = in[i];
        max += v;
        if (v > max) max = v;
        out[i] = max;
    }
}

I tried a few things, e.g. I replaced the arrays with pointers that are incremented in every loop iteration, but (surprisingly) I lost some performance instead of gaining...

Edit:

  • changed the name of one variable (itsMaximums was an error)
  • the function is a method of a class
  • in and out are int64_t, so they can be negative and positive
  • (v > max) can evaluate to true: consider the situation when the actual max is negative
  • the code runs on a 32-bit PC (development) and 64-bit (production)
  • N is unknown at compile time
  • I tried some SIMD, but failed to increase performance... (the overhead of moving the variables to __m128i, executing and storing back was higher than the SSE speed gain. But I am not an expert on SSE, so maybe my code was poor)

Results:

I added some loop unrolling, and a nice hack from Alex's post. Below I paste some results:

  1. original: 14.0s
  2. unrolled loop (4 iterations): 10.44s
  3. Alex's trick: 10.89s
  4. 2) and 3) at once: 11.71s

Strange that 4) is not faster than 2) and 3). Below is the code for 4):

for(size_t i = 1; i < N; i+=CHUNK) {   // CHUNK == 4, the unroll factor; note the loop
                                       // starts at i = 1 (the off-by-one mentioned in
                                       // one of the answers below)
    int64_t t_in0 = in[i+0];
    int64_t t_in1 = in[i+1];
    int64_t t_in2 = in[i+2];
    int64_t t_in3 = in[i+3];

    max &= -max >> 63;   // Alex's branchless trick: clamp a negative max to 0,
    max += t_in0;        // so the following addition yields just the input value
    out[i+0] = max;

    max &= -max >> 63;
    max += t_in1;
    out[i+1] = max;

    max &= -max >> 63;
    max += t_in2;
    out[i+2] = max;

    max &= -max >> 63;
    max += t_in3;
    out[i+3] = max;
}

Comments (8)

七色彩虹 2024-12-22 07:34:55

First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.

Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.

This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetic, including some that work on two 64-bit values.

Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.

For the if line, check that the compiler is generating a conditional move, rather than a branch.

Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.

Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)
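
A minimal sketch of the `__restrict` suggestion (assuming the GCC/Clang/MSVC `__restrict` extension; `theloop_restrict` and the `member_max` reference parameter are made-up stand-ins, since the OP's class isn't shown):

#include <cstddef>
#include <cstdint>

// Sketch: promise the compiler that in and out never alias, and keep the accumulator
// in a local so it can stay in a register for the duration of the loop.
void theloop_restrict(const int64_t* __restrict in,
                      int64_t* __restrict out,
                      std::size_t N,
                      int64_t& member_max)   // stands in for the class member 'max'
{
    int64_t max = member_max;
    for (std::size_t i = 0; i < N; ++i) {
        const int64_t v = in[i];
        max += v;
        if (v > max) max = v;               // hopefully a conditional move, not a branch
        out[i] = max;
    }
    member_max = max;                       // single write-back at the end
}

If N really were known at compile time, the same body could also take N as a template parameter and the arrays as references to arrays of exactly N elements, as suggested above.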

Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.

Edit

with your edit:

max &= -max >> 63;
max += t_in0;
out[i+0] = max;

what strikes me is that you added a huge dependency chain.
Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.

In other words, all these operations have to be serialized. You can't start one of them before the previous has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying it up with a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable)
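
For illustration, a sketch of what the conditional-move-friendly alternative could look like, reusing the OP's variables: the whole update collapses into a single select between `v` and `max + v` (equivalent to the original code apart from overflow corner cases, as noted in another answer):

// Sketch only: the OP's loop with the update written as one select.
// If the previous max is negative the new max is just v, otherwise it is max + v.
for (uint32_t i = 0; i < N; i++) {
    const int64_t v = in[i];
    max = (max < 0) ? v : max + v;   // a natural candidate for a cmov
    out[i] = max;
}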

晚雾 2024-12-22 07:34:55


> **Announcement**: see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
>
> > _Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for random data distributed uniformly will result in ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
>
> I have yet to find the time to update the post, but the explanation and code can be found through the chat.
>
> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP.
>
> **Edit**: ironically... that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because existing output wasn't being cleared, it appeared to have the correct output... (still editing...)


Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.

Find all the code here https://gist.github.com/1368992#file_test.cpp

Features

For a default config of

#define MAGNITUDE     20
#define ITERATIONS    1024
#define VERIFICATION  1
#define VERBOSE       0

#define LIMITED_RANGE 0    // hide difference in output due to absence of overflows
#define USE_FLOATS    0

It will (see output fragment here):

  • run 100 x 1024 iterations (i.e. 100 different random seeds)
  • for data length 1048576 (2^20).
  • The random input data is uniformly distributed over the full range of the element data type (int64_t)
  • Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.

Results

There are a number of (surprising or unsurprising) results:

  1. there is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See Makefile; my arch is 64bit, Intel Core Q9550 with gcc-4.6.1)

  2. The algorithms are not equivalent (you'll see hash sums differ): notably the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way (this can be hidden by defining

    #define LIMITED_RANGE 1
    

    which limits the input data so overflows won't occur; note that the partial_sum_incorrect version shows the equivalent C++ non-bitwise arithmetic operation that yields the same (different) results:

    return max<0 ? v :  max + v; 
    

    Perhaps that is OK for your purpose? A sketch of this variant follows the partial-sum listing below.)

  3. Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop; this is really no more than a piece of trivia here, because neither of the two methods is significantly faster...

  4. Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark

    #define USE_FLOATS    0
    

    showing that the STL-based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!).
    Note:

    • that the naming of partial_sum_incorrect only relates to integer overflow, which doesn't apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct :)
    • that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating point mode. See bullet 3.
  5. (And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)

Partial sum

For your interest, the partial sum application looks like this in C++11:

std::partial_sum(data.begin(), data.end(), output.begin(), 
        [](int64_t max, int64_t v) -> int64_t
        { 
            max += v;
            if (v > max) max = v;
            return max;
        });
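
The overflow-agnostic partial_sum_incorrect variant mentioned in the results above would presumably differ only in the lambda body, reusing the same data/output vectors; as a sketch (the actual code in the gist may be organized differently):

// Sketch of "partial_sum_incorrect": the same std::partial_sum call, but with the
// branch-free arithmetic reformulation from bullet 2 (different overflow behaviour).
std::partial_sum(data.begin(), data.end(), output.begin(),
        [](int64_t max, int64_t v) -> int64_t
        { return max < 0 ? v : max + v; });
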
爱给你人给你 2024-12-22 07:34:55

Sometimes you need to step back and look at it again. The first question is obviously: do you need this? Could there be an alternative algorithm that would perform better?

That being said, and supposing for the sake of this question that you already settled on this algorithm, we can try and reason about what we actually have.

Disclaimer: the method I am describing is inspired by the successful method Tim Peters used to improve the traditional introsort implementation, leading to TimSort. So please bear with me ;)

1. Extracting Properties

The main issue I can see is the dependency between iterations, which will prevent much of the possible optimizations and thwart many attempts at parallelizing.

int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;

Let us rework this code in a functional fashion:

max = calc(in[i], max);
out[i] = max;

Where:

int64_t calc(int64_t const in, int64_t const max) {
  int64_t const bumped = max + in;
  return in > bumped ? in : bumped;
}

Or rather, a simplified version (barring overflow, since it's undefined):

int64_t calc(int64_t const in, int64_t const max) {
  return 0 > max ? in : max + in;
}

Do you notice the tipping point? The behavior changes depending on whether the ill-named(*) max is positive or negative.

This tipping point makes it interesting to watch the values in the in array more closely, especially with regard to the effect they might have on max:

  • max < 0 and in[i] < 0 then out[i] = in[i] < 0
  • max < 0 and in[i] > 0 then out[i] = in[i] > 0
  • max > 0 and in[i] < 0 then out[i] = (max + in[i]) ?? 0
  • max > 0 and in[i] > 0 then out[i] = (max + in[i]) > 0

(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion though.

2. Optimizing operations

This leads us to discover interesting cases:

  • if we have a slice [i, j) of the array containing only negative values (which we call negative slice), then we could do a std::copy(in + i, in + j, out + i) and max = out[j-1]
  • if we have a slice [i, j) of the array containing only positive values, then it's a pure accumulation code (which can easily be unrolled)
  • max gets positive as soon as in[i] is positive

Therefore, it could be interesting (but maybe not, I make no promise) to establish a profile of the input before actually working with it. Note that the profile could be made chunk by chunk for large inputs, for example tuning the chunk size based on the cache line size.

For reference, the 3 routines:

void copy(int64_t const in[], int64_t out[],
          size_t const begin, size_t const end)
{
  std::copy(in + begin, in + end, out + begin);
} // copy

void accumulate(int64_t const in[], int64_t out[],
                size_t const begin, size_t const end)
{
  assert(begin != 0);

  int64_t max = out[begin-1];

  for (size_t i = begin; i != end; ++i) {
    max += in[i];
    out[i] = max;
  }
} // accumulate

void regular(int64_t const in[], int64_t out[],
             size_t const begin, size_t const end)
{
  assert(begin != 0);

  int64_t max = out[begin - 1];

  for (size_t i = begin; i != end; ++i)
  {
    max = 0 > max ? in[i] : max + in[i];
    out[i] = max;
  }
}

Now, supposing that we can somehow characterize the input using a simple structure:

struct Slice {
  enum class Type { Negative, Neutral, Positive };
  Type type;
  size_t begin;
  size_t end;
};

typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);

Func select(Slice::Type t) {
  switch(t) {
  case Slice::Type::Negative: return &copy;
  case Slice::Type::Neutral:  return &regular;
  case Slice::Type::Positive: return &accumulate;
  }
}

void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
  for (Slice const& slice: slices) {
    Func const f = select(slice.type);
    (*f)(in, out, slice.begin, slice.end);
  }
}

Now, unlike introsort, the work in the loop is minimal, so computing the characteristics might be too costly as is... however it lends itself well to parallelization.

3. Simple parallelization

Note that the characterization is a pure function of the input. Therefore, supposing that you work in a chunk per chunk fashion, it could be possible to have, in parallel:

  • Slice Producer: a characterizer thread, which computes the Slice::Type value
  • Slice Consumer: a worker thread, which actually executes the code

Even if the input is essentially random, provided the chunks are small enough (for example, a CPU L1 cache line) there might be chunks for which it does work. Synchronization between the two threads can be done with a simple thread-safe queue of Slices (producer/consumer), either adding a bool last attribute to stop consumption, or creating the Slices in a vector with an Unknown type and having the consumer block until the type is known (using atomics).

Note: because characterization is pure, it's embarrassingly parallel.
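
As a hedged sketch of that characterization step (the characterize function, its chunk parameter and the default chunk size are my own invention, reusing the Slice struct defined above; a real implementation would tune the chunk size as described):

#include <algorithm>   // std::min
#include <cstddef>
#include <cstdint>
#include <vector>

// Classify the input chunk by chunk: all-negative chunks become Negative slices,
// all-positive chunks become Positive slices, everything else stays Neutral.
// A chunk of 8 int64_t corresponds to one 64-byte L1 cache line.
std::vector<Slice> characterize(int64_t const in[], size_t N, size_t chunk = 8)
{
    std::vector<Slice> slices;
    for (size_t begin = 0; begin < N; begin += chunk) {
        size_t const end = std::min(begin + chunk, N);

        bool anyNeg = false, anyPos = false;
        for (size_t i = begin; i != end; ++i) {
            if (in[i] < 0) anyNeg = true;
            else if (in[i] > 0) anyPos = true;
        }

        Slice::Type const type = (anyNeg && !anyPos) ? Slice::Type::Negative
                               : (anyPos && !anyNeg) ? Slice::Type::Positive
                               :                       Slice::Type::Neutral;
        slices.push_back(Slice{type, begin, end});
    }
    return slices;
}

The consumer side would then simply walk the resulting slices through theLoop above.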

4. More Parallelization: Speculative work

Remember this innocent remark: max gets positive as soon as in[i] is positive.

Suppose that we can guess (reliably) that Slice[j-1] will produce a max value that is negative; then the computation on Slice[j] is independent of what preceded it, and we can start the work right now!

Of course, it's a guess, so we might be wrong... but once we have fully characterized all the Slices, we have idle cores, so we might as well use them for speculative work! And if we're wrong ? Well, the Consumer thread will simply gently erase our mistake and replace it with the correct value.

The heuristic to speculatively compute a Slice should be simple, and it will have to be tuned. It may be adaptive as well... but that may be more difficult!

Conclusion

Analyze your dataset and try to find out whether it's possible to break the dependencies. If it is, you can probably take advantage of that, even without going multi-threaded.

仅此而已 2024-12-22 07:34:55

If values of max and in[] are far away from the 64-bit min/max (say, they are always between -2^61 and +2^61), you may try a loop without the conditional branch, which may be causing some perf degradation:

for(uint32_t i = 1; i < N; i++) {
    max &= -max >> 63; // assuming >> would do arithmetic shift with sign extension
    max += in[i];
    out[i] = max;
}

In theory the compiler may do a similar trick as well, but without seeing the disassembly, it's hard to tell if it does it.

画中仙 2024-12-22 07:34:55

The code appears already pretty fast. Depending on the nature of the in array, you could try special casing, for instance if you happen to know that in a particular invocation all the input numbers are positive, out[i] will be equal to the cumulative sum, with no need for an if branch.
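
A minimal sketch of that special case (the function name is hypothetical; it assumes the whole call qualifies and ignores any running maximum carried over from a previous call): the output degenerates into a plain prefix sum.

#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical fast path: valid only when every in[i] is positive (and the incoming
// max is non-negative), so the branch in the original loop can never fire.
void theloop_all_positive(const int64_t in[], int64_t out[], std::size_t N)
{
    std::partial_sum(in, in + N, out);   // out[i] = in[0] + in[1] + ... + in[i]
}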

那支青花 2024-12-22 07:34:55

Ensuring the method isn't virtual, marking it inline or __attribute__((always_inline)), and compiling with -funroll-loops seem like good options to explore.

Only by you benchmarking them can we determine whether they are worthwhile optimizations in your bigger program.
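
A sketch of what that could look like with GCC or Clang (the Accumulator class is made up, since the OP's class isn't shown); -funroll-loops would then be passed at compile time, e.g. g++ -O3 -funroll-loops:

#include <cstddef>
#include <cstdint>

// Hypothetical wrapper class: a non-virtual method, forced inline at its call sites.
struct Accumulator {
    int64_t max = 0;

    __attribute__((always_inline)) inline
    void theloop(const int64_t in[], int64_t out[], std::size_t N)
    {
        for (std::size_t i = 0; i < N; ++i) {
            const int64_t v = in[i];
            max += v;
            if (v > max) max = v;
            out[i] = max;
        }
    }
};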

画离情绘悲伤 2024-12-22 07:34:55

The only thing that comes to mind that might help a small bit is to use pointers rather than array indices within your loop, something like

void theloop(int64_t in[], int64_t out[], size_t N)
{
    int64_t max = in[0];
    out[0] = max;
    int64_t *ip = in + 1,*op = out+1;

    for(uint32_t i = 1; i < N; i++) {
        int64_t v = *ip; 
        ip++;
        max += v;
        if (v > max) max = v;
        *op = max;
        op++;
    }
}

The thinking here is that an index into an array is liable to compile as taking the base address of the array, multiplying the element size by the index, and adding the result to get the element address. Keeping running pointers avoids this. I'm guessing a good optimizing compiler will do this already, so you'd need to study the current assembler output.

人海汹涌 2024-12-22 07:34:55
int64_t max = 0, i;

for(i=N-1; i > 0; --i) /* Comparing with 0 is faster */  
{  
    max = in[i] > 0 ? max+in[i] : in[i];
    out[i] = max;

    --i;  /* Will reduce checking of i>=0 by N/2 times */  

    max = in[i] > 0 ? max+in[i] : in[i]; /* Reduce operations v=in[i], max+=v by N times */  
    out[i] = max;         
}

if(0 == i) /* When N is odd */
{ 
    max = in[i] > 0 ? max+in[i] : in[i]; 
    out[i] = max;
}