What is the maximum theoretical speed-up from SSE for a simple binary subtraction?

Posted on 2024-08-05 19:22:19

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a lack-of-understanding barrier on my part. There's a bit more to it, but the simplest question I can come up with is as follows:

If I have the following code:

float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltx, delty, deltz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for(i=0;i<32768-1;i++) {
  xref=px[i];
  yref=py[i];
  zref=pz[i];
  for(j=0;j<32768-1;j++) {
    deltx=xref-px[j];
    delty=yref-py[j];
    deltz=zref-pz[j];
  }
}

What maximum theoretical speed-up could I expect from going to SSE instructions, given that I have complete control over the code (assembly, intrinsics, whatever) but no control over the runtime environment other than the architecture? (It's a multi-user environment, so I can't do anything about how the OS kernel assigns time to my particular process.)

Right now I'm seeing a speed-up of 3x with my code, when I would have thought SSE would give me much more vector depth than that (presumably the 3x speed-up tells me the maximum theoretical throughput is 4x). (I've tried things such as making deltx/delty/deltz arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speed-up.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but obviously no intrinsics.

Comments (4)

卸妝后依然美 2024-08-12 19:22:20

Consider: How wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.
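
As a worked instance of that ratio, assuming 32-bit single-precision floats and the 128-bit XMM registers that SSE operates on (neither width is stated in the question):

```latex
\[
\text{upper bound} \approx
\frac{128\ \text{bits per SSE register}}{32\ \text{bits per float}} = 4
\]
```

This counts lanes per instruction only; it says nothing about how many instructions the CPU can issue per cycle.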

It's also worth noting that out-of-order pipelines play havoc with getting good estimates of speed-up.

埋情葬爱 2024-08-12 19:22:20

You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because the data set is 384 KB in total (three arrays of 32768 floats) and so probably still fits in the L2, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
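
A minimal sketch of what tiling the j loop could look like, in C. The function name `pairwise_tiled`, the `tile` parameter, and the `acc` output are hypothetical; the sum of squared deltas is only a placeholder so that the deltas are actually used, since the question does not say what consumes them.

```c
#include <stddef.h>

/* Tiled version of the pairwise loop from the question.
 * For each block of `tile` consecutive j values, every i is processed
 * before moving to the next block, so the block of px/py/pz can stay
 * in the L1 data cache while it is reused.
 * acc[i] receives the sum of squared deltas from point i to all points;
 * this reduction is a placeholder, not part of the original code. */
void pairwise_tiled(const float *px, const float *py, const float *pz,
                    float *acc, size_t n, size_t tile)
{
    for (size_t i = 0; i < n; i++)
        acc[i] = 0.0f;

    for (size_t jj = 0; jj < n; jj += tile) {
        size_t jend = (jj + tile < n) ? jj + tile : n;  /* clamp last tile */
        for (size_t i = 0; i < n; i++) {
            float xref = px[i], yref = py[i], zref = pz[i];
            float sum = 0.0f;
            for (size_t j = jj; j < jend; j++) {
                float deltx = xref - px[j];
                float delty = yref - py[j];
                float deltz = zref - pz[j];
                sum += deltx * deltx + delty * delty + deltz * deltz;
            }
            acc[i] += sum;
        }
    }
}
```

A tile of 1024 elements would touch 3 × 1024 × 4 bytes = 12 KB of input per block, which is a guess at something that fits in a typical L1 data cache; the right value depends on the CPU and should be measured.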

清风无影 2024-08-12 19:22:19

It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.

Most CPUs can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.

But you'll have to look up the specific instruction throughput for the CPU you're running on.

A practical speedup of 3x is pretty good though.

旧话新听 2024-08-12 19:22:19

I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.

Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
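
A rough sketch of processing four elements per iteration with SSE intrinsics, in C. Because px, py and pz in the question are separate arrays, four consecutive values of one component can be loaded directly, with no interleaving needed. The function name `sub_from_ref` and the `out` array are hypothetical, and the scalar fallback is only there so the code still builds where SSE is not enabled.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Computes out[j] = ref - p[j] for j in [0, n).
 * With SSE available, four floats are subtracted per instruction;
 * any remaining elements (n not a multiple of 4) are handled by the
 * scalar loop at the end. */
void sub_from_ref(const float *p, float ref, float *out, size_t n)
{
    size_t j = 0;
#if defined(__SSE__)
    __m128 vref = _mm_set1_ps(ref);             /* broadcast ref to 4 lanes */
    for (; j + 4 <= n; j += 4) {
        __m128 v = _mm_loadu_ps(p + j);         /* unaligned load of 4 floats */
        _mm_storeu_ps(out + j, _mm_sub_ps(vref, v));
    }
#endif
    for (; j < n; j++)                          /* scalar tail / fallback */
        out[j] = ref - p[j];
}
```

For one value of i, this would be called three times, once each for px with xref, py with yref, and pz with zref. The unaligned load and store are used because the question does not say how the arrays are aligned.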
