计算距离平方的最快方法

发布于 2024-12-14 20:25:12 字数 599 浏览 0 评论 0原文

我的代码很大程度上依赖于计算 3D 空间中两点之间的距离。 为了避免昂贵的平方根,我始终使用平方距离。 但它仍然占用了计算时间的主要部分,我想用更快的函数替换我的简单函数。 我现在有:

double distance_squared(double *a, double *b)
{
  double dx = a[0] - b[0];
  double dy = a[1] - b[1];
  double dz = a[2] - b[2];

  return dx*dx + dy*dy + dz*dz;
}

我还尝试使用宏来避免函数调用,但它没有多大帮助。

#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])

我考虑过使用 SIMD 指令,但找不到一个好的示例或完整的指令列表(最好是两个向量上的乘法+加法)。

GPU 不是一种选择,因为每次函数调用时只知道一组点。

计算距离平方的最快方法是什么?

My code relies heavily on computing distances between two points in 3D space.
To avoid the expensive square root I use the squared distance throughout.
But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster.
I now have:

double distance_squared(double *a, double *b)
{
  double dx = a[0] - b[0];
  double dy = a[1] - b[1];
  double dz = a[2] - b[2];

  return dx*dx + dy*dy + dz*dz;
}

I also tried using a macro to avoid the function call but it doesn't help much.

#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])

I thought about using SIMD instructions but could not find a good example or complete list of instructions (ideally some multiply+add on two vectors).

GPU's are not an option since only one set of points is known at each function call.

What would be the fastest way to compute the distance squared?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

伴我老 2024-12-21 20:25:12

一个好的编译器会像你管理的那样优化它。如果一个好的编译器认为 SIMD 指令有用的话,它就会使用它。确保为编译器打开所有此类可能的优化。不幸的是,3 维向量并不适合 SIMD 单元。

我怀疑您将不得不接受编译器生成的代码可能非常接近最佳并且无法获得显着的收益。

A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.

I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.

满栀 2024-12-21 20:25:12

第一个明显的事情是使用 restrict 关键字。

现在,ab 是可别名的(因此,从编译器的角度来看,假设最坏的可能情况别名) 。没有编译器会自动对其进行向量化,因为这样做是错误的。

更糟糕的是,编译器不仅不能矢量化这样的循环,如果您还存储(幸运的是不在您的示例中),它还必须每次重新加载值。始终清楚别名,因为它极大地影响编译器。

接下来,如果您可以接受这一点,请使用 float 而不是 double 并填充到 4 个浮点数,即使其中一个浮点数未使用,这是一种更“自然”的数据布局大多数CPU(这在某种程度上是特定于平台的,但对于大多数平台来说4个浮点数是一个很好的猜测——3个双精度数,即“典型”CPU上的1.5个SIMD寄存器,在任何地方都不是最佳的)。

(对于手写的 SIMD 实现(这比您想象的要难),首先要确保数据对齐。接下来,查看您的指令在目标机器上的延迟,并首先执行最长的延迟。例如在 Prescott 之前的 Intel 上,首先将每个组件洗牌到寄存器中,然后与自身相乘,即使使用 3 次乘法而不是 1 次,因为洗牌有很长的延迟,在后来的模型上,洗牌需要一个周期。 ,所以这将是完全反优化的。
这再次表明,将其留给编译器并不是一个坏主意。)

The first obvious thing would be to use the restrict keyword.

As it is now, a and b are aliasable (and thus, from the compiler's point of view which assumes the worst possible case are aliased). No compiler will auto-vectorize this, as it is wrong to do so.

Worse, not only can the compiler not vectorize such a loop, in case you also store (luckily not in your example), it also must re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.

Next, if you can live with that, use float instead of double and pad to 4 floats even if one is unused, this is a more "natural" data layout for the majority of CPUs (this is somewhat platform specific, but 4 floats is a good guess for most platforms -- 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere).

(For a hand-written SIMD implementation (which is harder than you think), first and before all be sure to have aligned data. Next, look into what latencies your instrucitons have on the target machine and do the longest ones first. For example on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On the later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)

萌无敌 2024-12-21 20:25:12

用于执行此操作的 SIMD 代码(使用 SSE3):

movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0

但您需要四个值向量 (x,y,z,0) 才能实现此操作。如果您只有三个值,那么您需要做一些调整才能获得所需的格式,这会抵消上述的任何好处。

但总的来说,由于 CPU 的超标量流水线架构,获得性能的最佳方法是对大量数据执行相同的操作,这样您就可以交错各个步骤并进行一些循环展开以避免流水线停顿。基于“不能直接使用修改后的值”原则,上面的代码肯定会在最后三个指令上停止 - 第二条指令必须等待前一条指令的结果完成,这在管道系统。

同时对两个或多个不同的点集进行计算可以消除上述瓶颈 - 在等待一个计算结果的同时,您可以开始下一个点的计算:

movaps xmm0,a1
                  movaps xmm2,a2
movaps xmm1,b1
                  movaps xmm3,b2
subps xmm0,xmm1
                  subps xmm2,xmm3
mulps xmm0,xmm0
                  mulps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2

The SIMD code to do this (using SSE3):

movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0

but you need four value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format which would cancel out any benefit of the above.

In general though, due to the superscalar pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data, that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definately stall on the last three instructions based on the "can't use a value directly after it's modified" principle - the second instruction has to wait for the result of the previous instruction to complete which isn't good in a pipelined system.

Doing the calculation on two or more different sets points of points at the same time can remove the above bottleneck - whilst waiting for the result of one computation, you can start the computation of the next point:

movaps xmm0,a1
                  movaps xmm2,a2
movaps xmm1,b1
                  movaps xmm3,b2
subps xmm0,xmm1
                  subps xmm2,xmm3
mulps xmm0,xmm0
                  mulps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
haddps xmm0,xmm0
                  haddps xmm2,xmm2
如梦初醒的夏天 2024-12-21 20:25:12

如果您想优化某些内容,请首先分析代码并检查汇编程序输出。

使用 gcc -O3 (4.6.1) 编译后,我们将得到 SIMD 的良好反汇编输出:

movsd   (%rdi), %xmm0
movsd   8(%rdi), %xmm2
subsd   (%rsi), %xmm0
movsd   16(%rdi), %xmm1
subsd   8(%rsi), %xmm2
subsd   16(%rsi), %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0

If you would like to optimize something, at first profile code and inspect assembler output.

After compiling it with gcc -O3 (4.6.1) we'll have nice disassembled output with SIMD:

movsd   (%rdi), %xmm0
movsd   8(%rdi), %xmm2
subsd   (%rsi), %xmm0
movsd   16(%rdi), %xmm1
subsd   8(%rsi), %xmm2
subsd   16(%rsi), %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0
我的鱼塘能养鲲 2024-12-21 20:25:12

此类问题在 MD 模拟中经常出现。通常通过cutoffs和邻居列表来减少计算量,从而减少计算次数。然而,平方距离的实际计算完全按照您的问题中给出的方式完成(使用编译器优化和固定类型 float[3])。

因此,如果您想减少平方计算量,您应该告诉我们有关该问题的更多信息。

This type of problem often occurs in MD simulations. Usually the amount of calculations is reduced by cutoffs and neighbor lists, so the number for the calculation is reduced. The actual calculation of the squared distances however is exactly done (with compiler optimizations and a fixed type float[3]) as given in your question.

So if you want to reduce the amount of squared calculations you should tell us more about the problem.

固执像三岁 2024-12-21 20:25:12

也许直接传递 6 个双精度数作为参数可以使其更快(因为它可以避免数组取消引用):

inline double distsquare_coord(double xa, double ya, double za, 
                               double xb, double yb, double zb) 
{ 
  double dx = xa-yb; double dy=ya-yb; double dz=za-zb;
  return dx*dx + dy*dy + dz*dz; 
}

或者,如果附近有很多点,您可以通过线性近似计算距离(到同一固定的其他点)其他近点的距离。

Perhaps passing the 6 doubles directly as arguments could make it faster (because it could avoid the array dereference):

inline double distsquare_coord(double xa, double ya, double za, 
                               double xb, double yb, double zb) 
{ 
  double dx = xa-yb; double dy=ya-yb; double dz=za-zb;
  return dx*dx + dy*dy + dz*dz; 
}

Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.

泅渡 2024-12-21 20:25:12

如果您可以重新排列数据以同时处理两对输入向量,则可以使用此代码(仅限 SSE2)

// @brief Computes two squared distances between two pairs of 3D vectors
// @param a
//  Pointer to the first pair of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param b
//  Pointer to the second pairs of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param c
//  Pointer to the output 2 element array.
//  Must be aligned by 16 (2 doubles).
//  The two distances between a and b vectors will be written to c[0] and c[1] respectively.
void (const double * __restrict__ a, const double * __restrict__ b, double * __restrict c) {
    // diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
    __m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
    // diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
    __m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
    // diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
    __m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));

    // prod0 = ( d0.y * d0.y, d0.x * d0.x )
    __m128d prod0 = _mm_mul_pd(diff0, diff0);
    // prod1 = ( d1.x * d1.x, d0.z * d0.z )
    __m128d prod1 = _mm_mul_pd(diff1, diff1);
    // prod2 = ( d1.z * d1.z, d1.y * d1.y )
    __m128d prod2 = _mm_mul_pd(diff1, diff1);

    // _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
    // psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
    __m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);

    // _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
    // dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
    __m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);

    __mm_store_pd(c, dotprod);
}

If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only)

// @brief Computes two squared distances between two pairs of 3D vectors
// @param a
//  Pointer to the first pair of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param b
//  Pointer to the second pairs of 3D vectors.
//  The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
//  Must be aligned by 16 (2 doubles).
// @param c
//  Pointer to the output 2 element array.
//  Must be aligned by 16 (2 doubles).
//  The two distances between a and b vectors will be written to c[0] and c[1] respectively.
void (const double * __restrict__ a, const double * __restrict__ b, double * __restrict c) {
    // diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
    __m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
    // diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
    __m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
    // diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
    __m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));

    // prod0 = ( d0.y * d0.y, d0.x * d0.x )
    __m128d prod0 = _mm_mul_pd(diff0, diff0);
    // prod1 = ( d1.x * d1.x, d0.z * d0.z )
    __m128d prod1 = _mm_mul_pd(diff1, diff1);
    // prod2 = ( d1.z * d1.z, d1.y * d1.y )
    __m128d prod2 = _mm_mul_pd(diff1, diff1);

    // _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
    // psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
    __m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);

    // _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
    // dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
    __m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);

    __mm_store_pd(c, dotprod);
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文