Multiplying a vector by a constant using SSE

Posted 2024-10-21 10:55:46

I have some code that operates on 4D vectors, and I'm currently trying to convert it to use SSE. I'm using both clang and gcc on 64-bit Linux.
Operating only on vectors is fine; I've grasped that. But now comes a part where I have to multiply an entire vector by a single constant. I want to go from something like this:

float y[4];
float a1 =   25.0/216.0;  

for(j=0; j<4; j++){  
    y[j] = a1 * x[j];  
}

to something like this:

float4 y;
float a1 =   25.0/216.0;  

y = a1 * x;

where:

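typedef float v4sf __attribute__ ((vector_size(4*sizeof(float)))); /* element type must be float, not double */

typedef union float4{
    v4sf v;
    struct { float x,y,z,w; }; /* anonymous struct, so x,y,z,w map to separate lanes */
} float4;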

This of course will not work, because I'm trying to multiply incompatible data types.
Now, I could do something like:
float4 a1 = { .v = { 25.0/216.0, 25.0/216.0, 25.0/216.0, 25.0/216.0 } };
but that just makes me feel silly, even if I write a macro to do it.
Also, I'm pretty certain that it will not result in very efficient code.

Googling this brought up no clear answers (see "Load constant floats into SSE registers").

So what is the best way to multiply an entire vector by the same constant?

柳絮泡泡 2024-10-28 10:55:46

Just use intrinsics and let the compiler take care of it, e.g.

__m128 vb = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); // arguments run from the highest lane to the lowest, so lane 0 holds 4.0f
__m128 va = _mm_set1_ps(25.0f / 216.0f); // va = { 25.0f / 216.0f, 25.0f / 216.0f, 25.0f / 216.0f, 25.0f / 216.0f }
__m128 vc = _mm_mul_ps(va, vb); // vc = va * vb

If you look at the generated code, it should be quite efficient: the 25.0f / 216.0f value will be calculated at compile time, and _mm_set1_ps usually generates reasonably efficient code for splatting a vector.

Note also that you normally only initialise a constant vector such as va just once, prior to entering a loop where you will be doing most of the actual work, so it tends not to be performance-critical.
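To illustrate the point about initialising the constant vector once, here is a minimal sketch of scaling a whole array with the same intrinsics. The function name scale_array and its parameters are made up for this example and are not part of the question; unaligned loads and stores are used so that no alignment is assumed.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_set1_ps, _mm_mul_ps */

/* Hypothetical helper: dst[i] = s * src[i] for i in [0, n). */
void scale_array(float *dst, const float *src, size_t n, float s)
{
    /* Broadcast the scalar into all four lanes once, outside the loop. */
    const __m128 vs = _mm_set1_ps(s);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Unaligned load/store, so the arrays need no special alignment. */
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(vs, v));
    }
    /* Scalar tail for lengths that are not a multiple of four. */
    for (; i < n; i++) {
        dst[i] = s * src[i];
    }
}
```

Calling scale_array(y, x, 4, 25.0f / 216.0f) would correspond to the loop in the question; whether the compiler produces something better than this by auto-vectorising the plain loop is worth checking in the generated assembly.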

梦醒时光 2024-10-28 10:55:46

There is no reason one should have to use intrinsics for this. The OP just wants to do a broadcast, which is as basic a SIMD operation as SIMD addition. Any decent SIMD library/extension has to support broadcasts. Agner Fog's vector class certainly does, OpenCL does, and the GCC documentation clearly shows that it does:

a = b + 1;    /* a = b + {1,1,1,1}; */
a = 2 * b;    /* a = {2,2,2,2} * b; */

The following code compiles just fine

#include <stdio.h>
int main() {     
    typedef float float4 __attribute__ ((vector_size (16)));

    float4 x = {1,2,3,4};
    float4 y = (25.0f/216.0f)*x;
    printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
    //0.115741 0.231481 0.347222 0.462963
}

You can see the results at http://coliru.stacked-crooked.com/a/de79cca2fb5d4b11

Compare that code to the intrinsic code and it's clear which one is more readable. Not only is it more readable, it's also easier to port to e.g. ARM Neon. It also looks very similar to OpenCL C code.
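Applied to the union from the question, the same idea might look like the sketch below. It assumes the element type of v4sf is float and that x, y, z, w sit in an anonymous struct (C11 or GNU C); the helper name float4_scale is hypothetical. The scalar is kept as a float variable, since GCC may refuse to broadcast a double that cannot be converted to float without truncation.

```c
/* Four packed floats, using the GCC/clang vector extension. */
typedef float v4sf __attribute__((vector_size(4 * sizeof(float))));

typedef union float4 {
    v4sf v;
    struct { float x, y, z, w; };  /* anonymous struct: one member per lane */
} float4;

/* Hypothetical helper: multiply every lane of a by the scalar s. */
static inline float4 float4_scale(float s, float4 a)
{
    float4 r;
    r.v = s * a.v;  /* the scalar operand is broadcast to { s, s, s, s } */
    return r;
}
```

With this in place, the line from the question becomes y = float4_scale(a1, x), or simply y.v = a1 * x.v if no helper is wanted.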

玩物 2024-10-28 10:55:46

This might not be the best way, but it was the approach I took when I was dabbling in SSE.

float4 scale(const float s, const float4 a)
{
  v4sf sv = { s, s, s, 0.0f }; /* the w lane is multiplied by 0.0f; use { s, s, s, s } to scale all four lanes */
  float4 r = { .v = __builtin_ia32_mulps(sv, a.v) };
  return r;
}

float4 y;
float a1;

y = scale(a1, y);
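A variant of this function that scales all four lanes and relies only on the documented intrinsics from xmmintrin.h could look like the following sketch. The __builtin_ia32_* builtins are compiler-specific, and as far as I know clang does not accept all of them, so this version may be the safer choice when building with both compilers. The union layout with an anonymous struct and the name scale_all_lanes are assumptions made for this example.

```c
#include <xmmintrin.h>  /* __m128, _mm_set1_ps, _mm_mul_ps */

typedef union float4 {
    __m128 v;
    struct { float x, y, z, w; };  /* anonymous struct: one member per lane */
} float4;

/* Hypothetical variant of scale(): multiplies x, y, z and w by s. */
static inline float4 scale_all_lanes(float s, float4 a)
{
    float4 r;
    r.v = _mm_mul_ps(_mm_set1_ps(s), a.v);  /* broadcast s, then multiply lane by lane */
    return r;
}
```

Note that the inputs need to be initialised before the call; in the usage shown above, both a1 and y are read before anything has been assigned to them.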
