Multiplying a vector by a constant using SSE
I have some code that operates on 4D vectors and I'm currently trying to convert it to use SSE. I'm using both clang and gcc on 64-bit Linux.
Operating only on vectors is fine; I've grasped that. But now comes a part where I have to multiply an entire vector by a single constant. I want to go from something like this:
float y[4];
float a1 = 25.0/216.0;
for(j=0; j<4; j++){
y[j] = a1 * x[j];
}
to something like this:
float4 y;
float a1 = 25.0/216.0;
y = a1 * x;
where:
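typedef float v4sf __attribute__ ((vector_size(4*sizeof(float)))); /* element type must be float for a 4 x float vector */
typedef union float4{
v4sf v;
struct { float x,y,z,w; }; /* anonymous struct so x,y,z,w map to the four lanes */
} float4;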
This of course will not work, because I'm trying to multiply incompatible data types.
Now, I could do something like:
float4 a1 = (v4sf){25.0/216.0, 25.0/216.0, 25.0/216.0, 25.0/216.0}
but that just makes me feel silly, even if I write a macro to do it.
Also, I'm pretty certain that it will not result in very efficient code.
Googling this brought no clear answers (see "Load constant floats into SSE registers").
So what is the best way to multiply an entire vector by the same constant?
Comments (3)
Just use intrinsics and let the compiler take care of it.
If you look at the generated code it should be quite efficient: the 25.0f / 16.0f value will be calculated at compile time, and _mm_set1_ps usually generates reasonably efficient code for splatting a vector.
Note also that you normally initialise a constant vector such as va just once, prior to entering a loop where you will be doing most of the actual work, so it tends not to be performance-critical.
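The code sample that went with the intrinsics answer is not preserved here. As a minimal sketch of what it describes, assuming an x86 target with SSE available, the constant can be broadcast once with _mm_set1_ps and then reused inside the loop. The helper name scale_array and its parameters are hypothetical and not taken from the original answer; only va, _mm_set1_ps and the general approach come from the text.

```c
#include <stddef.h>      /* size_t */
#include <xmmintrin.h>   /* SSE: __m128, _mm_set1_ps, _mm_mul_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper (not from the original answer): y[i] = a * x[i] for i in [0, n). */
static void scale_array(float *y, const float *x, size_t n, float a)
{
    /* Broadcast ("splat") the scalar into all four lanes once, before the loop. */
    const __m128 va = _mm_set1_ps(a);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);           /* unaligned load of 4 floats */
        _mm_storeu_ps(y + i, _mm_mul_ps(va, vx));  /* 4 multiplications in one instruction */
    }
    for (; i < n; ++i)                             /* scalar tail when n is not a multiple of 4 */
        y[i] = a * x[i];
}
```

For the case in the question this would be called as scale_array(y, x, 4, 25.0f / 216.0f). The unaligned load and store are used so the sketch makes no assumption about the alignment of x and y; with 16-byte aligned data, _mm_load_ps and _mm_store_ps could be used instead.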
There is no reason one should have to use intrinsics for this. The OP just wants to do a broadcast, which is as basic a SIMD operation as SIMD addition. Any decent SIMD library or extension has to support broadcasts: Agner Fog's vector class certainly does, OpenCL does, and the GCC documentation clearly shows that it does.
The following code compiles just fine
You can see the results at http://coliru.stacked-crooked.com/a/de79cca2fb5d4b11
Compare that code to the intrinsic code and it's clear which one is more readable. Not only is it more readable, it is also easier to port to, e.g., ARM Neon. It also looks very similar to OpenCL C code.
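The code this answer refers to is also missing from this copy. A minimal sketch of the idea, assuming a compiler that implements GCC-style vector extensions (recent GCC and Clang do), might look like the following. The helper name scale4 is hypothetical, and float4 is declared here directly as a vector type rather than as the union from the question.

```c
/* GCC/Clang vector extension: four packed floats (16 bytes). */
typedef float float4 __attribute__((vector_size(4 * sizeof(float))));

/* Hypothetical helper (not from the original answer): multiply every lane of x by a. */
static float4 scale4(float a, float4 x)
{
    /* With vector extensions, a scalar operand in a binary vector
       operation is broadcast to all lanes, so no explicit splat is needed. */
    return a * x;
}
```

With this, the line from the question becomes y = scale4(a1, x), or simply y = a1 * x when x and y are declared as float4 vectors. Individual lanes can be read with subscripts such as y[0]. Note that the scalar should already have the vector's element type (a float here), since GCC rejects scalars that cannot be converted to the element type without truncation.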
This might not be the best way, but it was the approach I took when I was dabbling in SSE.