DPPS on an old version of GCC
Hi!
I need to optimize some matrix multiplication code in C, and I'm doing it with SSE vector instructions. I also found that SSE4.1 already has an instruction for the dot product, dpps.
The problem is that the machine this software is supposed to run on has an old version of gcc installed (4.1.2), which has no support for SSE4.1, although its processor does support it (don't ask me why the gcc version is older than the processor...). So I cannot use the _mm_dp_ps function.
I have been playing around a bit with adding some assembler code to the C source. The problem is that I have never used assembler before, so it's really confusing. Also, is it more efficient to write all the code that deals with vector instructions in assembler?
So I am asking here whether there is any other way to use the dpps instruction, and whether it is even worth using.
Comments (1)
Frankly, I do not see the problem. From your description, the machine on which the final code needs to run supports SSE4.1 and DPPS. Therefore, once your source code, including the intrinsic (or assembly), is compiled, it can be executed on this machine. You only have to get your code compiled with a newer version of the compiler, either by installing a newer version on the machine you are talking about or by compiling on a different machine and then copying the executable to the machine it has to run on.
As to whether optimisation with DPPS is worth the effort, that depends on your code (i.e. how much potential for optimisation there is; you should profile thoroughly to find out where your bottlenecks are) and on how important performance actually is in this specific case (i.e. is it worth your time? Time is money).
Obviously, if you have little assembly experience, implementing your routine in asm, or maybe even just writing your own asm wrapper function around DPPS, becomes less attractive. (But it is certainly possible to do.)
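For what it's worth, the asm wrapper mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: the name dot4_dpps is made up for this example, the code uses GCC extended inline asm with nothing but the SSE1 header, and it assumes the assembler installed alongside the compiler recognises the dpps mnemonic (an old binutils may not, in which case building with a newer toolchain is still the easier route). On 32-bit x86 it would also need -msse.

```c
#include <xmmintrin.h>  /* SSE1 only: __m128, _mm_loadu_ps, _mm_cvtss_f32 */

/* Hypothetical helper (not from the original post): dot product of two
 * 4-float vectors using the SSE4.1 DPPS instruction via inline asm.
 * Immediate 0xF1: the high nibble 0xF multiplies all four lanes, the
 * low nibble 0x1 writes the sum to lane 0 only (other lanes become 0).
 * The CPU must support SSE4.1, otherwise this raises an illegal
 * instruction fault at run time. */
static inline float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned loads, so any float* works */
    __m128 vb = _mm_loadu_ps(b);

    /* AT&T syntax: dpps imm8, source, destination.
     * "+x" marks va as a read-write SSE register, "x" keeps vb in a register. */
    __asm__("dpps $0xF1, %1, %0" : "+x"(va) : "x"(vb));

    return _mm_cvtss_f32(va);      /* extract lane 0 */
}
```

Calling dot4_dpps on {1, 2, 3, 4} and {5, 6, 7, 8} gives 70. Whether this actually beats a plain SSE multiply followed by a few shuffles and adds in a matrix multiplication is something only profiling will tell you.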