Getting started with Intel x86 SSE SIMD instructions
I want to learn more about using the SSE.
What ways are there to learn, besides the obvious one of reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals?
I'm mainly interested in working with the GCC X86 Built-in Functions.
First, I don't recommend using the built-in functions: they are not portable (across compilers for the same architecture).
Use intrinsics instead. GCC does a wonderful job optimizing SSE intrinsics into even more optimized code. You can always have a peek at the assembly and see how to use SSE to its full potential.
Intrinsics are easy: they look just like normal function calls.
Use _mm_load_ps or _mm_loadu_ps to load data from arrays. Of course there are way more options; SSE is really powerful and, in my opinion, relatively easy to learn.
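To make the "just like normal function calls" point concrete, here is a minimal sketch of my own (not code from the original answer). It assumes an x86 target with SSE enabled and GCC or Clang; add_arrays is a hypothetical helper name. It uses the unaligned load _mm_loadu_ps so the arrays need no special alignment; _mm_load_ps would require 16-byte aligned pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Hypothetical helper: out[i] = a[i] + b[i], four floats per step. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    assert(a && b && out);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats, any alignment */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 additions at once */
    }
    for (; i < n; i++) /* scalar tail for the leftover 0..3 elements */
        out[i] = a[i] + b[i];
}
```

Compiling with gcc -O2 -S and reading the generated assembly is one way to see which SSE instructions these calls turn into.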
See also https://stackoverflow.com/tags/sse/info for some links to guides.
Since you asked for resources:
A practical guide to using SSE with C++: Good conceptual overview on how to use SSE effectively, with examples.
MSDN Listing of Compiler Intrinsics: Comprehensive reference for all your intrinsic needs. It's MSDN, but pretty much all the intrinsics listed here are supported by GCC and ICC as well.
Christopher Wright's SSE Page: Quick reference on the meanings of the SSE opcodes. I guess the Intel Manuals can serve the same function, but this is faster.
It's probably best to write most of your code in intrinsics, but do check the objdump of your compiler's output to make sure that it's producing efficient code. SIMD code generation is still a fairly new technology and it's very possible that the compiler might get it wrong in some cases.
I find Dr. Agner Fog's research & optimization guides very valuable! He also has some libraries & testing tools that I have not tried yet.
http://www.agner.org/optimize/
Step 1: write some assembly manually
I recommend that you first try to write your own assembly manually to see and control exactly what is happening when you start learning.
Then the question becomes how to observe what is happening in the program, and one answer is to print and assert things.
Using the C standard library yourself requires a little bit of work, but nothing much. I have, for example, done this work nicely for you on Linux in the following files of my test setup:
Using those helpers, I then start playing around with the basics, such as:
addpd.S (GitHub upstream)
paddq.S (GitHub upstream)
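The contents of addpd.S are not reproduced in this thread, so the following is only a rough sketch of the same idea and not the author's file. Instead of a standalone .S file, it assumes GCC or Clang extended inline assembly on x86-64, and the function names addpd_asm and check_addpd are made up for illustration. It runs ADDPD on two pairs of doubles, then prints and asserts the result.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: out[0..1] = a[0..1] + b[0..1] using ADDPD.
   Assumes x86-64 and GCC/Clang extended inline assembly (AT&T syntax). */
void addpd_asm(const double *a, const double *b, double *out) {
    __asm__ volatile(
        "movupd (%1), %%xmm0\n\t"   /* xmm0 = {a[0], a[1]} */
        "movupd (%2), %%xmm1\n\t"   /* xmm1 = {b[0], b[1]} */
        "addpd  %%xmm1, %%xmm0\n\t" /* xmm0 += xmm1, lane by lane */
        "movupd %%xmm0, (%0)\n\t"   /* store both lanes to out */
        :
        : "r"(out), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
}

/* Print and assert the result, in the spirit of the step above. */
int check_addpd(void) {
    double a[2] = {1.5, 2.5};
    double b[2] = {0.25, 0.5};
    double out[2] = {0.0, 0.0};
    addpd_asm(a, b, out);
    printf("%g %g\n", out[0], out[1]);
    assert(out[0] == 1.75);
    assert(out[1] == 3.0);
    return 0;
}
```

The unaligned moves (movupd) are used so the arrays need no 16-byte alignment; the values are chosen to be exactly representable so the equality asserts are safe.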
Step 2: write some intrinsics
For production code however, you will likely want to use the pre-existing intrinsics instead of raw assembly as mentioned at: https://stackoverflow.com/a/1390802/895245
So now I try to convert the previous examples into more or less equivalent C code with intrinsics.
addpq.c (GitHub upstream)
paddq.c (GitHub upstream)
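The linked C files are not included in this thread. As a hedged sketch of what such a conversion can look like (my own code, assuming SSE2 and the standard emmintrin.h header; the function names are hypothetical), _mm_add_pd corresponds to ADDPD and _mm_add_epi64 corresponds to PADDQ:

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics: __m128d, __m128i */

/* Hypothetical helper: two double additions, the ADDPD operation. */
void add_pd(const double *a, const double *b, double *out) {
    assert(a && b && out);
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Hypothetical helper: two 64-bit integer additions, the PADDQ operation.
   Each lane wraps around modulo 2^64 on overflow. */
void add_epi64(const uint64_t *a, const uint64_t *b, uint64_t *out) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

Whether the compiler emits exactly addpd and paddq depends on the compiler and flags, which is another reason to inspect the generated assembly.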
Step 3: go and optimize some code and benchmark it
The final step, and the most important and hardest one, is of course to actually use the intrinsics to make your code fast, and then to benchmark your improvement.
Doing so will likely require you to learn a bit about the x86 microarchitecture, which I don't know myself. CPU vs IO bound will likely be one of the things that comes up: What do the terms "CPU bound" and "I/O bound" mean?
As mentioned at: https://stackoverflow.com/a/12172046/895245 this will almost inevitably involve reading Agner Fog's documentation, which appears to be better than anything Intel itself has published.
Hopefully, however, steps 1 and 2 will serve as a basis to at least experiment with functional non-performance aspects and quickly see what instructions are doing.
TODO: produce a minimal interesting example of such optimization here.
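Until a real example exists, here is a small sketch of the kind of experiment step 3 describes. It is my own illustration and not the author's example; sum_scalar, sum_sse and time_sum are hypothetical names, and an x86 target with SSE is assumed. It sums a float array with a plain loop and with SSE, and times either one with clock(). Note that the SSE version adds the elements in a different order, so for general floating-point data the two results can differ slightly in the last bits.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <xmmintrin.h>

/* Baseline: one addition per element. */
float sum_scalar(const float *x, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) total += x[i];
    return total;
}

/* SSE version: four running partial sums, combined at the end. */
float sum_sse(const float *x, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; i++) total += x[i]; /* scalar tail */
    return total;
}

/* Crude timer: CPU seconds spent on reps calls; last sum goes to *result. */
double time_sum(float (*fn)(const float *, size_t), const float *x,
                size_t n, int reps, float *result) {
    assert(reps > 0 && result);
    clock_t start = clock();
    float r = 0.0f;
    for (int k = 0; k < reps; k++) r = fn(x, n);
    *result = r;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum once with sum_scalar and once with sum_sse on a large array gives a first rough comparison. Treat the numbers with care: an optimizing compiler may vectorize or simplify the benchmark loop, and for large arrays memory bandwidth rather than the CPU can become the limit.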
You can use the SIMD-Visualiser to graphically visualize and animate the operations. It greatly helps in understanding how the data lanes are processed.