What does vectorization mean?
Is it a good idea to vectorize code? What are good practices for deciding when to do it? What happens underneath?
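Vectorization means that the compiler detects that independent instructions in your code can be executed as one SIMD instruction. The usual example is a loop that applies the same operation to every element of an array, for instance adding two arrays element by element.
Vectorized, and written in vector notation, each iteration of that loop operates on a slice of VF consecutive elements instead of on a single element.
Basically, the compiler picks one operation that can be done on VF elements of the array at the same time and does this N/VF times instead of doing the single operation N times.
It increases performance, but puts more requirements on the architecture, since the target CPU must provide SIMD registers and instructions of that width.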
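The transformation described above can be sketched in plain C. This is only an illustration under assumptions: the function names `add_scalar` and `add_chunked` and the fixed `VF` of 4 are made up for the example, and a real compiler would emit a single SIMD instruction for each chunk rather than the short inner loop used here to stand in for it.

```c
#include <stddef.h>

#define VF 4 /* assumed vector factor: elements handled per SIMD operation */

/* Scalar form: one addition per iteration, n iterations in total. */
void add_scalar(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}

/* Chunked form: n / VF iterations over VF elements each,
 * followed by a scalar loop for the n % VF leftover elements. */
void add_chunked(float *a, const float *b, size_t n) {
    size_t i = 0;
    size_t limit = n - (n % VF);
    for (; i < limit; i += VF) {
        /* This inner loop stands in for one SIMD add on a[i .. i+VF-1]. */
        for (size_t k = 0; k < VF; k++) {
            a[i + k] = a[i + k] + b[i + k];
        }
    }
    for (; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

Both functions compute the same result; the second only makes the grouping into VF-sized chunks and the remainder handling explicit.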
As mentioned above, vectorization is used to take advantage of SIMD instructions, which perform identical operations on different data packed into large registers.
A generic guideline for enabling a compiler to autovectorize a loop is to ensure that there are no flow dependencies or anti-dependencies between data elements in different iterations of the loop.
http://en.wikipedia.org/wiki/Data_dependency
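To make the two kinds of dependency concrete, here is a small sketch in C. The function names are hypothetical and chosen only for this illustration; whether a given compiler vectorizes each loop depends on the compiler and its flags.

```c
#include <stddef.h>

/* No cross-iteration dependency: a[i] depends only on b[i] and c[i],
 * so iterations can run in any order or several at once
 * (provided the arrays do not overlap). */
void independent(float *a, const float *b, const float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Flow (read-after-write) dependency: iteration i reads a[i-1],
 * which iteration i-1 has just written. The iterations form a chain
 * and cannot simply be executed side by side. */
void prefix_sum(float *a, const float *b, size_t n) {
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];
    }
}

/* Anti (write-after-read) dependency: iteration i reads a[i+1],
 * which iteration i+1 later overwrites. Reordering is only safe
 * if every read still happens before the corresponding write. */
void shift_left_add(float *a, const float *b, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        a[i] = a[i + 1] + b[i];
    }
}
```

The first loop is the easy case for an autovectorizer; the other two are the patterns the guideline warns about.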
Some compilers, such as the Intel C++/Fortran compilers, are capable of autovectorizing code. If the Intel compiler is unable to vectorize a loop, it can report why. These reports can be used to modify the code so that it becomes vectorizable (assuming that is possible).
Dependencies are covered in depth in the book 'Optimizing Compilers for Modern Architectures: A Dependence-based Approach'.
Vectorization need not be limited to a single register that can hold a large amount of data, such as a 128-bit register holding 4 x 32-bit values. It depends on architectural limitations. Some architectures have several execution units, each with registers of its own. In that case, part of the data can be fed to one execution unit and the result read back from a register belonging to that unit.
For example, consider a loop that accumulates the elements of an array into a single sum.
If I am working on an architecture that has two execution units, then my vector size is defined as two. The loop will be reframed so that each iteration handles two elements, adding each one to its own partial sum.
Since I have two execution units, the two statements inside the loop are fed to the two units. The sum is accumulated separately in each execution unit. Finally, the two accumulated values (one per execution unit) are added together.
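The reframing described above might look like the following in C. This is a sketch, not the answer's original snippet: the names `sum_scalar`, `sum_two_lanes`, `sum1` and `sum2` are assumptions made for illustration, and the odd-length tail handling is added so that the function works for any `n`.

```c
#include <stddef.h>

/* Original form: a single accumulator, one element per iteration. */
float sum_scalar(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum = sum + a[i];
    }
    return sum;
}

/* Reframed for a vector size of two: two independent partial sums,
 * one per (assumed) execution unit, combined after the loop. */
float sum_two_lanes(const float *a, size_t n) {
    float sum1 = 0.0f; /* partial sum for the first execution unit */
    float sum2 = 0.0f; /* partial sum for the second execution unit */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum1 = sum1 + a[i];
        sum2 = sum2 + a[i + 1];
    }
    if (i < n) {
        sum1 = sum1 + a[i]; /* leftover element when n is odd */
    }
    return sum1 + sum2;
}
```

Note that the two versions add the elements in a different order, so with floating-point data the results can differ in the last bits. That is one reason compilers such as GCC typically reorder a floating-point reduction like this only when relaxed floating-point options are enabled.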
Good practices are:
1. Check constraints such as dependencies between different iterations of the loop before vectorizing it.
2. Avoid function calls inside the loop.
3. Pointer accesses can alias one another, and that aliasing needs to be prevented.
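As a minimal sketch of points 2 and 3, the loop below keeps its body free of function calls and uses the C99 `restrict` qualifier to promise the compiler that the two arrays do not overlap. The function name `scale_add` and its parameters are hypothetical, and `restrict` is only one way of ruling out aliasing.

```c
#include <stddef.h>

/* dst and src are declared restrict: the caller promises that they
 * never refer to overlapping memory, so the compiler does not have to
 * assume that a store to dst[i] could change a later src[j]. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Pure arithmetic, no function calls in the loop body. */
        dst[i] = dst[i] + k * src[i];
    }
}
```

Calling `scale_add` with overlapping arrays would break the promise made by `restrict` and is undefined behavior, so the qualifier belongs only where the caller can guarantee separate buffers.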
It's SSE code generation.
You have a loop with float matrix code in it, such as matrix1[i][j] + matrix2[i][j], and the compiler generates SSE code for it.
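For illustration, here is that kind of loop next to a hand-written SSE version of it. The matrix dimensions and function names are assumptions made for this sketch. The intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` come from `<xmmintrin.h>`, and the second function is only compiled when the target supports SSE. It shows roughly what a compiler might generate, not the exact output of any particular compiler.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

#define ROWS 4
#define COLS 8 /* assumed to be a multiple of 4 for the SSE version */

/* Plain loop: the form a compiler may turn into SSE code by itself. */
void matrix_add(float result[ROWS][COLS],
                float matrix1[ROWS][COLS],
                float matrix2[ROWS][COLS]) {
    for (size_t i = 0; i < ROWS; i++) {
        for (size_t j = 0; j < COLS; j++) {
            result[i][j] = matrix1[i][j] + matrix2[i][j];
        }
    }
}

#if defined(__SSE__)
/* The same computation written with SSE intrinsics:
 * four floats are loaded, added and stored per step. */
void matrix_add_sse(float result[ROWS][COLS],
                    float matrix1[ROWS][COLS],
                    float matrix2[ROWS][COLS]) {
    for (size_t i = 0; i < ROWS; i++) {
        for (size_t j = 0; j < COLS; j += 4) {
            __m128 x = _mm_loadu_ps(&matrix1[i][j]); /* unaligned load */
            __m128 y = _mm_loadu_ps(&matrix2[i][j]);
            _mm_storeu_ps(&result[i][j], _mm_add_ps(x, y));
        }
    }
}
#endif
```

The unaligned load and store variants are used so that the sketch does not depend on how the matrices are aligned in memory.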
Maybe also have a look at libSIMDx86 (source code).
A nice, well-explained example is:
Choosing to Avoid Branches: A Small Altivec Example