The v4 series of the gcc compiler can automatically vectorize loops using the SIMD units on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?
The original page offers details on getting gcc to automatically vectorize loops, including a few examples:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
While the examples are great, the syntax for calling those options seems to have changed a bit in the latest GCC.
In summary, the following options will work for x86 chips with SSE2,
giving a log of loops that have been vectorized:
Note that -msse is also a possibility, but it will only vectorize loops using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code, use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)

Modern versions of GCC enable -ftree-vectorize at -O3, so just use -O3 in GCC 4.x and later. (Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)

Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here. Comments? -- Author
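To make the basic options above concrete, here is a minimal sketch of a loop that GCC should be able to auto-vectorize at -O3. The function name saxpy and the file name in the comment are hypothetical, chosen only for illustration. The command line in the comment is an assumption rather than the answer's original command: recent GCC releases report vectorized loops with -fopt-info-vec, while the older 4.x releases used -ftree-vectorizer-verbose=N.

```c
#include <stddef.h>

/*
 * Assumed build command (hypothetical file name):
 *   gcc -O3 -msse2 -fopt-info-vec -c saxpy.c
 * For 32-bit x86 code, add -mfpmath=sse as noted above.
 *
 * 'restrict' promises the compiler that x and y do not overlap,
 * so it does not need a runtime aliasing check before vectorizing.
 */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    /* Each iteration is independent of the others, which is the
       simplest case for the vectorizer. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

If the loop is vectorized, GCC prints a note for it at compile time; the exact wording of that note differs between GCC versions.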
To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.

Reduction loops (like the sum of an array) will need OpenMP or -ffast-math to treat FP math as associative and vectorize. Example on the Godbolt compiler explorer with -O3 -march=native -ffast-math, including a reduction (array sum) which is scalar without -ffast-math. (Well, GCC 8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)

Sometimes you don't need -ffast-math; just -fno-math-errno can help gcc inline math functions and vectorize something involving sqrt and/or rint / nearbyint.

Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc.) and/or profile-guided optimization with -fprofile-generate / test run(s) with realistic input(s) / -fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.
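The reduction case described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the Godbolt link: the function names sum_array and sum_array_omp are hypothetical, and the OpenMP variant assumes you build with -fopenmp or -fopenmp-simd so that the pragma is honored (without either flag it is ignored and the loop behaves like the plain version).

```c
#include <stddef.h>

/*
 * Plain FP reduction. Every add depends on the previous one, and
 * vectorizing would reorder the additions, which can change rounding.
 * GCC therefore keeps this scalar at -O3 unless -ffast-math (or
 * -fassociative-math and its prerequisites) allows the reordering.
 */
float sum_array(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/*
 * Same reduction, but the pragma grants permission to reorder the
 * adds for this loop only, without enabling fast-math globally.
 * Assumed build: gcc -O3 -march=native -fopenmp-simd -c sum.c
 */
float sum_array_omp(const float *a, size_t n)
{
    float total = 0.0f;
    #pragma omp simd reduction(+:total)
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

The pragma is the narrower tool: it scopes the relaxed ordering to one loop, whereas -ffast-math changes floating-point semantics for the whole translation unit.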
There is a GIMPLE (an intermediate representation of GCC) pass, pass_vectorize. This pass enables auto-vectorization at the GIMPLE level.

To enable auto-vectorization in a port (GCC 4.4.0), we need the following steps:

1. Set the vector width for the target through the macro UNITS_PER_SIMD_WORD.
2. Define the vector modes in <target>-modes.def. This file has to reside in the directory that holds the other machine description files (as per the configuration script; if you can change the script, you can place the file in whatever directory you want).
3. Specify the modes that are to be considered for vectorization on the target architecture: for example, four words will constitute a vector, or eight half-words, or two double-words. These details need to be given in the <target>-modes.def file. For example:

    VECTOR_MODES (INT, 8);    /* V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);   /* V16QI V8HI V4SI V2DI */
    VECTOR_MODES (FLOAT, 8);  /* V4HF V2SF */

4. Build the port. Vectorization can then be enabled with the command-line options -O2 -ftree-vectorize.
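For readers unfamiliar with mode names such as V4SI, the following sketch shows how they map to user-level types through GCC's vector_size attribute. This is user code, not part of a port, and the typedef and function names are hypothetical. It assumes a target with 32-bit int and 16-bit short, where a 16-byte vector of int corresponds to V4SI and a 16-byte vector of short to V8HI.

```c
/* 16 bytes of 32-bit ints: four lanes, corresponding to mode V4SI. */
typedef int v4si __attribute__((vector_size(16)));

/* 16 bytes of 16-bit shorts: eight lanes, corresponding to mode V8HI. */
typedef short v8hi __attribute__((vector_size(16)));

/* Arithmetic operators work lane by lane on these types; on a target
   that supports the mode, GCC can emit a single SIMD add for each. */
v4si add_v4si(v4si a, v4si b)
{
    return a + b;
}

v8hi add_v8hi(v8hi a, v8hi b)
{
    return a + b;
}
```

If the target does not support a given vector mode, GCC still accepts the code and lowers the operation to narrower or scalar instructions.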