Why does gcc -march=znver1 restrict uint64_t vectorization?
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native), gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
#include <stdio.h>
#include <inttypes.h>
int main() {
const size_t N = 1000;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (size_t i = 0; i < N; ++i)
arr[i] *= 5;
for (size_t i = 0; i < N; ++i)
printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}
gcc -O3 -fopt-info-vec-all -mavx2 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DI
gcc -O3 -fopt-info-vec-all -march=znver1 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
-march=znver1 includes -mavx2, so I think gcc chooses not to vectorise it for some reason:
~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mamx-bf16 [disabled]
-mamx-int8 [disabled]
-mamx-tile [disabled]
-mandroid [disabled]
-march= znver1
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [enabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [disabled]
-mavx512cd [disabled]
-mavx512dq [disabled]
-mavx512er [disabled]
-mavx512f [disabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [disabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mavxvnni [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [disabled]
-mclzero [enabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [disabled]
-mhreset [disabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mkl [disabled]
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [enabled]
-mneeded [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= 128
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [disabled]
-msahf [enabled]
-mserialize [disabled]
-msgx [disabled]
-msha [enabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [enabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtsxldtrk [disabled]
-mtune-ctrl=
-mtune= znver1
-muclibc [disabled]
-muintr [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mwidekl [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known choices for return instrumentation with -minstrument-return=:
call none nop5
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option):
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native
Known valid arguments for -mtune= option:
generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3
I also tried clang, and in both cases the loops are vectorised (by 32-byte vectors, I believe):
remark: vectorized loop (vectorization width: 4, interleaved count: 4)
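That clang remark comes from its optimization-report flags; a command along these lines should reproduce it (my invocation, not from the original post):
clang -O3 -Rpass=loop-vectorize main.c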
I'm using gcc 11.2.0
Edit: As requested by Peter Cordes, I realised I had actually been benchmarking with a multiplication by 4 for some time.
Makefile:
all:
gcc -O3 -mavx2 main.c -o 3
gcc -O3 -march=znver2 main.c -o 32
gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
gcc -O3 -march=znver1 main.c -o 31
gcc -O2 -mavx2 main.c -o 2
gcc -O2 -march=znver2 main.c -o 22
gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
gcc -O2 -march=znver1 main.c -o 21
hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
clean:
rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
Code:
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>
int main() {
const size_t N = 500;
uint64_t arr[N];
for (size_t i = 0; i < N; ++i)
arr[i] = 1;
for (int j = 0; j < 20000000; ++j)
for (size_t i = 0; i < N; ++i)
arr[i] *= 4;
srand(time(0));
printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}
N = 500, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.780 s ± 0.011 s [User: 1.778 s, System: 0.000 s]
Range (min … max): 1.763 s … 1.791 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.785 s ± 0.016 s [User: 1.783 s, System: 0.000 s]
Range (min … max): 1.773 s … 1.810 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.740 s ± 0.026 s [User: 1.737 s, System: 0.000 s]
Range (min … max): 1.724 s … 1.785 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.757 s ± 0.022 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.727 s … 1.785 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.467 s ± 0.031 s [User: 3.462 s, System: 0.000 s]
Range (min … max): 3.443 s … 3.519 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.475 s ± 0.028 s [User: 3.469 s, System: 0.001 s]
Range (min … max): 3.447 s … 3.512 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.464 s ± 0.034 s [User: 3.459 s, System: 0.001 s]
Range (min … max): 3.431 s … 3.509 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.465 s ± 0.013 s [User: 3.460 s, System: 0.001 s]
Range (min … max): 3.443 s … 3.475 s 5 runs
N = 500, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.789 s ± 0.004 s [User: 1.786 s, System: 0.001 s]
Range (min … max): 1.783 s … 1.793 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.772 s ± 0.017 s [User: 1.769 s, System: 0.000 s]
Range (min … max): 1.755 s … 1.800 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.911 s ± 0.023 s [User: 2.907 s, System: 0.001 s]
Range (min … max): 2.880 s … 2.943 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 2.924 s ± 0.013 s [User: 2.921 s, System: 0.000 s]
Range (min … max): 2.906 s … 2.934 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.850 s ± 0.029 s [User: 3.846 s, System: 0.000 s]
Range (min … max): 3.823 s … 3.896 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.816 s ± 0.036 s [User: 3.812 s, System: 0.000 s]
Range (min … max): 3.777 s … 3.855 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.813 s ± 0.026 s [User: 3.809 s, System: 0.000 s]
Range (min … max): 3.780 s … 3.834 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.783 s ± 0.010 s [User: 3.779 s, System: 0.000 s]
Range (min … max): 3.773 s … 3.798 s 5 runs
N = 512, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.849 s ± 0.015 s [User: 1.847 s, System: 0.000 s]
Range (min … max): 1.831 s … 1.873 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.846 s ± 0.013 s [User: 1.844 s, System: 0.001 s]
Range (min … max): 1.832 s … 1.860 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.756 s ± 0.012 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.744 s … 1.771 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.788 s ± 0.012 s [User: 1.785 s, System: 0.001 s]
Range (min … max): 1.774 s … 1.801 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.476 s ± 0.015 s [User: 3.472 s, System: 0.001 s]
Range (min … max): 3.458 s … 3.494 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.449 s ± 0.002 s [User: 3.446 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.452 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.456 s ± 0.007 s [User: 3.453 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.462 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.547 s ± 0.044 s [User: 3.542 s, System: 0.001 s]
Range (min … max): 3.482 s … 3.600 s 5 runs
N = 512, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.847 s ± 0.013 s [User: 1.845 s, System: 0.000 s]
Range (min … max): 1.836 s … 1.863 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.830 s ± 0.007 s [User: 1.827 s, System: 0.001 s]
Range (min … max): 1.820 s … 1.837 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.983 s ± 0.017 s [User: 2.980 s, System: 0.000 s]
Range (min … max): 2.966 s … 3.012 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 3.026 s ± 0.039 s [User: 3.021 s, System: 0.001 s]
Range (min … max): 2.989 s … 3.089 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 4.000 s ± 0.021 s [User: 3.994 s, System: 0.001 s]
Range (min … max): 3.982 s … 4.035 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.940 s ± 0.041 s [User: 3.934 s, System: 0.001 s]
Range (min … max): 3.890 s … 3.981 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.928 s ± 0.032 s [User: 3.922 s, System: 0.001 s]
Range (min … max): 3.898 s … 3.979 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.908 s ± 0.029 s [User: 3.904 s, System: 0.000 s]
Range (min … max): 3.879 s … 3.954 s 5 runs
I think the run where -O2 -march=znver1 was the same speed as -O3 -march=znver1 was a mistake on my part with the naming of the files; I had not created the makefile yet at that point and was using my shell's history.
1 Answer
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because 128 bits is the full native width of the hardware. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be OK; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider vectors.
Also, mov-elimination for vmovdqa ymm0, ymm1 only works on the low 128-bit half on Zen1. And normally, using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider raising the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there (pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes), at least on CPUs where that's fine: znver1, but not, for example, bdver2, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a 16-byte vmovdqu [rdx], xmm0 store: https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement a qword *= 5 as (v<<2) + v, vs. doing it with a scalar integer LEA instruction. Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes the operation is actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen inside a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just run sequences of optimization passes that don't always consider the big picture and all the details at the same time.
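For reference, here's roughly what that general-case emulation looks like with SSE2 intrinsics (a sketch of mine, not GCC's actual expansion; mul_epu64 is a made-up name):
#include <emmintrin.h>  // SSE2

// 64x64 => 64-bit multiply built from 32x32 => 64-bit pmuludq:
// a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32)
static __m128i mul_epu64(__m128i a, __m128i b) {
    __m128i lo    = _mm_mul_epu32(a, b);              // a_lo * b_lo
    __m128i cross = _mm_add_epi64(
        _mm_mul_epu32(_mm_srli_epi64(a, 32), b),      // a_hi * b_lo
        _mm_mul_epu32(a, _mm_srli_epi64(b, 32)));     // a_lo * b_hi
    return _mm_add_epi64(lo, _mm_slli_epi64(cross, 32));
}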
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression.
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64_t in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
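To make that concrete, a minimal sketch of the shift+add strategy with SSE2 intrinsics (my illustration, not GCC's output; mul5_sse is a hypothetical name and n is assumed even). Per iteration: load + vpsllq + vpaddq + store + pointer increment + macro-fused cmp/jne, roughly 6 uops for 2 elements:
#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

// *= 5 computed as (v << 2) + v, two uint64_t per 16-byte vector.
void mul5_sse(uint64_t *arr, size_t n) {   // n must be a multiple of 2
    for (size_t i = 0; i < n; i += 2) {
        __m128i v  = _mm_loadu_si128((__m128i *)&arr[i]);
        __m128i v4 = _mm_slli_epi64(v, 2);               // v << 2
        _mm_storeu_si128((__m128i *)&arr[i],
                         _mm_add_epi64(v4, v));          // (v<<2) + v
    }
}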
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC 9, 10, or 11, or current trunk. As you say, we do with -march=znver2: https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128-bit or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost model decide not to vectorize, or whether it's just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost model isn't looking as far as the final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
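For experimenting, here are the loop-body variants discussed above collected into one compilable file (function names are mine; the vectorization outcomes in the comments are as observed in the Godbolt links, at -O3 -march=znver1 with GCC 9-11):
#include <stdint.h>
#include <stddef.h>

void mul5(uint64_t *a, size_t n)     { for (size_t i = 0; i < n; ++i) a[i] *= 5; }           // not vectorized
void shl2add(uint64_t *a, size_t n)  { for (size_t i = 0; i < n; ++i) a[i] += a[i] << 2; }   // same value, still not vectorized
void dbl(uint64_t *a, size_t n)      { for (size_t i = 0; i < n; ++i) a[i] += a[i]; }        // not vectorized
void shl1(uint64_t *a, size_t n)     { for (size_t i = 0; i < n; ++i) a[i] <<= 1; }          // vectorized, same thing as dbl
void shl2_add123(uint64_t *a, size_t n) { for (size_t i = 0; i < n; ++i) { a[i] <<= 2; a[i] += 123; } } // vectorized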
With -march=znver1 -mprefer-vector-width=256, doing the store as two 16-byte halves with vmovups xmm / vextracti128 is the same issue as in "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?": znver1 implies -mavx256-split-unaligned-store (which affects every store when GCC doesn't know for sure that it is aligned, so it costs extra instructions even when the data does happen to be aligned).
znver1 doesn't imply -mavx256-split-unaligned-load, though, so GCC is willing to fold loads as memory source operands into ALU operations in code where that's useful.
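So if you know your buffers are aligned (or just want single 32-byte stores anyway), overriding that tuning default should be possible with something like:
gcc -O3 -march=znver1 -mprefer-vector-width=256 -mno-avx256-split-unaligned-store main.c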