What is the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions:

  1. Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be any hidden costs, like longer execution time in some specific situation?

  2. These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space on several instructions which do the same thing?

Comments (3)

游魂 2024-09-07 20:03:04

    1. Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be any hidden costs, like longer execution time in some specific situation?

    Yes, there can be performance reasons to choose one vs. the other.

    1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.

    See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.

    "However, the delays for passing data between the different domains or different types of registers are smaller on the Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero." -- Agner Fog's microarch doc

    Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
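
    As a rough illustration of the domain-crossing point above (a minimal sketch added here, not part of the original answer; the function names are made up), both routines below OR a mask into a float vector and feed the result to an FP add. The first keeps the OR in the FP domain with _mm_or_ps; the second routes it through the integer domain with _mm_or_si128, which is where a bypass delay could show up on CPUs that run orps and por in different domains.

    #include <emmintrin.h>   /* SSE2: _mm_or_si128 and the cast intrinsics */

    /* FP-domain version: the orps result feeds addps directly. */
    static __m128 or_then_add_fp(__m128 x, __m128 y, __m128 mask)
    {
        __m128 t = _mm_or_ps(x, mask);
        return _mm_add_ps(t, y);
    }

    /* Integer-domain version: por does the OR. On microarchitectures where
       orps and por run in different domains, the value may pay a forwarding
       delay on its way back into addps. The casts compile to no instructions. */
    static __m128 or_then_add_int(__m128 x, __m128 y, __m128 mask)
    {
        __m128i t = _mm_or_si128(_mm_castps_si128(x), _mm_castps_si128(mask));
        return _mm_add_ps(_mm_castsi128_ps(t), y);
    }

    Whether the second version actually pays that penalty depends on the CPU, and a compiler is free to substitute one instruction for the other, as discussed under "How to choose wisely" below.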

    2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding (not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better: it improves code density in the I-cache, code fetch from RAM, and packing into the uop cache.

    3: Recent Intel CPUs can only run the FP versions on port5.

    • Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)

    • Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.

    • Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.

    Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.

    How to choose wisely:

    Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.

    If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.

    AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.

    If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).

    There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it. (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)
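
    To make that cast clutter concrete, here is a minimal sketch (the helper names are mine, not from the answer) that flips the sign bit of packed doubles two ways: once with the type-matching _mm_xor_pd, and once with _mm_xor_ps through _mm_castpd_ps / _mm_castps_pd, which is what treating doubles as singles looks like at the source level.

    #include <emmintrin.h>   /* SSE2 */

    /* Sign-bit mask for two packed doubles: 0x8000000000000000 in each lane. */
    static inline __m128d sign_mask_pd(void) { return _mm_set1_pd(-0.0); }

    /* Readable version: the intrinsic type matches the data type. */
    static inline __m128d negate_pd(__m128d v)
    {
        return _mm_xor_pd(v, sign_mask_pd());
    }

    /* "...ps on double data" version: same bitwise result, but the code now
       casts __m128d <-> __m128 just to reach the single-precision intrinsic.
       Without AVX, xorps encodes one byte shorter than xorpd; a compiler may
       make that substitution on its own anyway. */
    static inline __m128d negate_pd_via_ps(__m128d v)
    {
        __m128 v_ps    = _mm_castpd_ps(v);
        __m128 mask_ps = _mm_castpd_ps(sign_mask_pd());
        return _mm_castps_pd(_mm_xor_ps(v_ps, mask_ps));
    }

    In intrinsics code the readable version is usually the better trade, exactly as argued above; the one-byte saving only really matters when writing asm by hand.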

    If tuning at the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache-line density and/or the decoders, but with prefixes and addressing modes you can extend instructions in general.)

    For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)

    2. These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space on several instructions which do the same thing?

    If you look at the history of these instruction sets, you can kind of see how we got here.

    por  (MMX):     0F EB /r
    orps (SSE):     0F 56 /r
    orpd (SSE2): 66 0F 56 /r
    por  (SSE2): 66 0F EB /r
    

    MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.

    They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.


棒棒糖 2024-09-07 20:03:04

    According to Intel and AMD optimization guidelines, mixing op types with data types produces a performance hit because the CPU internally tags the 64-bit halves of the register for a particular data type. This seems to mostly affect pipelining, as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encodings and take up more space in the code segment. So if code size is a problem, use the old ops, as these have a smaller encoding.

瞄了个咪的 2024-09-07 20:03:04

    I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.

    /* Note: the %vld / %vf vector printf conversions and the direct casts
       between __m128 and __m128i are compiler extensions (they work with the
       GCC build used below), not portable ISO C. */
    #include <stdio.h>
    #include <emmintrin.h>
    #include <pmmintrin.h>
    #include <xmmintrin.h>
    
    int main(void)
    {
        __m128i a = _mm_set1_epi32(1);
        __m128i b = _mm_set1_epi32(2);
        __m128i c = _mm_or_si128(a, b);      /* integer OR */
    
        __m128 x = _mm_set1_ps(1.25f);
        __m128 y = _mm_set1_ps(1.5f);
        __m128 z = _mm_or_ps(x, y);          /* FP OR */
            
        printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
        printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
    
        /* swap the intrinsics: the bitwise results are the same */
        c = (__m128i)_mm_or_ps((__m128)a, (__m128)b);
        z = (__m128)_mm_or_si128((__m128i)x, (__m128i)y);
    
        printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
        printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
        
        return 0;
    }
    

    Terminal:

    $ gcc -Wall -msse3 por.c -o por
    $ ./por
    
    a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
    x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
    a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
    x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
    