Why is ARM NEON not faster than plain C++?

Posted on 2024-11-02 14:47:50

Here is the C++ code:

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    for ( register int i = 0; i < ARR_SIZE_TEST; ++i )
    {
        x[ i ] = x[ i ] + y[ i ];
    }
}

Here is the NEON version:

void neon_assm_tst_add( unsigned* x, unsigned* y )
{
    register unsigned i = ARR_SIZE_TEST >> 2;

    __asm__ __volatile__
    (
        ".loop1:                            \n\t"

        "vld1.32   {q0}, [%[x]]             \n\t"
        "vld1.32   {q1}, [%[y]]!            \n\t"

        "vadd.i32  q0 ,q0, q1               \n\t"
        "vst1.32   {q0}, [%[x]]!            \n\t"

        "subs     %[i], %[i], $1            \n\t"
        "bne      .loop1                    \n\t"

        : [x]"+r"(x), [y]"+r"(y), [i]"+r"(i)
        :
        : "memory", "cc", "q0", "q1"
    );
}

Test function:

void bench_simple_types_test( )
{
    unsigned* a = new unsigned [ ARR_SIZE_TEST ];
    unsigned* b = new unsigned [ ARR_SIZE_TEST ];

    cpp_tst_add( a, b );
    neon_assm_tst_add( a, b );
}

I tested both variants, and here is the report:

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 185 ms // SLOW!!!

I also tested other types:

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms // FASTER X3!

THE QUESTION:
Why is NEON slower with 32-bit integer types?

I used the latest version of GCC from the Android NDK, with the NEON optimization flags turned on.
Here is the disassembled C++ version:

                 MOVS            R3, #0
                 PUSH            {R4}

 loc_8
                 LDR             R4, [R0,R3]
                 LDR             R2, [R1,R3]
                 ADDS            R2, R4, R2
                 STR             R2, [R0,R3]
                 ADDS            R3, #4
                 CMP.W           R3, #0x2000000
                 BNE             loc_8
                 POP             {R4}
                 BX              LR

Here is the disassembled NEON version:

                 MOV.W           R3, #0x200000
.loop1
                 VLD1.32         {D0-D1}, [R0]
                 VLD1.32         {D2-D3}, [R1]!
                 VADD.I32        Q0, Q0, Q1
                 VST1.32         {D0-D1}, [R0]!
                 SUBS            R3, #1
                 BNE             .loop1
                 BX              LR

Here are all the benchmark results:

add, char,     C++       : 83  ms
add, char,     neon asm  : 46  ms FASTER x2

add, short,    C++       : 114 ms
add, short,    neon asm  : 92  ms FASTER x1.25

add, unsigned, C++       : 176 ms
add, unsigned, neon asm  : 184 ms SLOWER!!!

add, float,    C++       : 571 ms
add, float,    neon asm  : 184 ms FASTER x3

add, double,   C++       : 533 ms
add, double,   neon asm  : 420 ms FASTER x1.25

THE QUESTION:
Why is NEON slower with 32-bit integer types?

独自←快乐 2024-11-09 14:47:51

The NEON pipeline on Cortex-A8 executes in order and has limited hit-under-miss (no renaming), so you're limited by memory latency (your working set is larger than the L1/L2 caches). Your code depends immediately on the values loaded from memory, so it will stall constantly while waiting for memory. This would explain why the NEON code is slightly (by a tiny amount) slower than the non-NEON code.

You need to unroll the assembly loop and increase the distance between load and use, e.g.:

vld1.32   {q0}, [%[x]]!
vld1.32   {q1}, [%[y]]!
vld1.32   {q2}, [%[x]]!
vld1.32   {q3}, [%[y]]!
vadd.i32  q0 ,q0, q1
vadd.i32  q2 ,q2, q3
...

There are plenty of NEON registers, so you can unroll it a lot. Integer code suffers from the same issue, but to a lesser extent, because the A8 integer pipeline has better hit-under-miss instead of stalling. With benchmarks this large compared to the L1/L2 caches, the bottleneck is going to be memory bandwidth/latency. You might also want to run the benchmark at smaller sizes (4 KB to 256 KB) to see the effect when the data is cached entirely in L1 and/or L2.
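To make the size-sweep suggestion concrete, here is a minimal portable C++ sketch. The names `add_arrays`, `ns_per_element`, and `sweep_sizes` are hypothetical helpers, not part of the original code, and the timing is deliberately rough (no warm-up pass, no CPU pinning), so treat the numbers as indicative only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Plain C++ element-wise add, same shape as cpp_tst_add in the question.
void add_arrays(unsigned* x, const unsigned* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        x[i] += y[i];
}

// Hypothetical helper: run `reps` passes over two arrays of `bytes` bytes
// each and return the average time per element in nanoseconds.
double ns_per_element(std::size_t bytes, int reps)
{
    const std::size_t n = bytes / sizeof(unsigned);
    assert(n > 0 && reps > 0);
    std::vector<unsigned> x(n, 1u), y(n, 2u);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        add_arrays(x.data(), y.data(), n);
    const auto stop = std::chrono::steady_clock::now();

    // Read one result so the compiler cannot discard the loop entirely.
    volatile unsigned sink = x[n / 2];
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(n) * reps);
}

// Sweep from 4 KB to 32 MB per array, doubling the size each step.
void sweep_sizes()
{
    for (std::size_t bytes = 4 * 1024; bytes <= 32u * 1024 * 1024; bytes *= 2)
        std::printf("%8zu KB: %.3f ns/element\n", bytes / 1024,
                    ns_per_element(bytes, 8));
}
```

Calling `sweep_sizes()` prints one line per size. If the loop is memory-bound at the large sizes, the time per element should step up roughly where the two arrays stop fitting in L1 and then in L2.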

夜清冷一曲。 2024-11-09 14:47:51

Although you're limited by main-memory latency in this case, it's not exactly obvious why the NEON version would be slower than the compiler-generated C++ version.

Using the cycle calculator here:

http://pulsar.webshaker.net/ccc/result.php?lng=en

Your code should take 7 cycles before the cache miss penalties. It's slower than you may expect because you're using unaligned loads and because of latency between the add and the store.

Meanwhile, the compiler-generated loop takes 6 cycles (it's not very well scheduled or optimized in general, either), but it does only one fourth as much work per iteration.

The cycle counts from the script might not be perfect, but I don't see anything that looks blatantly wrong with it so I think they'd at least be close. There's potential for taking an extra cycle on the branch if you max out fetch bandwidth (also if the loops aren't 64-bit aligned), but in this case there are plenty of stalls to hide that.
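The arithmetic behind those cycle counts can be written out as a small compile-time sketch. It uses only the figures quoted above (7 and 6 cycles per iteration, 8M elements) plus the 184 ms measurement from the question. The function names are made up for illustration, and the byte count is a lower bound that ignores any extra traffic caused by the cache's write policy.

```cpp
#include <cstdint>

constexpr std::uint64_t kElements = 8ull * 1024 * 1024;  // ARR_SIZE_TEST

// Cycle estimates quoted above, before any cache-miss penalty.
constexpr std::uint64_t kNeonCyclesPerIter = 7;    // 4 elements per iteration
constexpr std::uint64_t kScalarCyclesPerIter = 6;  // 1 element per iteration

constexpr std::uint64_t neon_core_cycles()
{
    return (kElements / 4) * kNeonCyclesPerIter;
}

constexpr std::uint64_t scalar_core_cycles()
{
    return kElements * kScalarCyclesPerIter;
}

// Bytes that must move at a minimum: read x, read y, write x (4 bytes each).
constexpr std::uint64_t min_bytes_moved()
{
    return kElements * 4 * 3;
}

// Effective throughput in MiB/s for a measured run time in milliseconds.
constexpr double mib_per_second(double elapsed_ms)
{
    return (min_bytes_moved() / (1024.0 * 1024.0)) / (elapsed_ms / 1000.0);
}

static_assert(neon_core_cycles() == 14680064, "2M iterations * 7 cycles");
static_assert(scalar_core_cycles() == 50331648, "8M iterations * 6 cycles");
static_assert(min_bytes_moved() == 100663296, "96 MiB");
```

On core cycles alone, the NEON loop would need about 3.4 times fewer cycles than the scalar loop (24/7), yet both measure at roughly the same 176 to 185 ms. `mib_per_second(184.0)` comes out at about 522 MiB/s for both, which is consistent with the loops waiting on memory and not on the execution units.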

The answer isn't that integer code on Cortex-A8 has more opportunities to hide latency. In fact, it normally has fewer, because of NEON's staggered pipeline and issue queue. Of course, this is only true on Cortex-A8. On Cortex-A9 the situation may well be reversed (NEON is dispatched in order and in parallel with integer, while integer has out-of-order capabilities). Since you tagged this Cortex-A8, I'm assuming that's what you're using.

This calls for more investigation. Here are some ideas about why this could be happening:

- You're not specifying any alignment for your arrays, and while I expect new to align to 8 bytes, it might not align to 16 bytes. Suppose your arrays really aren't 16-byte aligned. Then some accesses would be split across cache lines, which could carry an additional penalty (especially on misses).
- A cache miss happens right after a store. I don't believe Cortex-A8 has any memory disambiguation, so it must assume that the load could be from the same line as the store, which requires the write buffer to drain before the load that misses in L2 can proceed. Because the pipeline distance between NEON loads (initiated in the integer pipeline) and stores (initiated at the end of the NEON pipeline) is much bigger than for integer ones, the stall could be longer.
- Because you're loading 16 bytes per access instead of 4, the critical-word size is larger, so the effective latency of a critical-word-first line fill from main memory will be higher. (L2 to L1 is supposed to be on a 128-bit bus, so it shouldn't have the same problem.)

You asked what good NEON is in cases like this. In reality, NEON is especially good for cases where you're streaming to/from memory. The trick is to use preloading to hide as much of the main-memory latency as possible. Preload brings memory into the L2 (not L1) cache ahead of time. Here NEON has a big advantage over integer code because it can hide much of the L2 cache latency, thanks to its staggered pipeline and issue queue, and also because it has a direct path to L2. I'd expect you to see an effective L2 latency of 0-6 cycles, and less if you have fewer dependencies and don't exhaust the load queue, while integer code can be stuck with a good ~16 cycles that you can't avoid (this probably depends on the particular Cortex-A8, though).

So I would recommend that you align your arrays to the cache-line size (64 bytes), unroll your loops to process at least one cache line per iteration, use aligned loads/stores (put :128 after the address), and add a pld instruction that preloads several cache lines ahead. As for how many lines ahead: start small and keep increasing the distance until you no longer see any benefit.
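The shape of that recommendation can be sketched in portable C++, without NEON instructions, so that it also compiles on a desktop. This is an illustration under assumptions, not a tuned implementation: `add_prefetched` and `is_line_aligned` are hypothetical names, `__builtin_prefetch` is a GCC/Clang builtin standing in for pld, and the starting distance of 4 lines is an arbitrary value to be tuned by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;                            // bytes, as stated above
constexpr std::size_t kPerLine = kCacheLine / sizeof(unsigned);   // elements per cache line
constexpr std::size_t kLinesAhead = 4;                            // assumed; tune by measurement

// True if p sits on a cache-line boundary.
bool is_line_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// Adds y into x one cache line per outer iteration, hinting the hardware
// to fetch data kLinesAhead lines in front of the current position.
void add_prefetched(unsigned* x, const unsigned* y, std::size_t n)
{
    assert(is_line_aligned(x) && is_line_aligned(y));
    assert(n % kPerLine == 0);

    const std::size_t ahead = kLinesAhead * kPerLine;
    for (std::size_t i = 0; i < n; i += kPerLine) {
        if (i + ahead < n) {                 // stay inside the arrays
            __builtin_prefetch(x + i + ahead);
            __builtin_prefetch(y + i + ahead);
        }
        // Fixed trip count: the compiler can unroll or vectorize this.
        for (std::size_t j = 0; j < kPerLine; ++j)
            x[i + j] += y[i + j];
    }
}
```

The caller has to supply 64-byte-aligned buffers, for example arrays declared with `alignas(64)`. The same structure carries over to the inline assembly version: one cache line of loads, adds, and stores per iteration, with a pld a few lines ahead.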

沉睡月亮 2024-11-09 14:47:51

Your C++ code isn't optimized either.

#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )

void cpp_tst_add( unsigned* x, unsigned* y )
{
    unsigned int i = ARR_SIZE_TEST;
    do
    {
        *x++ += *y++;
    } while ( --i );
}

This version takes 2 fewer cycles per iteration.

Besides, your benchmark results don't surprise me at all.

32-bit:

This function is too simple for NEON. There aren't enough arithmetic operations to leave any room for optimization.

It's so simple that both the C++ and the NEON version suffer pipeline hazards on almost every iteration, with no real chance of benefiting from the dual-issue capability.

While the NEON version benefits from processing 4 integers at once, it also suffers much more from every hazard. That's all.

8-bit:

ARM is very slow at reading individual bytes from memory.
So while NEON shows the same characteristics as with 32-bit, ARM lags heavily.

16-bit:
The same applies here, except that ARM's 16-bit reads aren't that bad.

float:
The C++ version compiles to VFP code. Cortex-A8 doesn't have a full VFP, only VFP Lite, which isn't pipelined at all.

It's not that NEON behaves strangely when processing 32-bit data. It's just that the ARM version meets ideal conditions there.
Your function is poorly suited to benchmarking because it is so simple. Try something more complex, such as YUV-to-RGB conversion.

FYI, my fully optimized NEON version runs roughly 20 times as fast as my fully optimized C version and 8 times as fast as my fully optimized ARM assembly version.
I hope that gives you some idea of how powerful NEON can be.

Last but not least, the ARM instruction PLD is NEON's best friend. Placed properly, it can bring a performance boost of at least 40%.

人间不值得 2024-11-09 14:47:51

You can try a few modifications to improve the code.

If you can:
- use a third buffer to store the results.
- try to align the data to 16 bytes (the :128 qualifier used below asserts 128-bit alignment).

The code should look something like this (sorry, I don't know the GCC inline syntax):

.loop1:
 vld1.32   {q0}, [%[x]:128]!
 vld1.32   {q1}, [%[y]:128]!
 vadd.i32  q0 ,q0, q1
 vst1.32   {q0}, [%[z]:128]!
 subs     %[i], %[i], $1
bne      .loop1

As Exophase says, you have some pipeline latency.
Maybe you can try:

vld1.32   {q0}, [%[x]:128]!
vld1.32   {q1}, [%[y]:128]!

sub     %[i], %[i], $1

.loop1:
vadd.i32  q2 ,q0, q1

vld1.32   {q0}, [%[x]:128]!
vld1.32   {q1}, [%[y]:128]!

vst1.32   {q2}, [%[z]:128]!
subs     %[i], %[i], $1
bne      .loop1

vadd.i32  q2 ,q0, q1
vst1.32   {q2}, [%[z]:128]!

Finally, it is clear that you will saturate the memory bandwidth.

You can try adding a small

pld [%[x], #192]

to your loop.

Tell us if it's better.
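The reordered loop above is a form of software pipelining: the loads for the next iteration are issued before the store of the current result. Here is the same structure in plain C++, as a sketch only. `add_pipelined` is a hypothetical name, it works on single elements where the assembly works on q registers, and an optimizing compiler is free to reschedule it, so it shows the prologue/loop/epilogue shape and not guaranteed instruction order.

```cpp
#include <cassert>
#include <cstddef>

// z[i] = x[i] + y[i], with the loads running one iteration ahead of the add.
void add_pipelined(const unsigned* x, const unsigned* y, unsigned* z,
                   std::size_t n)
{
    assert(n > 0);

    // Prologue: load the first pair before entering the loop.
    unsigned a = x[0];
    unsigned b = y[0];

    for (std::size_t i = 1; i < n; ++i) {
        const unsigned sum = a + b;  // add the values loaded last iteration
        a = x[i];                    // issue the next loads ...
        b = y[i];
        z[i - 1] = sum;              // ... before storing the previous sum
    }

    // Epilogue: the last loaded pair still has to be added and stored.
    z[n - 1] = a + b;
}
```

The loop runs n - 1 times because the prologue has already consumed one pair of loads. That is the same reason the assembly decrements the counter once before the loop label.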

恰似旧人归 2024-11-09 14:47:51

A difference of 8 ms is so small that you are probably measuring artifacts of the caches or pipelines.

EDIT: Did you try comparing against something like this for types such as float and short? I'd expect the compiler to optimize it even better and narrow the gap. Also, your test runs the C++ version first and then the ASM version, which can affect the measurements, so I'd write two separate programs to be fair.

for ( register int i = 0; i < ARR_SIZE_TEST; i += 4 )
{
    x[ i ] = x[ i ] + y[ i ];
    x[ i+1 ] = x[ i+1 ] + y[ i+1 ];
    x[ i+2 ] = x[ i+2 ] + y[ i+2 ];
    x[ i+3 ] = x[ i+3 ] + y[ i+3 ];
}

One last thing: in a function signature, unsigned* and unsigned[] declare the same pointer type, so neither tells the compiler that the arrays do not overlap, and it must stay conservative about reordering accesses. Use the restrict keyword (spelled __restrict in GCC's C++ mode) to promise that the pointers do not alias.
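The manual unrolling and the aliasing hint can be combined in one function, sketched below. `add_unrolled` is a hypothetical name, and `__restrict` is a compiler extension (accepted by GCC, Clang, and MSVC) because standard C++ has no restrict keyword. The index advances by four per iteration so that each element is processed exactly once.

```cpp
#include <cassert>
#include <cstddef>

// Element-wise x += y, unrolled by four. __restrict promises the compiler
// that x and y never overlap, so it may reorder the loads and stores.
void add_unrolled(unsigned* __restrict x, const unsigned* __restrict y,
                  std::size_t n)
{
    assert(n % 4 == 0);  // the sketch has no tail loop for leftovers
    for (std::size_t i = 0; i < n; i += 4) {
        x[i]     += y[i];
        x[i + 1] += y[i + 1];
        x[i + 2] += y[i + 2];
        x[i + 3] += y[i + 3];
    }
}
```

Passing overlapping buffers to this function breaks the promise and is undefined behavior, so it is only safe when the two arrays are known to be distinct, as they are in the benchmark.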
