NEON optimization using intrinsics



While learning about ARM NEON intrinsics, I timed a function that I wrote to double the elements in an array. The version that uses the intrinsics takes more time than the plain C version of the function.

Without NEON:

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int loop;
        for (loop = 0; loop < size; loop++)
            ptr[loop] <<= 1;
        return;
    }

With NEON:

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t Q0;
        for (i = 0; i < size / 4; i++)   /* "SIZE" in the original post was a typo for size (see update) */
        {
            Q0 = vld1q_u32(ptr);      /* load 4 elements */
            Q0 = vaddq_u32(Q0, Q0);   /* double them */
            vst1q_u32(ptr, Q0);       /* store back */
            ptr += 4;
        }
        return;
    }

I'm wondering if the load/store operations between the array and the vector registers are consuming more time, offsetting the benefit of the parallel addition.

UPDATE: More info in response to Igor's reply.
1. The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section (label) L7 in both assembly listings, I see that the NEON version has a greater number of assembly instructions (hence more time taken?).
2. I compiled using -mfpu=neon on arm-gcc, with no other flags or optimizations. For the plain version, no compiler flags at all.
3. That was a typo; SIZE was meant to be size. Both are the same.
4, 5. Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call. NEON = 230us, ordinary = 155us.
6. Yes, I printed the elements in each case.
7. Did this; no improvement whatsoever.
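
For context, a minimal sketch of the timing harness described in points 4 and 5 above (my reconstruction with hypothetical names, not the actual test code), calling gettimeofday() around the function call:

    #include <stdio.h>
    #include <sys/time.h>

    void double_elements(unsigned int *ptr, unsigned int size);

    #define N 4000

    int main(void)
    {
        static unsigned int data[N];
        unsigned int i;
        struct timeval t0, t1;
        long us;

        for (i = 0; i < N; i++)
            data[i] = i;

        gettimeofday(&t0, NULL);
        double_elements(data, N);
        gettimeofday(&t1, NULL);

        /* elapsed wall-clock time in microseconds */
        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("double_elements took %ld us\n", us);
        return 0;
    }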


Comments (3)

一场信仰旅途 2024-11-09 06:57:26


Something like this might run a bit faster.

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t Q0, Q1, Q2, Q3;

        /* Each iteration loads, doubles and stores 16 elements, with the
           loads, adds and stores interleaved to keep the pipelines busy. */
        for (i = 0; i < size / 16; i++)   /* size, not SIZE (typo in the question) */
        {
            Q0 = vld1q_u32(ptr);
            Q1 = vld1q_u32(ptr + 4);
            Q0 = vaddq_u32(Q0, Q0);
            Q2 = vld1q_u32(ptr + 8);
            Q1 = vaddq_u32(Q1, Q1);
            Q3 = vld1q_u32(ptr + 12);
            Q2 = vaddq_u32(Q2, Q2);
            vst1q_u32(ptr, Q0);
            Q3 = vaddq_u32(Q3, Q3);
            vst1q_u32(ptr + 4, Q1);
            vst1q_u32(ptr + 8, Q2);
            vst1q_u32(ptr + 12, Q3);
            ptr += 16;
        }
        return;
    }

There are a few problems with the original code (some of these the optimizer may fix, but others it may not; you need to verify in the generated code):

  • The result of the add is only available in the N3 stage of the NEON pipeline, so the following store will stall.
  • Assuming the compiler is not unrolling the loop, there may be some overhead associated with the loop/branch.
  • It doesn't take advantage of the ability to dual-issue a load/store with another NEON instruction.
  • If the source data isn't in cache then the loads will stall. You can preload the data to speed this up with the __builtin_prefetch intrinsic (see the sketch after this list).
  • Also, as others have pointed out, the operation is fairly trivial; you'll see more gains for more complex operations.
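
As a rough illustration of the prefetch point, a hedged sketch of the original loop with a __builtin_prefetch() hint added; the prefetch distance (16 elements ahead) is an assumption to be tuned, not a measured value:

    #include <arm_neon.h>

    void double_elements_prefetch(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t q;
        for (i = 0; i < size / 4; i++)
        {
            __builtin_prefetch(ptr + 16);   /* hint a future block into the cache */
            q = vld1q_u32(ptr);
            q = vaddq_u32(q, q);
            vst1q_u32(ptr, q);
            ptr += 4;
        }
    }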

If you were to write this with inline assembly you could also:

  • Use aligned loads/stores (which I don't think the intrinsics can generate) and ensure your pointer is always 128-bit aligned, e.g. vld1.32 {q0}, [r1:128]
  • You could also use the post-increment version (which I'm also not sure the intrinsics will generate), e.g. vld1.32 {q0}, [r1:128]! (a combined sketch follows this list)
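
A minimal inline-assembly sketch combining both points, under the assumptions that ptr is 16-byte aligned and size is a multiple of 4 (the alignment-specifier syntax varies between assembler versions, e.g. [r1:128] vs. [r1, :128]):

    void double_elements_aligned(unsigned int *ptr, unsigned int size)
    {
        unsigned int blocks = size / 4;
        __asm__ volatile (
            "1:\n\t"
            "vld1.32 {q0}, [%0:128]\n\t"    /* aligned 128-bit load */
            "vshl.u32 q0, q0, #1\n\t"       /* double each lane */
            "vst1.32 {q0}, [%0:128]!\n\t"   /* aligned store, post-increment */
            "subs %1, %1, #1\n\t"
            "bne 1b\n\t"
            : "+r"(ptr), "+r"(blocks)
            :
            : "q0", "cc", "memory");
    }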

95us for 4000 elements sounds pretty slow: on a 1GHz processor that's ~95 cycles per 128-bit chunk (4000 elements is 1000 chunks, and 95us is 95,000 cycles). You should be able to do better assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.

青春如此纠结 2024-11-09 06:57:26


The question is rather vague and you didn't provide much info, but I'll try to give you some pointers.

  1. You won't know for sure what's going on until you look at the assembly. Use -S, Luke!
  2. You didn't specify the compiler settings. Are you using optimizations? Loop unrolling?
  3. The first function uses size, the second uses SIZE; is this intentional? Are they the same?
  4. What is the size of the array you tried? I don't expect NEON to help at all for a couple of elements.
  5. What is the speed difference? A few percent? A couple of orders of magnitude?
  6. Did you check that the results are the same? Are you sure the code is equivalent?
  7. You're using the same variable for the intermediate result. Try storing the result of the addition in another variable; that could help (though I expect the compiler will be smart and allocate a different register). Also, you could try using a shift (vshl_n_u32) instead of the addition (a sketch follows this list).
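
By way of illustration, a hedged sketch of suggestion 7, using the 128-bit variant vshlq_n_u32 and a separate output variable (the function name is mine, not from the original post):

    #include <arm_neon.h>

    void double_elements_shift(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        for (i = 0; i < size / 4; i++)
        {
            uint32x4_t in  = vld1q_u32(ptr);      /* load 4 lanes */
            uint32x4_t out = vshlq_n_u32(in, 1);  /* out = in << 1 */
            vst1q_u32(ptr, out);
            ptr += 4;
        }
    }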

Edit: thanks for the answers. I've looked around a bit and found this discussion, which says (emphasis mine):

    Moving data from NEON to ARM registers in Cortex-A8 is expensive, so NEON in Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction.

In your case there are no NEON-to-ARM transfers, only loads and stores. Still, it seems that the savings from the parallel operation are eaten up by the non-NEON parts. I would expect better results in code which does many things while in NEON, e.g. color conversions.
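
To make the quoted point concrete, a hedged illustration of the kind of NEON-to-ARM transfer being warned about (not something the code in this question does): reading lanes back into ARM registers with vgetq_lane_u32 forces exactly that expensive crossing.

    #include <arm_neon.h>

    /* Hypothetical example of the expensive pattern: summing lanes by
       moving each one from a NEON register back to the ARM side. */
    unsigned int sum4(const unsigned int *ptr)
    {
        uint32x4_t q = vld1q_u32(ptr);
        return vgetq_lane_u32(q, 0) + vgetq_lane_u32(q, 1)
             + vgetq_lane_u32(q, 2) + vgetq_lane_u32(q, 3);
    }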

旧街凉风 2024-11-09 06:57:26


Process bigger quantities per instruction, interleave the loads/stores, and interleave usage. This function currently doubles (shifts left) 56 uints.

void shiftleft56(const unsigned int* input, unsigned int* output)
{
  __asm__ (
  "vldm %0!, {q2-q8}\n\t"
  "vldm %0!, {q9-q15}\n\t"
  "vshl.u32 q0, q2, #1\n\t"
  "vshl.u32 q1, q3, #1\n\t"
  "vshl.u32 q2, q4, #1\n\t"
  "vshl.u32 q3, q5, #1\n\t"
  "vshl.u32 q4, q6, #1\n\t"
  "vshl.u32 q5, q7, #1\n\t"
  "vshl.u32 q6, q8, #1\n\t"
  "vshl.u32 q7, q9, #1\n\t"
  "vstm %1!, {q0-q6}\n\t"
  // "vldm %0!, {q0-q6}\n\t" if you want to overlap...
  "vshl.u32 q8, q10, #1\n\t"
  "vshl.u32 q9, q11, #1\n\t"
  "vshl.u32 q10, q12, #1\n\t"
  "vshl.u32 q11, q13, #1\n\t"
  "vshl.u32 q12, q14, #1\n\t"
  "vshl.u32 q13, q15, #1\n\t"
  // lost cycle here unless you overlap
  "vstm %1!, {q7-q13}\n\t"
  : "=r"(input), "=r"(output) : "0"(input), "1"(output)
  : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
    "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15", "memory" );
}

What's important to remember for NEON optimization... It has two pipelines: one for loads/stores (with a 2-instruction queue, one pending and one running, typically taking 3-9 cycles each), and one for arithmetic operations (with a 2-instruction pipeline, one executing and one saving its results). As long as you keep these two pipelines busy and interleave your instructions, it will work really fast. Even better, if you have ARM instructions, then as long as you stay in registers the ARM side never has to wait for NEON to be done; they will be executed at the same time (up to 8 instructions in cache)! So you can put some basic loop logic in ARM instructions, and they'll be executed simultaneously.

Your original code was also only using one register value out of 4 (a q register holds four 32-bit values). Three of them were getting a doubling operation for no apparent reason, so you were four times as slow as you could've been.

What would be better in this code is to overlap iterations of this loop by embedding the next block's loads, i.e. adding vldm %0!, {q2-q8} after the vstm %1!, and so on. You can also see that I wait one more instruction before storing out a result, so the pipes are never waiting for something else. Finally, note the !; it means post-increment. The instruction reads/writes the value, and then automatically increments the pointer in the register. I suggest you don't use that register in ARM code, so it won't stall its own pipelines... keep your registers separated, and keep a redundant count variable on the ARM side. A driver sketch along these lines follows.
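
For reference, a hedged driver sketch (the naming is mine, not from the original answer) that walks a larger buffer in 56-uint blocks, keeping the count variable on the ARM side as suggested; since the asm only advances its local copies of the pointers, the caller steps them explicitly:

    /* n is assumed to be a multiple of 56 */
    void double_buffer(const unsigned int *in, unsigned int *out, unsigned int n)
    {
        unsigned int blocks;
        for (blocks = n / 56; blocks != 0; blocks--)
        {
            shiftleft56(in, out);   /* process one 56-uint block */
            in  += 56;
            out += 56;
        }
    }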

Last part... what I said might be true, but not always. It depends on the NEON revision you have. Timing might change in the future, or might not always have been like that. It works for me; YMMV.
