NEON optimization with intrinsics
Learning about ARM NEON intrinsics, I timed a function I wrote to double the elements in an array. The version that uses the intrinsics takes more time than the plain C version of the function.
Without NEON:

void double_elements(unsigned int *ptr, unsigned int size)
{
    unsigned int loop;

    for (loop = 0; loop < size; loop++)
        ptr[loop] <<= 1;
    return;
}
With NEON:

void double_elements(unsigned int *ptr, unsigned int size)
{
    unsigned int i;
    uint32x4_t Q0, vector128Output;

    for (i = 0; i < (SIZE / 4); i++)
    {
        Q0 = vld1q_u32(ptr);
        Q0 = vaddq_u32(Q0, Q0);
        vst1q_u32(ptr, Q0);
        ptr += 4;
    }
    return;
}
Wondering if the load/store operations between the array and the vector are consuming more time, which offsets the benefit of the parallel addition.
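(A side note on the intrinsics version above: when size is not a multiple of four, the last size % 4 elements are never doubled. A minimal portable-C sketch of the loop structure with a remainder loop; the function name is my own, and no NEON is needed, so the logic itself can be checked on any host:)

```c
#include <assert.h>

/* Double all elements: a four-at-a-time main loop (where the
 * vld1q/vaddq/vst1q body would go on NEON), then a scalar tail
 * loop for the size % 4 leftovers the original loop misses. */
void double_elements_with_tail(unsigned int *ptr, unsigned int size)
{
    unsigned int i = 0;

    /* main loop: four elements per iteration */
    for (; i + 4 <= size; i += 4) {
        ptr[i + 0] <<= 1;
        ptr[i + 1] <<= 1;
        ptr[i + 2] <<= 1;
        ptr[i + 3] <<= 1;
    }

    /* tail: at most three remaining elements */
    for (; i < size; i++)
        ptr[i] <<= 1;
}
```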
UPDATE: More info in response to Igor's reply.
1. The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section (label) L7 in both assembly listings, I see that the NEON version has a greater number of assembly instructions (hence more time taken?).
2. I compiled using -mfpu=neon on arm-gcc, with no other flags or optimizations. For the plain version, no compiler flags at all.
3. That was a typo; SIZE was meant to be size. Both are the same.
4, 5. Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call. NEON = 230 us, ordinary = 155 us.
6. Yes, I printed the elements in each case.
7. Did this, no improvement whatsoever.
3 Answers
Something like this might run a bit faster.
There are a few problems with the original code (some of these the optimizer may fix, but others it may not; you need to verify in the generated code):
If you were to write this with inline assembly you could also:
95 us for 4000 elements sounds pretty slow; on a 1 GHz processor that's ~95 cycles per 128-bit chunk. You should be able to do better, assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.
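(The code listing from this answer did not survive on this page. Below is my own sketch of the kind of rewrite it points at: a shift instead of the add, a scalar tail, and a preprocessor guard so the same file builds, and the logic can be checked, on non-NEON hosts. vshlq_n_u32 is the quad-register form of the vshl_n_u32 suggestion from the other answer; the function name is hypothetical.)

```c
#include <assert.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Double each element; the NEON path uses a single shift-by-one
 * (x << 1 == x + x) instead of vaddq_u32. */
void double_elements_v2(uint32_t *ptr, unsigned int size)
{
    unsigned int i = 0;

#ifdef __ARM_NEON
    for (; i + 4 <= size; i += 4) {
        uint32x4_t q = vld1q_u32(ptr + i);
        q = vshlq_n_u32(q, 1);      /* doubling as a left shift */
        vst1q_u32(ptr + i, q);
    }
#endif
    /* scalar tail (and the whole loop on non-NEON builds) */
    for (; i < size; i++)
        ptr[i] <<= 1;
}
```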
The question is rather vague and you didn't provide much info, but I'll try to give you some pointers.
- The first function uses size, the second uses SIZE; is this intentional? Are they the same?
- Try a shift (vshl_n_u32) instead of the addition.

Edit: thanks for the answers. I've looked around a bit and found this discussion, which says (emphasis mine):
In your case there's no NEON to ARM conversion but only loads and stores. Still, it seems that the savings in parallel operation are eaten up by the non-NEON parts. I would expect better results in code which does many things while in NEON, e.g. color conversions.
Process bigger quantities per instruction, interleave the loads/stores, and interleave usage. This function currently doubles (shifts left) 56 uints.
What's important to remember for NEON optimization: it has two pipelines, one for loads/stores (with a two-instruction queue, one pending and one running, typically taking 3-9 cycles each), and one for arithmetic operations (with a two-instruction pipeline, one executing and one saving its results). As long as you keep these two pipelines busy and interleave your instructions, it will work really fast. Even better, if you use ARM instructions, then as long as you stay in registers the ARM side will never have to wait for NEON to be done; they will be executed at the same time (up to 8 instructions in cache)! So you can put some basic loop logic in ARM instructions, and they'll execute simultaneously.
Your original code was also using only one register value out of 4 (a q register holds four 32-bit values). 3 of them were getting a doubling operation for no apparent reason, so you were 4 times as slow as you could've been.
What would be better in this code is to process them interleaved for this loop, by adding the vldm %0!, {q2-q8} following the vstm %1! ... and so on. You can also see that I wait one more instruction before sending out its results, so the pipes are never waiting for something else. Finally, note the !: it means post-increment. It reads/writes the value, and then automatically increments the pointer held in the register. I suggest you don't use that register in ARM code, so it won't hang its own pipelines... keep your registers separated, and have a redundant count variable on the ARM side.
Last part: what I said might be true, but not always. It depends on the NEON revision you currently have. Timing might change in the future, or might not always have been like that. It works for me, YMMV.
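(The assembly listing this answer describes, the vldm/vstm block touching q2-q8, is not preserved on this page. As a rough intrinsics-level illustration of the unroll-and-interleave idea, here is my own sketch; the unroll factor and names are assumptions, a real implementation would use the inline-assembly vldm/vstm form the answer describes, and the guard lets the scalar path build and be checked on non-NEON hosts.)

```c
#include <assert.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Double elements, unrolled so that loads, shifts, and stores from
 * independent iterations can overlap in the two NEON pipelines. */
void double_elements_unrolled(uint32_t *ptr, unsigned int size)
{
    unsigned int i = 0;

#ifdef __ARM_NEON
    for (; i + 8 <= size; i += 8) {
        /* issue both loads before either result is consumed */
        uint32x4_t a = vld1q_u32(ptr + i);
        uint32x4_t b = vld1q_u32(ptr + i + 4);
        a = vshlq_n_u32(a, 1);
        b = vshlq_n_u32(b, 1);
        vst1q_u32(ptr + i, a);
        vst1q_u32(ptr + i + 4, b);
    }
#endif
    /* scalar tail (and the whole loop on non-NEON builds) */
    for (; i < size; i++)
        ptr[i] <<= 1;
}
```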