Cortex A9 NEON 与 VFP 使用混淆

发布于 12-02 08:33 字数 1678 浏览 3 评论 0原文

我正在尝试为 Cortex A9 ARM 处理器（更具体地说是 OMAP4）构建一个库，对于在浮点运算和 SIMD 上下文中使用 NEON 与 VFP 的情况，我有点困惑。需要注意的是，我知道两个硬件协处理器单元之间的区别（也概述了这里），我只是对它们的正确用法有一些误解。

与此相关，我使用以下编译标志：

GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

我已经阅读了 ARM 文档，很多 wiki(像这个），论坛和博客文章，每个人似乎都同意使用 NEON 比使用 VFP 更好或者至少混合 NEON（例如使用内联函数在 SIMD 中实现一些算法）和 VFP 并不是一个好主意；我还不能 100% 确定这是否适用于整个应用程序\库的上下文或仅适用于代码中的特定位置（函数）。

因此，我使用 neon 作为我的应用程序的 FPU，因为我也想使用内在函数。因此，我遇到了一些麻烦，并且我对如何在 Cortex A9 上最好地使用这些功能（NEON 与 VFP）的困惑只会进一步加深，而不是消除。我有一些代码可以为我的应用程序进行基准测试并使用一些定制的计时器类其中计算基于双精度浮点。使用 NEON 作为 FPU 会产生完全不合适的结果（尝试打印这些值会导致打印大部分 inf 和 NaN；相同的代码在为 x86 构建时可以顺利运行）。因此，我将计算更改为使用单精度浮点，因为有记录表明 NEON 不处理双精度浮点。我的基准测试仍然没有给出正确的结果（最糟糕的是现在它不再在 x86 上工作；我认为这是因为精度损失，但我不确定）。所以我几乎完全迷失了：一方面我想使用 NEON 来实现 SIMD 功能，并且使用它作为 FPU 不能提供正确的结果，另一方面将它与 VFP 混合似乎不是一个好主意。在这方面的任何建议将不胜感激！

我在上述 wiki 的文章中找到了在 NEON 上下文中应该如何进行浮点优化的摘要：

“

仅使用单精度浮点，
当您发现瓶颈 FP 函数时，请使用 NEON 内在函数/ASM。您可以比编译器做得更好。
最小化条件分支
启用 RunFast 模式

对于 softfp：

内联浮点代码（除非它非常大）
通过指针而不是按值传递 FP 参数，并在函数调用之间进行整数工作

“

我不能。对 float ABI 使用hard，因为我无法链接到我可用的库。大多数建议对我来说都是有意义的（除了“运行快速模式”，我不完全理解应该做什么，而且事实上，此时我可以比编译器做得更好），但我不断得到不一致的结果，我现在什么都不确定。

谁能阐明如何正确使用 Cortex A9/A8 的浮点和 NEON 以及我应该使用哪些编译标志？

原文

I'm trying to build a library for a Cortex A9 ARM processor(an OMAP4 to be more specific) and I'm in a little bit of confusion regarding which\when to use NEON vs VFP in the context of floating point operations and SIMD. To be noted that I know the difference between the 2 hardware coprocessor units(as also outlined here on SO), I just have some misunderstanding regarding their proper usage.

Related to this I'm using the following compilation flags:

GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp

I've read through the ARM documentation, a lot of wiki(like this one), forum and blog posts and everybody seems to agree that using NEON is better than using VFP
or at least mixing NEON(e.g. using the instrinsics to implement some algos in SIMD) and VFP is not such a good idea; I'm not 100% sure yet if this applies in the context of the entire application\library or just to specific places(functions) in code.

So I'm using neon as the FPU for my application as I also want to use the intrinsics. As a result I'm in a little bit of trouble and my confusion on how to best use these features(NEON vs VFP) on the Cortex A9 just deepens further instead of clearing up. I have some code that does benchmarking for my app and uses some custom made timer classes
in which calculations are based on double precision floating point. Using NEON as the FPU gives completely inappropriate results(trying to print those values results in printing mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single precision floating point as is documented that NEON does not handle double precision floating point. My benchmarks still don't give the proper results(and what's worst is that now it does not work anymore on x86; I think it's because of the lost in precision but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for the SIMD capabilities and using it as the FPU does not provide the proper results, on the other hand mixing it with the VFP does not seem a very good idea.
Any advice in this area will be greatly appreciated !!

I found in the article in the above mentioned wiki a summary of what should be done for floating point optimization in the context of NEON:

Only use single precision floating point
Use NEON intrinsics / ASM when ever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode

For softfp:

Inline floating point code (unless its very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.

I cannot use hard for the float ABI as I cannot link with the libraries I have available.
Most of the reccomendations make sense to me(except the "runfast mode" which I don't understand exactly what's supposed to do and the fact that at this moment in time I could do better than the compiler) but I keep getting inconsistent results and I'm not sure of anything right now.

Could anyone shed some light on how to properly use the floating point and the NEON for the Cortex A9/A8 and which compilation flags should I use?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

吝吻2024-12-09 08:33:50

...论坛和博客文章，每个人似乎都同意使用 NEON 比使用 VFP 更好，或者至少混合 NEON（例如使用内联函数在 SIMD 中实现一些算法），而 VFP 并不是一个好主意

我不确定这是正确的。根据 ARM 介绍 NEON 开发文章| NEON 寄存器：

NEON 寄存器组由 32 个 64 位寄存器组成。如果两者都
实现了高级SIMD和VFPv3，它们共享该寄存器
银行。在这种情况下，VFPv3 以 VFPv3-D32 形式实现：
支持32个双精度浮点寄存器。这
集成简化了上下文切换支持的实现，因为
保存和恢复 VFP 上下文的相同例程也保存和恢复
恢复 NEON 上下文。
NEON 单元可以查看相同的寄存器组：
16 个 128 位四字寄存器，Q0-Q15
32 个 64 位双字寄存器，D0-D31。
NEON D0-D31 寄存器与 VFPv3 D0-D31 寄存器相同
每个 Q0-Q15 寄存器映射到一对 D 寄存器。
图1.3展示了共享NEON和VFP的不同视图
注册银行。所有这些视图都可以随时访问。软件
不必在它们之间显式切换，因为
使用的指令决定适当的视图。

寄存器不竞争；相反，它们作为寄存器组的视图共存。没有办法吐出 NEON 和 FPU 装备。

与此相关，我使用以下编译标志：
<前><代码>-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp

这就是我所做的；您的里程可能会有所不同。它源自从平台和编译器收集的信息的混搭。

gnueabihf 告诉我平台使用硬浮点数，这可以加快程序调用速度。如果有疑问，请使用 softfp 因为它与硬浮动兼容。

BeagleBone Black：

$ gcc -v 2>&1 | grep Target          
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
model name  : ARMv7 Processor rev 2 (v7l)
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
...

因此 BeagleBone 使用：

-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard

CubieTruck v5：

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 5 (v7l)
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4

因此 CubieTruck 使用：

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Banana Pi Pro：

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 4 (v7l)
Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt

因此 Banana Pi 使用：

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Raspberry Pi 3：

RPI3 的独特之处在于它采用 ARMv8，但它运行的是32 位操作系统。这意味着它实际上是 32 位 ARM 或 Aarch32。 32 位 ARM 与 Aarch32 之间还有更多区别，但这将向您显示 Aarch32 标志。

此外，RPI3 使用 Broadcom A53 SoC，并且具有 NEON 和可选的 CRC32 指令，但缺少可选的加密扩展。

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo 
model name  : ARMv7 Processor rev 4 (v7l)
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...

所以Raspberry Pi可以使用：

-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard

或者它可以使用（我不知道-mtune使用什么）：

-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard

ODROID C2：

ODROID C2使用Amlogic A53 SoC，但它使用64位操作系统。 ODROID C2，它具有 NEON 和可选的 CRC32 指令，但缺少可选的加密扩展（与 RPI3 类似的配置）。

$ gcc -v 2>&1 | grep Target 
Target: aarch64-linux-gnu

$ cat /proc/cpuinfo 
Features    : fp asimd evtstrm crc32

所以 ODROID 使用：

-march=armv8-a+crc -mtune=cortex-a53

在上面的食谱中，我通过检查数据表了解了 ARM 处理器（如 Cortex A9 或 A53）。根据 Unix and Linux Stack Exchange 上的这个答案，它破译了 /proc/cpuinfo：

CPU 部件：部件号。 0xd03表示Cortex-A53处理器。

因此我们可以从数据库中查找该值。我不知道它是否存在或位于何处。

... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON(e.g. using the instrinsics to implement some algos in SIMD) and VFP is not such a good idea

I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:

The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.

The registers don't compete; rather, they co-exist as views of the register bank. There's no way to disgorge the NEON and FPU gear.

Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp

Here's what I do; your mileage may vary. Its derived from a mashup of information gathered from the platform and compiler.

gnueabihf tells me the platform use hard floats, which can speed up procedural calls. If in doubt, use softfp because its compatible with hard floats.

BeagleBone Black:

$ gcc -v 2>&1 | grep Target          
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
model name  : ARMv7 Processor rev 2 (v7l)
Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
...

So the BeagleBone uses:

-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard

CubieTruck v5:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 5 (v7l)
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4

So the CubieTruck uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Banana Pi Pro:

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo
Processor   : ARMv7 Processor rev 4 (v7l)
Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt

So the Banana Pi uses:

-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

Raspberry Pi 3:

The RPI3 is unique in that its ARMv8, but its running a 32-bit OS. That means its effectively 32-bit ARM or Aarch32. There's a little more to 32-bit ARM vs Aarch32, but this will show you the Aarch32 flags

Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

$ gcc -v 2>&1 | grep Target 
Target: arm-linux-gnueabihf

$ cat /proc/cpuinfo 
model name  : ARMv7 Processor rev 4 (v7l)
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...

So the Raspberry Pi can use:

-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard

Or it can use (I don't know what to use for -mtune):

-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard

ODROID C2:

ODROID C2 uses an Amlogic A53 SoC, but it uses a 64-bit OS. The ODROID C2, it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (similar config to RPI3).

$ gcc -v 2>&1 | grep Target 
Target: aarch64-linux-gnu

$ cat /proc/cpuinfo 
Features    : fp asimd evtstrm crc32

So the ODROID uses:

-march=armv8-a+crc -mtune=cortex-a53

In the above recipes, I learned the ARM processor (like Cortex A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers output from /proc/cpuinfo: