ARM Cortex-A8: What is the difference between VFP and NEON?
In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.
But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?
I read a few links, such as this one.
But it is not really clear what they mean. They say that VFP was never intended to be used for SIMD, but on Wikipedia I read the following: "The VFP architecture also supports execution of short vector instructions but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism."
So it is not clear what to believe. Can anyone elaborate more on this topic?
5 Answers
There are quite some differences between the two. Neon is a SIMD (Single Instruction Multiple Data) accelerator processor that is part of the ARM core. It means that during the execution of one instruction the same operation will occur on up to 16 data sets in parallel. Since there is parallelism inside Neon, you can get more MIPS or FLOPS out of Neon than out of a standard SISD processor running at the same clock rate.
The biggest benefit of Neon is if you want to execute operations on vectors, i.e. video encoding/decoding. It can also perform single-precision floating-point (float) operations in parallel.
VFP is a classic floating-point hardware accelerator. It is not a parallel architecture like Neon. Basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating-point calculations. It supports single- and double-precision floating point.
You have 3 possibilities to use Neon:
- use the NEON intrinsics (#include "arm_neon.h"),
- write the NEON assembly code yourself, or
- let GCC do the optimizations for you by supplying -mfpu=neon as an argument (gcc 4.5 is good on this).
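To make the intrinsics route concrete, here is a minimal sketch (the function name and the assumption that the length is a multiple of 4 are mine, not part of the answer). It adds two float arrays four lanes at a time and is meant to be built with something like gcc -O2 -mfpu=neon -mfloat-abi=softfp:

#include <arm_neon.h>

/* Adds two float arrays; n is assumed to be a multiple of 4. */
void add_f32(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);       /* load 4 floats from b */
        float32x4_t vr = vaddq_f32(va, vb);      /* one instruction adds all 4 lanes */
        vst1q_f32(dst + i, vr);                  /* store 4 results */
    }
}

The same loop written with scalar VFP instructions would perform the four additions one after another, which is exactly the difference this answer describes.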
For armv7 ISA (and variants)
The NEON is a SIMD and parallel data processing unit for integer and floating point data and the VFP is a fully IEEE-754 compatible floating point unit. In particular on the A8, the NEON unit is much faster for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.
So why would you ever use the VFP?!
The biggest difference is that the VFP provides double-precision floating point.
Secondly, the VFP offers some specialized instructions that have no equivalent implementations in the NEON unit. SQRT comes to mind, and perhaps some type conversions (see the small sketch at the end of this answer).
But the most important difference, not mentioned in Cosmin's answer, is that the NEON floating-point pipeline is not entirely IEEE-754 compliant. The best description of the differences is in the FPSCR Register Description.
Because it is not IEEE-754 compliant, a compiler cannot generate these instructions unless you tell the compiler that you are not interested in full compliance. This can be done in several ways.
For example, newer GCC versions with -mfpu=neon will not generate floating-point NEON instructions unless you also specify -funsafe-math-optimizations.
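As a hedged illustration of that point (the file name and loop are made up for the example): a plain single-precision loop such as

/* scale.c */
#include <stddef.h>

void scale(float *x, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= s;      /* candidate for NEON auto-vectorization */
}

built with something like

gcc -O3 -mfpu=neon -mfloat-abi=softfp -S scale.c

will, on the GCC versions I have in mind, stay on scalar VFP instructions, whereas adding -funsafe-math-optimizations (or -ffast-math, which implies it) allows the compiler to emit NEON single-precision vector instructions, precisely because armv7 NEON floating point is not fully IEEE-754 compliant.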
For armv8+ ISA (and variants) [Update]
NEON is now fully IEEE-754 compliant, and from a programmer's (and compiler's) point of view, there is actually not much difference. Double precision has been vectorized. From a micro-architecture point of view I kind of doubt they are even different hardware units. ARM does document scalar and vector instructions separately, but both are part of "Advanced SIMD."
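Coming back to the armv7 reasons to keep the VFP around (double precision and square roots), here is a minimal hedged sketch; the function name is made up. On armv7, code like this has to execute on the VFP even when NEON is enabled, because NEON there is single-precision only and provides only reciprocal square-root estimates, not an exact square root:

#include <math.h>

/* Double-precision math: on armv7 this runs on the VFP (e.g. vmul.f64,
   vadd.f64 and typically a vsqrt.f64, possibly via a libm call),
   regardless of -mfpu=neon. */
double norm2d(double x, double y)
{
    return sqrt(x * x + y * y);
}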
Architecturally, VFP (it wasn't called Vector Floating Point for nothing) indeed has a provision for operating on a floating-point vector in a single instruction. I don't think it ever actually executes multiple operations simultaneously (like true SIMD), but it could save some code size. However, if you read the ARM Architecture Reference Manual in the Shark help (as I describe in my introduction to NEON, link 1 in the question), you'll see in section A2.6 that the vector feature of VFP is deprecated in ARMv7 (which is what the Cortex A8 implements), and software should use Advanced SIMD for floating-point vector operations.
Worse yet, in the Cortex A8 implementation, VFP is implemented with a VFP Lite execution unit (read "lite" as occupying a smaller silicon surface, not as having fewer features), which means that it's actually slower than on the ARM11, for instance! Fortunately, most single-precision VFP instructions get executed by the NEON unit, but I'm not sure vector VFP operations do; and even if they do, they certainly execute slower than with NEON instructions.
Hope that clears things up!
IIRC, the VFP is a floating point coprocessor which works sequentially.
This means that you can use an instruction on a vector of floats for SIMD-like behaviour, but internally, the instruction is performed on each element of the vector in sequence.
While this reduces the overall time required for the operation, because only a single instruction has to be loaded, the VFP still needs time to process all elements of the vector.
True SIMD will gain more net floating-point performance, but using the VFP with vectors is still faster than using it purely sequentially.
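As a rough illustration with made-up numbers (not from this answer): if fetching and issuing an instruction costs 1 cycle and each floating-point operation costs 1 cycle, then four scalar operations cost about 4 x (1 + 1) = 8 cycles, one VFP short-vector instruction costs about 1 + 4 = 5 cycles (one issue, four sequential element operations), and one true SIMD instruction costs about 1 + 1 = 2 cycles (one issue, four elements in parallel). The vector encoding saves issue overhead; only SIMD saves execution time.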
VFP: Think of the x87 FPU for x86 CPUs. It added FPU capabilities to ARM CPUs that would otherwise only have supported integer instructions. Since the ARMv8 generation (AArch64), the FPU is part of the core instruction set, just like x87 became core with the Pentium (some 486 CPUs still had no FPU).
NEON: Think of SSE2. It added SIMD capabilities. Since the ARMv8 generation (AArch64), NEON is part of the core instruction set, just like SSE became core with x86_64 (as there is no x86_64 CPU that does not support at least SSE2).
As for vector instructions, ARM has always had vector instructions all over the place. For example, the ARMv6 generation added vector integer instructions as part of the core instruction set. So ARM never had a clear distinction between "vector" and "non-vector" like other CPU architectures where all core instructions were non-vector and only special vector extensions added vector support. Today's x86 CPUs also have vector operations all over the place.
And your Wikipedia quote does not contradict anything you said or referenced earlier. The VFP unit also has vector instructions, but these only reduce code size, because one instruction can specify multiple operations at once; the operations themselves are still executed one at a time inside the VFP unit, so there is no speed benefit, only a code-size benefit. NEON, on the other hand, executes these operations truly in parallel, giving you a huge speed benefit.