Apple Accelerate Framework: Scale and Normalize a Vector

Posted 2024-10-04 13:56:07

What functions can I use in Accelerate.framework to scale a vector by a scalar and to normalize a vector? I found one in the documentation that I think might work for scaling, but I am confused about its operation.

vDSP_vsma
Vector scalar multiply and vector add; single precision.

void vDSP_vsma (
   const float *__vDSP_A,
   vDSP_Stride __vDSP_I,
   const float *__vDSP_B,
   const float *__vDSP_C,
   vDSP_Stride __vDSP_K,
   float *__vDSP_D,
   vDSP_Stride __vDSP_L,
   vDSP_Length __vDSP_N
);
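
To make the operation concrete: going by the prototype and the one-line description above, vDSP_vsma multiplies each element of A by the single scalar that B points to and then adds the matching element of C, storing the result in D. The loop below is a plain-C reference for that arithmetic, not a call into Accelerate; the name vsma_reference is hypothetical, and the stride handling is inferred from the parameter list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for the arithmetic vDSP_vsma performs:
 *   d[i * id] = a[i * ia] * (*b) + c[i * ic]   for i in [0, n).
 * Illustration only; this does not call Accelerate. */
void vsma_reference(const float *a, ptrdiff_t ia,
                    const float *b,               /* pointer to ONE scalar */
                    const float *c, ptrdiff_t ic,
                    float *d, ptrdiff_t id,
                    size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; ++i) {
        ptrdiff_t k = (ptrdiff_t)i;
        d[k * id] = a[k * ia] * (*b) + c[k * ic];
    }
}
```

Pure scaling is the same loop without the addend, so with vsma you would have to pass a vector of zeros as C. A vector-scalar multiply such as vDSP_vsmul() is the more direct fit for scaling.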

Comments (1)

天冷不及心凉 2024-10-11 13:56:07

The easiest way to normalize a vector in-place is something like

int n = 3;
float v[3] = {1, 2, 3};
cblas_sscal(n, 1.0 / cblas_snrm2(n, v, 1), v, 1);

You'll need to

#include <cblas.h>

or

#include <vblas.h>

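(or both). On Apple platforms the umbrella header <Accelerate/Accelerate.h> should pull in both. Note that several of these functions are listed in the "matrix" section even though they operate on vectors.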

If you want to use the vDSP functions, see the Vector-Scalar Division section. There are several things you can do:

  • vDSP_dotpr(), sqrt(), and vDSP_vsdiv()
  • vDSP_dotpr(), vrsqrte_f32(), and vDSP_vsmul() (vrsqrte_f32() is a NEON GCC built-in, though, so you need to check you're compiling for armv7).
  • vDSP_rmsqv(), multiply by sqrt(n), and vDSP_vsdiv()
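
The first and third routes compute the same length. A short derivation, assuming vDSP_dotpr() is applied to the vector with itself and vDSP_rmsqv() returns the usual root mean square of the n elements:

```latex
\[
\lVert v \rVert = \sqrt{v \cdot v} = \sqrt{\sum_{i=1}^{n} v_i^2},
\qquad
\operatorname{rms}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} v_i^2}
\;\Longrightarrow\;
\lVert v \rVert = \sqrt{n}\,\operatorname{rms}(v),
\qquad
\hat{v} = \frac{v}{\lVert v \rVert}.
\]
```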

There isn't a vector-normalization function because the "vector" in vDSP means "lots of things at once" (up to around 4096/8192) and not necessarily the "vector" from linear algebra. It's pretty meaningless to normalize a 1024-element vector, and a quick function for normalizing a 3-element vector isn't something that will make your app significantly faster.

The intended usage of vDSP is more like normalizing 1024 2- or 3-element vectors. I can spot a handful of ways to do this:

  • Use vDSP_vdist() to get a vector of lengths, followed by vDSP_vdiv(). You have to use vDSP_vdist() multiple times for vectors of length greater than 2, though.
  • Use vDSP_vsq() to square all the inputs, vDSP_vadd() multiple times to add all of them, the equivalent of vDSP_vsqrt() or vDSP_vrsqrt(), and vDSP_vmul() or vDSP_vdiv() as appropriate. It shouldn't be too hard to write the equivalent of vDSP_vsqrt() or vDSP_vrsqrt().
  • Various ways that pretend your input is a complex vector. These are not likely to be faster.

Of course, if you don't have 1024 vectors to normalize, don't overcomplicate things.

Notes:

  1. I avoid the terms "2-vector" and "3-vector" to prevent confusion with the "four-vector" from relativity.
  2. A good choice of n is one that nearly fills your L1 data cache. It's not difficult; they've been relatively fixed at 32K for around a decade or more (they may be shared between virtual cores in a hyperthreaded CPU and some older/cheaper processors might have 16K), so the most you should do is around 8192 for in-place operation on floats. You might want to subtract a little for stack space, and if you're doing several sequential operations you probably want to keep it all in cache; 1024 or 2048 seem pretty sensible and any more will probably hit diminishing returns. If you care, measure performance...
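
The 8192 figure in note 2 follows from dividing the cache size by the element size, assuming a 32 KiB L1 data cache and 4-byte single-precision floats:

```latex
\[
n_{\max} = \frac{32 \times 1024\ \text{bytes}}{4\ \text{bytes per float}} = 8192\ \text{floats}
\]
```

An operation that reads one array and writes a separate one would fit half as many elements per array, 4096, which is presumably where the 4096/8192 range mentioned earlier comes from.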