Scaling and Normalizing a Vector with the Apple Accelerate Framework
What functions can I use in Accelerate.framework to scale a vector by a scalar, and to normalize a vector? I found one in the documentation that I think might work for scaling, but I am confused about its operation.
vDSP_vsma
Vector scalar multiply and vector add; single precision.
void vDSP_vsma (
   const float *__vDSP_A,
   vDSP_Stride  __vDSP_I,
   const float *__vDSP_B,
   const float *__vDSP_C,
   vDSP_Stride  __vDSP_K,
   float       *__vDSP_D,
   vDSP_Stride  __vDSP_L,
   vDSP_Length  __vDSP_N
);
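For reference, the operation is D[n] = A[n*I] * B[0] + C[n*K] for n = 0 … N-1: A is multiplied by the scalar pointed to by B, and vector C is added. A minimal sketch, assuming you pass a zero vector as C so the call reduces to a plain scale (vDSP_vsmul() does the same without the add):

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        float a[3]     = {1.0f, 2.0f, 3.0f};
        float zeros[3] = {0.0f, 0.0f, 0.0f};
        float d[3];
        float scalar   = 2.5f;
        // d[n] = a[n] * scalar + zeros[n]  ->  {2.5, 5.0, 7.5}
        vDSP_vsma(a, 1, &scalar, zeros, 1, d, 1, 3);
        // A plain scale could instead be: vDSP_vsmul(a, 1, &scalar, d, 1, 3);
        printf("%f %f %f\n", d[0], d[1], d[2]);
        return 0;   // build with: clang example.c -framework Accelerate
    }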
The easiest way to normalize a vector in-place is something like the sketch below. You'll need one or the other of the relevant headers/reference sections (or both). Note that several of the functions are in the "matrix" section even though they operate on vectors.
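A minimal sketch of that in-place idea, assuming the cBLAS routines cblas_snrm2() and cblas_sscal() that ship with Accelerate (one reasonable pairing, not necessarily the one intended here):

    #include <Accelerate/Accelerate.h>

    // Normalize v (n elements) in place: scale by 1/||v||_2.
    // Assumes the norm is non-zero.
    static void normalize_inplace(float *v, int n) {
        float norm = cblas_snrm2(n, v, 1);   // Euclidean norm of v
        cblas_sscal(n, 1.0f / norm, v, 1);   // v *= 1/norm
    }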
If you want to use the vDSP functions, see the Vector-Scalar Division section. There are several things you can do (a sketch of the first option follows this list):

- vDSP_dotpr(), sqrt(), and vDSP_vsdiv()
- vDSP_dotpr(), vrsqrte_f32(), and vDSP_vsmul() (vrsqrte_f32() is a NEON GCC built-in, though, so you need to check you're compiling for armv7)
- vDSP_rmsqv(), multiply by sqrt(n), and vDSP_vsdiv()
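A sketch of the first option, assuming a single vector v of n elements normalized in place:

    #include <Accelerate/Accelerate.h>
    #include <math.h>

    // Normalize v (n elements) using vDSP_dotpr(), sqrtf() and vDSP_vsdiv().
    static void vdsp_normalize_inplace(float *v, vDSP_Length n) {
        float sumsq, norm;
        vDSP_dotpr(v, 1, v, 1, &sumsq, n);   // sumsq = sum of v[i]*v[i]
        norm = sqrtf(sumsq);                 // Euclidean length (assumed non-zero)
        vDSP_vsdiv(v, 1, &norm, v, 1, n);    // v[i] /= norm, done in place
    }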
The reason why there isn't a vector-normalization function is that the "vector" in vDSP means "lots of things at once" (up to around 4096/8192) and isn't necessarily the "vector" from linear algebra. It's pretty meaningless to normalize a 1024-element vector, and a quick function for normalizing a 3-element vector isn't something that will make your app significantly faster, which is why there isn't one.

The intended usage of vDSP is more like normalizing 1024 2- or 3-element vectors. I can spot a handful of ways to do this (a sketch of the first one follows this list):

- vDSP_vdist() to get a vector of lengths, followed by vDSP_vdiv(). You have to use vDSP_vdist() multiple times for vectors of length greater than 2, though.
- vDSP_vsq() to square all the inputs, vDSP_vadd() multiple times to add them all up, the equivalent of vDSP_vsqrt() or vDSP_vrsqrt(), and then vDSP_vmul() or vDSP_vdiv() as appropriate. It shouldn't be too hard to write the equivalent of vDSP_vsqrt() or vDSP_vrsqrt().

Of course, if you don't have 1024 vectors to normalize, don't overcomplicate things.
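As an illustration of the first method, a sketch for a batch of 2-element vectors, assuming they are stored interleaved as {x0, y0, x1, y1, ...}:

    #include <Accelerate/Accelerate.h>

    // Normalize `count` 2-element vectors in place; `len` is scratch space
    // of `count` floats. Lengths are assumed non-zero.
    static void normalize_2d_batch(float *v, float *len, vDSP_Length count) {
        // len[n] = sqrt(x[n]^2 + y[n]^2); stride 2 walks the x's and y's.
        vDSP_vdist(v, 2, v + 1, 2, len, 1, count);
        // vDSP_vdiv takes the divisor as its first argument: C[n] = A[n] / B[n].
        vDSP_vdiv(len, 1, v, 2, v, 2, count);          // x components /= length
        vDSP_vdiv(len, 1, v + 1, 2, v + 1, 2, count);  // y components /= length
    }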
Notes:
L1 data caches have been 32K for around a decade or more (they may be shared between virtual cores in a hyperthreaded CPU, and some older/cheaper processors might have 16K), so the most you should do is around 8192 for in-place operation on floats. You might want to subtract a little for stack space, and if you're doing several sequential operations you probably want to keep it all in cache; 1024 or 2048 seem pretty sensible and any more will probably hit diminishing returns. If you care, measure performance...