Fastest way to resize (rescale) a 1D vector by an arbitrary factor
I have the following code that resizes a 1D vector with nearest neighbor interpolation, in the same way you would resize an image. Another term would be resampling, but there seems to be a lot of confusion around these terms (resampling is also a technique in statistics), so I prefer to be more descriptive.
Currently the code looks like this and I need to optimize it:
    inline void resizeNearestNeighbor(const int16_t* current, uint32_t currentSize, int16_t* out, uint32_t newSize, uint32_t offset = 0u)
    {
        if(currentSize == newSize)
        {
            return;
        }
        const float scaleFactor = static_cast<float>(currentSize) / static_cast<float>(newSize);
        for(uint32_t outIdx = 0; outIdx<newSize; ++outIdx)
        {
            const int currentIdx = static_cast<uint32_t>(outIdx * scaleFactor);
            out[outIdx] = current[(currentIdx + offset)%currentSize];
        }
    }
This of course is not hugely efficient, because taking the integer part of a float by downcasting is expensive, and I don't think the loop can benefit from vectorization in this case. The platform is Cortex M7, so if you're familiar with any vectorization techniques on this platform, that would also be very helpful.
The use case of this code is a sound effect that allows for smoothly changing the length of a delay line (hence the additional offset parameter, since it's a ring buffer). Being able to smoothly change the length of a delay line sounds like slowing down or speeding up playback in a tape recorder, only it's in a loop. Without this scaling, there are lots of clicking noises and artifacts. Currently the hardware struggles with all the DSP and this code on top of that and it can't rescale long delay lines in real time.
Comments (2)
Since the Cortex-M series is quite limited (even floating point in M7 is optional), I would expect a reasonable speed-up from using Bresenham's midpoint line drawing algorithm.
This algorithm always advances either N or N+1 elements based on the sign of the error term. The modulus does not need a full-length division: it suffices to compute
    currentIdx += N + (delta < 0); if (currentIdx >= currentSize) currentIdx -= currentSize;
One can also make a "trial division" of the form
    if (currentIdx + 64 * (N+1) < currentSize)
to ensure that the next 64 elements do not need modular reduction. M7 has a multiplication unit, but multiplying by shifting is still likely a faster micro-optimisation.
Bresenham's algorithm for line drawing is usually stated in terms of the endpoints x0, x1, y0, y1. Your application does not have x0, x1, y0, y1, but instead it directly has dy = input_size, dx = output_size. The crucial modification to advance y by N > 0 steps is to use dy = dy % dx to get the error computation correct.
One can also use a slightly less accurate fixed-point DDA algorithm.
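The stepping described above can be sketched roughly as follows. This is only an illustration under some assumptions, not the answerer's code. The function name resizeBresenham is made up. It tracks an unsigned remainder accumulator (err) instead of a signed error term, which leads to the same "advance N or N+1" decision. It assumes both sizes are below 2^31, so the additions cannot overflow. Unlike the code in the question, it also copies when the sizes are equal.

```cpp
#include <cstdint>

// Hypothetical sketch: nearest-neighbour resize of a ring buffer using
// integer-only Bresenham-style stepping (no float, no per-sample modulo).
// Assumes currentSize and newSize are both below 2^31.
inline void resizeBresenham(const int16_t* current, uint32_t currentSize,
                            int16_t* out, uint32_t newSize, uint32_t offset = 0u)
{
    if (currentSize == 0u || newSize == 0u)
    {
        return;
    }
    const uint32_t step = currentSize / newSize; // N: whole input elements per output element
    const uint32_t rem = currentSize % newSize;  // dy % dx: fractional part, in units of 1/newSize
    uint32_t err = 0u;                           // accumulated remainder, always < newSize
    uint32_t idx = offset % currentSize;         // the only modulo, done once outside the loop

    for (uint32_t outIdx = 0u; outIdx < newSize; ++outIdx)
    {
        out[outIdx] = current[idx];
        idx += step;
        err += rem;
        if (err >= newSize) // remainder overflowed: advance N+1 instead of N
        {
            err -= newSize;
            ++idx;
        }
        if (idx >= currentSize) // ring-buffer wrap by a single subtraction
        {
            idx -= currentSize;
        }
    }
}
```

With these definitions the source index for output element k is offset + floor(k * currentSize / newSize), wrapped into the ring buffer, so the mapping is exact rather than subject to float rounding.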
If you look at currentIdx, you'll note that it is incremented by scaleFactor every time outIdx is incremented by one. Hence, you can replace outIdx * scaleFactor with currentIdx += scaleFactor, keeping currentIdx as a fractional accumulator that is truncated only when indexing.
You'd initialize currentIdx to offset, so that's hoisted from the loop as well.
%currentSize is an expensive operation as well, and one that appears to exist only for the non-zero offset case. You might want to treat that differently, and split the loop in two loops (before/after wrap-around point).
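A minimal sketch of this idea might look like the following. It is an assumption-laden illustration rather than the answerer's code. The name resizeFixedPointSplit is hypothetical, and a 32.32 fixed-point accumulator replaces the float (essentially the fixed-point DDA mentioned in the other answer), so that long buffers do not accumulate float rounding error. The wrap-around point is computed once, and the work is split into two loops with no modulo and no wrap test inside either loop.

```cpp
#include <cstdint>

// Hypothetical sketch: incremental position accumulator (32.32 fixed point)
// with the loop split at the ring-buffer wrap-around point.
inline void resizeFixedPointSplit(const int16_t* current, uint32_t currentSize,
                                  int16_t* out, uint32_t newSize, uint32_t offset = 0u)
{
    if (currentSize == 0u || newSize == 0u)
    {
        return;
    }
    offset %= currentSize; // normalise once, outside the loops

    // Input elements per output element, as 32.32 fixed point (rounded down).
    const uint64_t step = (static_cast<uint64_t>(currentSize) << 32) / newSize;

    // Number of input elements from offset up to the end of the buffer.
    const uint32_t firstLen = currentSize - offset;

    // First output index whose source index reaches the wrap point:
    // ceil((firstLen << 32) / step), clamped to newSize.
    const uint64_t wrapPos = static_cast<uint64_t>(firstLen) << 32;
    uint64_t split = wrapPos / step + ((wrapPos % step) != 0u ? 1u : 0u);
    if (split > newSize)
    {
        split = newSize;
    }

    uint64_t pos = 0u; // fixed-point position relative to offset
    uint32_t outIdx = 0u;

    // Before the wrap-around point: offset + index stays inside the buffer.
    for (; outIdx < split; ++outIdx, pos += step)
    {
        out[outIdx] = current[offset + static_cast<uint32_t>(pos >> 32)];
    }
    // After the wrap-around point: continue from the start of the buffer.
    for (; outIdx < newSize; ++outIdx, pos += step)
    {
        out[outIdx] = current[static_cast<uint32_t>(pos >> 32) - firstLen];
    }
}
```

Because step is rounded down, the accumulated position can occasionally land one sample earlier than the exact quotient would give, which is the usual trade-off of a fixed-point DDA. The 64-bit divisions happen once per call rather than once per sample.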