Cepstral analysis for pitch detection
I'm looking to extract pitches from a sound signal.
Someone on IRC just explained to me how taking a double FFT achieves this. Specifically:
- take FFT
- take log of square of absolute value (can be done with lookup table)
- take another FFT
- take absolute value
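For concreteness, here is a minimal NumPy sketch of those four steps as I understand them (purely illustrative; the small epsilon is my own addition to avoid log(0)):

```python
import numpy as np

def double_fft_cepstrum(frame):
    """FFT -> log of squared magnitude -> second FFT -> absolute value."""
    spectrum = np.fft.rfft(frame)                       # step 1: FFT
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)   # step 2: log of square of absolute value
    return np.abs(np.fft.rfft(log_power))               # steps 3 and 4: another FFT, absolute value
```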
I can't understand how I didn't come across this technique earlier. I did a lot of hunting and asking questions; several weeks' worth. More to the point, I can't understand why I didn't think of it.

I am attempting this with the vDSP library. It looks as though it has functions to handle all of these tasks.
However, I'm wondering about the accuracy of the final result.
I have previously used a technique which scours the frequency bins of a single FFT for local maxima. When it encounters one, it uses a cunning technique (the change in phase since the last FFT) to more accurately place the actual peak within the bin.
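For reference, the phase-difference refinement I mean looks roughly like this (a sketch only; the hop size, the wrapping step, and the names are my own, not from any particular library):

```python
import numpy as np

def refine_bin_frequency(phase_prev, phase_curr, k, fft_size, hop, sample_rate):
    """Refine the frequency of FFT bin k from the phase change between two successive frames."""
    expected = 2.0 * np.pi * k * hop / fft_size              # phase advance of an exactly bin-centred sinusoid
    deviation = (phase_curr - phase_prev) - expected
    deviation = (deviation + np.pi) % (2.0 * np.pi) - np.pi  # wrap to [-pi, pi)
    return (k + deviation * fft_size / (2.0 * np.pi * hop)) * sample_rate / fft_size
```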
I am worried that this precision will be lost with this technique I'm presenting here.
I guess the technique could be used after the second FFT to get the fundamental accurately. But it kind of looks like the information is lost in step 2.
As this is a potentially tricky process, could someone with some experience just look over what I'm doing and check it for sanity?
Also, I've heard there is an alternative technique involving fitting a quadratic over neighbouring bins. Is this of comparable accuracy? If so, I would favour it, as it doesn't involve remembering bin phases.
So, questions:
- does this approach make sense? Can it be improved?
- I'm a bit worried about the "log square" component; there seems to be a vDSP function to do exactly that: vDSP_vdbcon. However, there is no indication it precalculates a log-table -- I assume it doesn't, as the FFT function requires an explicit pre-calculation function to be called and passed into it. And this function doesn't.
- Is there some danger of harmonics being picked up?
- is there any cunning way of making vDSP pull out the maxima, biggest first?
Can anyone point me towards some research or literature on this technique?
The main question: is it accurate enough? Can the accuracy be improved? I have just been told by an expert that the accuracy IS INDEED not sufficient. Is this the end of the line?
Pi
PS I get SO annoyed when I want to create tags, but cannot. :| I have suggested to the maintainers that SO keep track of attempted tags, but I'm sure I was ignored. We need tags for vDSP, accelerate framework, cepstral analysis
Okay, let's go through one by one:
Although I am not an expert and have had minimal formal training, I think I know the best answer to this problem. I've done a lot of searching, reading, and experimenting over the past few years. My conclusion is that the autocorrelation method is by far the best pitch detector in terms of the tradeoff between accuracy, complexity, noise robustness, and speed. Unless you have some very specific circumstances, I would almost always recommend using autocorrelation. More on this later; first let me answer your other questions.
What you describe is "cepstral analysis", which is a method mainly used for the extraction of pitch from speech. Cepstral analysis relies entirely on the plentifulness and strength of the overtones of your signal. If, for example, you were to pass a pure sine wave through cepstral analysis, you would get terrible results. However, for speech, which is a complex signal, there is a large number of overtones. (Overtones, by the way, are elements of the signal which oscillate at multiples of the fundamental frequency, i.e. the pitch we perceive.) Cepstral analysis can be robust in detecting speech with a missing fundamental frequency. That is, suppose you plotted the function sin(4x)+sin(6x)+sin(8x)+sin(10x). If you look at that, it is clear that it has the same frequency as the function sin(2x). However, if you apply Fourier analysis to this function, the bin corresponding to sin(2x) will have zero magnitude. Thus this signal is considered to have a "missing fundamental frequency", because it does not contain a sinusoid at the frequency we perceive it to have. Thus simply picking the biggest peak on the Fourier transform will not work on this signal.
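If you want to see that missing fundamental for yourself, a couple of NumPy lines will do (purely illustrative):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
signal = np.sin(4 * x) + np.sin(6 * x) + np.sin(8 * x) + np.sin(10 * x)
mags = np.abs(np.fft.rfft(signal))
print(mags[2])                # the bin of sin(2x): essentially zero
print(mags[[4, 6, 8, 10]])    # the overtones are all clearly present
```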
What you are describing is the phase vocoder technique to more accurately measure the frequency of a given partial. However, the basic technique of picking out the biggest bin is going to cause you problems if you use a signal with a missing or weak fundamental frequency component.
First of all, remember that the phase vocoder technique only more accurately measures the frequency of a single partial. It ignores the information contained in the higher partials about the fundamental frequency. Second of all, given a decent FFT size, you can get very good accuracy using peak interpolation. Someone else here has pointed you towards parabolic interpolation. I also would suggest this.
If you parabolically interpolate the FFT of a 4096-sample block of data at 44100 Hz, with a pitch around 440 Hz, that will mean it falls between the 40th (430.66 Hz) and 41st (441.43 Hz) bins. Assuming this paper is approximately correct in the general case, it says parabolic interpolation increases resolution by more than one order of magnitude. This brings the resolution to at least 1 Hz, which is the threshold of human hearing. In fact, if you use an ideal Gaussian window, parabolic interpolation is exact at the peaks. (That's right, exact. Remember, however, that you can never use a true Gaussian window, because it extends forever in both directions.) If you are still worried about getting higher accuracy, you can always pad the FFT. This means adding zeros to the end of the signal before transforming. It works out that this is equivalent to "sinc interpolation", which is the ideal interpolation function for frequency-limited signals.
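For reference, the parabolic fit itself is tiny; a sketch over three neighbouring log-magnitude bins (using log magnitudes here is my choice):

```python
import numpy as np

def parabolic_peak(log_mags, k):
    """Fit a parabola through bins k-1, k, k+1 and return the interpolated bin position and height."""
    a, b, c = log_mags[k - 1], log_mags[k], log_mags[k + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # lies in (-0.5, 0.5) when k is a true local maximum
    return k + offset, b - 0.25 * (a - c) * offset
```

Multiplying the interpolated bin position by sample_rate / fft_size then gives the refined frequency estimate.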
That is correct. The phase vocoder technique relies on the fact that sequential frames are connected and have a specific phase relationship. However, the log magnitude of the FFT of sequential frames does not show the same relationship in terms of phase, thus it would be useless to use this transform for the second FFT.
Yes and yes, I will elaborate on the improvement in my bit on autocorrelation at the end.
I don't know the specifics of the vDSP library, sorry.
In your original phase-vocoder peak picking technique? Yes. With the cepstral method? No, not really; the whole point is that it considers all the harmonics to get its frequency estimate. For example, let's say our frequency is 1. Our overtones are 2, 3, 4, 5, 6, 7, 8, 9, etc. We would have to take out all of the odd harmonics, i.e. leave 2, 4, 6, 8, etc., and remove the fundamental frequency, before it would start to be confused with one of its overtones.
Don't know vDSP, but in the general case, you usually just iterate over all of them and keep track of the biggest.
The link I gave you in a comment, P., seemed like a good one.
Also, this website offers an incredibly in-depth and wonderfully broad explanation of DSP topics, including all sorts of pitch extraction, manipulation, etc, in both a theoretical and practical way. (this is a more general link to an index on the site). I always find myself coming back to it. Sometimes it can be a bit overwhelming if you jump into the middle of it, but you can always follow every explanation back to the basic building blocks.
Now for autocorrelation. Basically the technique is this: You take your (windowed) signal and time delay it different amounts. Find the amount which matches up best with your original signal. That is the fundamental period. It makes a lot of theoretical sense. You are hunting for the repetitive parts of your signal.
In practice, taking the correlation with all these time delayed copies of the signal is slow. It is usually implemented in this way instead (which is mathematically equivalent):
Zero-pad it to double its original length. Take the FFT. Then replace all the coefficients with their square magnitude, except for the first, which you set to 0. Now take the IFFT. Divide every element by the first one. This gives you the autocorrelation. Mathematically, you are using the circular convolution theorem (look it up), and using zero-padding to convert a linear convolution problem into a circular convolution one, which can be solved efficiently.
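In NumPy terms, that recipe might look like this (the rfft/irfft choice is mine; everything else follows the steps above):

```python
import numpy as np

def autocorrelation(frame):
    """Zero-pad to double length, FFT, squared magnitudes with the first set to 0, IFFT, normalize."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame, 2 * n)) ** 2   # zero-pad to 2n, take the FFT, square magnitudes
    power[0] = 0.0                                   # set the first coefficient to 0
    acf = np.fft.irfft(power)[:n]                    # inverse FFT; keep lags 0 .. n-1
    return acf / acf[0]                              # divide every element by the first one
```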
However, be careful about picking the peak. For very small delays, the signal will match up with itself very well, simply because it is continuous. (I mean, if you delay it by zero, it correlates perfectly with itself.) Instead, pick the largest peak after the first zero-crossing. You can parabolically interpolate the autocorrelation function as well, just as with other techniques, to get much more accurate values.
This by itself will give you very good pitch detection by all criteria. However, you might sometimes encounter a problem with pitch halving and pitch doubling. Basically the problem is that if a signal is repetitive every 1 second, it is also repetitive every two seconds. Similarly, if it has a very strong overtone, you might get pitch halving. So the biggest peak might not always be the one you want. A solution to this problem is the MPM algorithm by Phillip McLeod. The idea is this:
Instead of picking the biggest peak, you want to pick the first peak that is large enough to be considered. How do you determine if a peak is large enough to be considered? If it is at least as high as A times the largest peak, where A is some constant; Phillip suggests a value of A around 0.9, I think. Actually, the program he wrote, Tartini, allows you to compare several different pitch detection algorithms in real time. I would strongly suggest downloading it and trying it out (it implements cepstrum, straight autocorrelation, and MPM). If you have trouble building, try the instructions here.
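Sketched in code, that peak choice might look like this (the first-zero-crossing rule and the 0.9-ish constant come from the description above; everything else, including the names, is just illustrative scaffolding):

```python
import numpy as np

def pick_period(acf, threshold=0.9):
    """Pick the first peak after the first zero crossing that is at least
    threshold * the largest peak, as described above. Returns a lag in samples."""
    start = np.argmax(acf < 0)                     # first lag where the autocorrelation goes negative
    if start == 0:
        return None                                # never crossed zero
    seg = acf[start:]
    # local maxima: strictly larger than both neighbours
    peaks = np.where((seg[1:-1] > seg[:-2]) & (seg[1:-1] > seg[2:]))[0] + 1
    if peaks.size == 0:
        return None
    limit = threshold * seg[peaks].max()
    for p in peaks:
        if seg[p] >= limit:
            return start + p                       # period in samples; f0 = sample_rate / period
    return None
```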
One last thing I should note is about windowing. In general, any smooth window will do. Hanning window, Hamming window, etc. Hopefully you should know how to window. I would also suggest doing overlapped windows if you want more accurate temporal measurements.
By the way, a cool property of the autocorrelation is that if the frequency is changing linearly through the windowed section you are measuring, it will give you the correct frequency at the center of the window.
One more thing: What I described is called the biased autocorrelation function. This is because for higher time lags, the overlap between the original signal and the time-lagged version becomes less and less. For example, if you look at a window of size N which has been delayed N-1 samples, you see that only one sample overlaps. So the correlation at this delay is clearly going to be very close to zero. You can compensate for this by dividing each value of the autocorrelation function by the number of samples that overlap to produce it. This is called the unbiased autocorrelation. However, in general, you will get worse results with this, as the higher delay values of the autocorrelation are very noisy, as they are based on only a few samples, so it makes sense to weigh them less.
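In code, that compensation is just a division by the overlap counts (a sketch; assumes acf holds lags 0 through N-1 of an N-sample window):

```python
import numpy as np

def unbiased(acf):
    """Divide each lag k by its overlap count N - k, as described above."""
    n = len(acf)
    return acf / np.arange(n, 0, -1)   # overlap counts: N, N-1, ..., 1
```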
If you're looking for more information, as always, google is your friend. Good search terms: autocorrelation, pitch detection, pitch tracking, pitch extraction, pitch estimation, cepstrum, etc.
This is a brief analysis of the Cepstrum used for pitch determination.
First let's examine a synthetic signal.
The plot below shows the Cepstrum of a synthetic steady-state E2 note, synthesized using a typical near-DC component, a fundamental at 82.4 Hz, and 8 harmonics at integer multiples of 82.4 Hz. The synthetic sinusoid was programmed to generate 4096 samples.
Observe the prominent non-DC peak at 12.36. The Cepstrum width is 1024 (the output of the second FFT), therefore the peak corresponds to 1024/12.36 = 82.8 Hz, which is very close to the true fundamental frequency of 82.4 Hz.
Now let's examine a real acoustical signal.
The plot below shows the Cepstrum of a real acoustic guitar's E2 note. The signal was not windowed prior to the first FFT. Observe the prominent non-DC peak at 542.9. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/542.9 = 60.4 Hz, which is fairly far from the true fundamental frequency of 82.4 Hz.
The plot below shows the Cepstrum of the same real acoustic guitar's E2 note, but this time the signal was Hann windowed prior to the first FFT. Observe the prominent non-DC peak at 268.46. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/268.46 = 122.1 Hz, which is even farther from the true fundamental frequency of 82.4 Hz.
The acoustic guitar E2 note used for this analysis was sampled at 44.1 kHz with a high-quality microphone under studio conditions; it contains essentially zero background noise, no other instruments or voices, and no post-processing.
This illustrates the significant challenge of using Cepstral analysis for pitch determination in real acoustical signals.
References:
Real audio signal data, synthetic signal generation, plots, FFT, and Cepstral analysis were done here: Musical instrument cepstrum
What's wrong with your existing technique that you're interested in a new one? I don't think a cepstrum is going to give you more accurate pitch, if that's the goal. It will, however, help you with suppressed fundamentals. I suppose you could use the cepstrum to get you close, then go back to the first FFT (which I would keep in its original form) and then apply your cunning technique to the bin that the cepstrum guides you to.
As for the quadratic fit, it's referred to in this paper by Ted Knowlton, which came up in another SO question recently, but I've never used it.
I should add that the quadratic fit technique, at least as outlined in the reference from Knowlton, depends on using a rectangular window on the first FFT. As Paul R explained in another of your questions, if you're doing audio processing you should use a Hann or Hamming window on the first FFT. So I guess an overall algorithm could look like:
- Take your signal x and make a windowed copy w.
- Compute Sx = FFT(x) and Sw = FFT(w).
- Compute c = log of square magnitude of Sw.
- Compute Cx = FFT(c).
- Use Cx to estimate the fundamental (and possibly harmonics).
- Use Sw to do the cunning phase trick on the fundamental (or higher harmonic) bin(s).
- Use Sx to do the quadratic bin fit around the fundamental (or higher harmonic).

The "(or higher harmonic)" note applies if you do indeed have suppressed fundamentals.

And I mentioned this in your other question, but what makes you think the log requires a lookup table? Why not just call the log function? I imagine that the time taken by the two FFTs (O(n*log n)) dwarfs any other processing you can do.
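For what it's worth, a rough NumPy sketch of the overall flow above (the window choice and the quefrency-to-bin mapping are my assumptions; the cepstrum peak is only used to decide which bin of Sx to refine, and I've used the quadratic fit rather than the phase trick for the last step):

```python
import numpy as np

def estimate_f0(x, sample_rate):
    """Cepstrum to get close, then a quadratic bin fit on the un-windowed FFT, as outlined above."""
    n = len(x)
    w = x * np.hanning(n)                          # windowed copy w of the original x
    Sx, Sw = np.fft.rfft(x), np.fft.rfft(w)        # Sx = FFT(x), Sw = FFT(w)
    c = np.log(np.abs(Sw) ** 2 + 1e-12)            # c = log of square magnitude of Sw
    Cx = np.abs(np.fft.rfft(c))                    # Cx = FFT(c)
    q = np.argmax(Cx[1:]) + 1                      # strongest non-DC quefrency peak
    k = int(round(len(c) / q))                     # rough fundamental bin in the first spectrum
    a, b, cc = np.log(np.abs(Sx[k - 1:k + 2]) + 1e-12)  # three bins around the estimate
    offset = 0.5 * (a - cc) / (a - 2 * b + cc)     # quadratic bin fit around the fundamental
    return (k + offset) * sample_rate / n
```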
Cepstrum analysis is a form of homomorphic processing, explained in the book "Discrete-Time Signal Processing" by Oppenheim & Schafer. It was once thought useful for separating out the exciter frequency from a formant envelope (maybe still is, dunno). It seems to work better when given a fairly long window of stationary data.
But Cepstral analysis is not meant for accuracy of frequency estimation. It's actually a lossy form of analysis. But it might be useful at finding the fundamental frequency from a train of harmonics where the fundamental frequency spectral component might be comparatively weak or even missing.
Phase vocoder analysis (not so cunning, as the technique has been around for maybe a half century) is better at frequency estimation for a given peak, assuming you pick the correct peak (not necessarily the strongest one), the peak spectrum is stationary across both FFT frames, and the fundamental isn't completely missing from the spectrum.
Quadratic or parabolic interpolation might be a good fit if the transform of your window function resembles a parabola. Sinc interpolation works better with rectangular windows.
This answer is meant to be read in addition to Jeremy Salwen's post, and also to answer the question regarding literatures.
First of all, it's important to consider the signal's periodicity: whether or not the signal is close to a fully periodic signal for a given analysis window.

Refer here for a detailed explanation of the term and the maths: https://en.wikipedia.org/wiki/Almost_periodic_function#Quasiperiodic_signals_in_audio_and_music_synthesis
The short answer is that if, for a given analysis window, the signal is fully periodic, or if the signal is quasi-periodic and the analysis window is small enough that periodicity is achieved, then autocorrelation is enough for the task.
Examples of signals that fulfill these conditions are:
Examples of signals that fail to fulfill these conditions are:
For pitch detection using autocorrelation there is a tutorial on how it is implemented in Praat:
A brief explanation of Praat's pitch detection algorithm. This describes the algorithm named 'ac'.
The paper describes in detail the use of unbiased autocorrelation (the term as used by Jeremy Salwen) for pitch detection, and it also shows that it is superior to biased autocorrelation for pitch detection. It notes, though, that the autocorrelation results are only significant up to half of the window size, so you don't need to calculate the latter half.
A biased autocorrelation is done by windowing the signal with a tapering window and then doing the autocorrelation. This reduces the effects of low-frequency modulation (amplitude change on a slow time scale) that is detrimental to pitch detection, since otherwise parts with larger amplitude would give larger autocorrelation coefficients and would be preferred.
The algorithm used in Boersma's paper can be described in 5 steps:
It's important to note that the window will go toward zero on both ends, and the autocorrelation of the window will also go towards zero. This is why the latter half of an unbiased autocorrelation is useless: it amounts to dividing by values that approach zero near the end of the window.
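As a sketch, the normalization described in the last two paragraphs (window the frame, autocorrelate, then divide by the window's own autocorrelation; the Hann window and the mean removal are my choices) could look like:

```python
import numpy as np

def normalized_autocorrelation(frame):
    """Autocorrelation of the windowed frame divided by the autocorrelation of the window itself.
    Only the first half of the lags is kept, since beyond that we would divide by values near zero."""
    n = len(frame)
    window = np.hanning(n)

    def acf(v):
        power = np.abs(np.fft.rfft(v, 2 * n)) ** 2
        return np.fft.irfft(power)[:n]

    r_xw = acf((frame - frame.mean()) * window)   # autocorrelation of the windowed, mean-removed signal
    r_w = acf(window)                             # autocorrelation of the window
    half = n // 2
    return (r_xw[:half] / r_xw[0]) / (r_w[:half] / r_w[0])
```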
Next is YIN:
- De Cheveigné, Alain, and Hideki Kawahara. "YIN, a fundamental frequency estimator for speech and music." The Journal of the Acoustical Society of America 111.4 (2002): 1917-1930.
As I understand it, the YIN paper also gives evidence that using a tapered window has detrimental effects on pitch detection accuracy. Interestingly, it prefers not to use any tapering window function (it says something to the effect that a tapering window does not bring any improvement to the results and instead complicates them).
Last is Philip McLeod's SNAC and WSNAC (already linked by Jeremy Salwen):
They can be found on miracle.otago.ac.nz/tartini/papers.html
I haven't read too far into it, but it is mentioned as a method for reducing the detrimental effects of the tapering window in biased autocorrelation, in a way that differs from the method used by Boersma.
(note that I haven't come across anything about MPM so I can't say anything about it)
One last suggestion: if you're making an instrument tuner, a method that is easier and gives somewhat better results than autocorrelation is cross-correlation with a pure sinusoidal signal of a predetermined frequency.
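A sketch of that idea (my own framing: sweep a grid of candidate frequencies and correlate against a complex sinusoid so the score does not depend on phase alignment; the grid itself is arbitrary):

```python
import numpy as np

def tuner_estimate(frame, sample_rate, candidates):
    """Return the candidate frequency whose pure sinusoid correlates best with the frame."""
    t = np.arange(len(frame)) / sample_rate
    scores = [np.abs(np.sum(frame * np.exp(-2j * np.pi * f * t))) for f in candidates]
    return candidates[int(np.argmax(scores))]

# e.g. around the low E string: tuner_estimate(frame, 44100, np.arange(75.0, 90.0, 0.05))
```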
Jeremy Salwen:
I would like to argue that although the given signal is periodic at ω = 2, that is not the same as it having the same frequency as the function sin(2x), as Fourier analysis will show that the component sin(2x) has zero magnitude.

This is related to the point that there is a relation between pitch, frequency, and the fundamental frequency of a signal, but they are different and not interchangeable. It is important to remember that pitch is a subjective measurement; it depends on the human who perceives it.

It looks as though it has the same frequency as sin(2x); that's how we perceive it visually.

The same effect also happens with pitch and audio perception. The example that comes to mind immediately is beats: the perceived pitch heard when there are two sinusoids with close but different frequencies.