With limited resources such as slower CPUs, code size and RAM, how best to detect the pitch of a musical note, similar to what an electronic or software tuner would do?
Should I use:
- Kiss FFT
- FFTW
- Discrete Wavelet Transform
- autocorrelation
- zero crossing analysis
- octave-spaced filters
other?
In a nutshell, what I am trying to do is to recognize a single musical note, from two octaves below middle C to two octaves above, played on any (reasonable) instrument. I'd like to be within 20% of a semitone - in other words, if the user plays too flat or too sharp, I need to distinguish that. However, I will not need the accuracy required for tuning.
If you don't need that much accuracy, an FFT could be sufficient. Window the chunk of audio first so that you get well-defined peaks, then find the first significant peak.
Bin width = sampling rate / FFT size:
Fundamentals range from 20 Hz to 7 kHz, so a sampling rate of 14 kHz would be enough. The next "standard" sampling rate is 22050 Hz.
The FFT size is then determined by the precision you want. FFT output is linear in frequency, while musical tones are logarithmic in frequency, so the worst case precision will be at low frequencies. For 20% of a semitone at 20 Hz, you need a width of 1.2 Hz, which means an FFT length of 18545. The next power of two is 2^15 = 32768. This is 1.5 seconds of data, and takes my laptop's processor 3 ms to calculate.
This won't work with signals that have a "missing fundamental", and finding the "first significant" peak is somewhat tricky (since harmonics often have a higher amplitude than the fundamental), but you can figure out a way that suits your situation.
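For illustration, here is a minimal NumPy sketch of the windowed-FFT approach, using the 22050 Hz / 32768-point sizing above; the function name and the "first local maximum above a threshold" rule are placeholders for whatever peak-picking actually suits your signals:

```python
import numpy as np

def fft_pitch(samples, fs=22050, n_fft=32768, threshold=0.1):
    """Estimate pitch as the first spectral peak above a fraction of the maximum.

    `threshold` is an arbitrary illustrative value; real peak-picking needs tuning.
    """
    chunk = np.asarray(samples[:n_fft], dtype=float)
    chunk = chunk * np.hanning(len(chunk))           # window so peaks are well defined
    spectrum = np.abs(np.fft.rfft(chunk, n=n_fft))   # zero-pads if the chunk is short
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # bin width = fs / n_fft

    floor = threshold * spectrum.max()
    # "First significant peak": the first local maximum above the floor
    for i in range(1, len(spectrum) - 1):
        if spectrum[i] > floor and spectrum[i] >= spectrum[i - 1] and spectrum[i] > spectrum[i + 1]:
            return freqs[i]
    return None
```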
Autocorrelation and harmonic product spectrum are better at finding the true fundamental of a wave rather than one of its harmonics, but I don't think they handle inharmonicity as well, and instruments like piano or guitar are inharmonic (their harmonics are slightly sharp of the exact integer multiples). It really depends on your circumstances, though.
Also, you can save even more processor cycles by computing only within a specific frequency band of interest, using the Chirp-Z transform.
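As a sketch of that band-limited idea, assuming SciPy ≥ 1.8 is available (its scipy.signal.zoom_fft evaluates the spectrum over a chosen band via the chirp z-transform); the band edges and point count below are arbitrary examples:

```python
import numpy as np
from scipy.signal import zoom_fft   # SciPy >= 1.8

def band_spectrum(samples, fs=22050, f_lo=60.0, f_hi=1100.0, m=4096):
    """Magnitude spectrum evaluated only between f_lo and f_hi (chirp z-transform)."""
    windowed = np.asarray(samples, dtype=float) * np.hanning(len(samples))
    spectrum = np.abs(zoom_fft(windowed, [f_lo, f_hi], m=m, fs=fs))
    freqs = np.linspace(f_lo, f_hi, m, endpoint=False)  # matches zoom_fft's default spacing
    return freqs, spectrum
```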
I've written up a few different methods in Python for comparison purposes.
If you want to do pitch recognition in realtime (and accurate to within 1/100 of a semi-tone), your only real hope is the zero-crossing approach. And it's a faint hope, sorry to say. Zero-crossing can estimate pitch from just a couple of wavelengths of data, and it can be done with a smartphone's processing power, but it's not especially accurate, as tiny errors in measuring the wavelengths result in large errors in the estimated frequency. Devices like guitar synthesizers (which deduce the pitch from a guitar string with just a couple of wavelengths) work by quantizing the measurements to notes of the scale. This may work for your purposes, but be aware that zero-crossing works great with simple waveforms, but tends to work less and less well with more complex instrument sounds.
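For what it's worth, a bare-bones sketch of the zero-crossing idea looks something like this (NumPy, with linear interpolation between samples; the function name is illustrative, and it will be fooled by harmonic-rich sounds exactly as described above):

```python
import numpy as np

def zero_crossing_pitch(samples, fs):
    """Rough pitch estimate from the spacing of rising zero crossings.

    Linear interpolation refines each crossing time; works for simple, clean
    waveforms and degrades quickly on complex instrument sounds.
    """
    x = np.asarray(samples, dtype=float)
    x = x - np.mean(x)                                   # remove any DC offset
    idx = np.nonzero((x[:-1] < 0) & (x[1:] >= 0))[0]     # rising crossings
    if len(idx) < 2:
        return None
    # Fractional crossing positions via linear interpolation between samples
    t = idx + x[idx] / (x[idx] - x[idx + 1])
    period = np.mean(np.diff(t)) / fs                    # average period in seconds
    return 1.0 / period
```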
In my application (a software synthesizer that runs on smartphones) I use recordings of single instrument notes as the raw material for wavetable synthesis, and in order to produce notes at a particular pitch, I need to know the fundamental pitch of a recording, accurate to within 1/1000 of a semi-tone (I really only need 1/100 accuracy, but I'm OCD about this). The zero-crossing approach is much too inaccurate for this, and FFT-based approaches are either way too inaccurate or way too slow (or both sometimes).
The best approach that I've found in this case is to use autocorrelation. With autocorrelation you basically guess the pitch and then measure the autocorrelation of your sample at that corresponding wavelength. By scanning through the range of plausible pitches (say A = 55 Hz thru A = 880 Hz) by semi-tones, I locate the most-correlated pitch, then do a more finely-grained scan in the neighborhood of that pitch to get a more accurate value.
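In outline, such a coarse-then-fine scan might look like the following sketch (the function names, the 10-cent fine step, and the assumption that the buffer holds at least a few periods of the lowest candidate pitch are illustrative choices, not anything from the synthesizer described above):

```python
import numpy as np

def correlation_at(samples, fs, freq):
    """Normalized correlation of a 1-D array with itself shifted by one period of freq."""
    lag = int(round(fs / freq))                     # integer lag limits resolution up high
    a, b = samples[:-lag], samples[lag:]
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

def autocorr_pitch(samples, fs, f_lo=55.0, f_hi=880.0):
    samples = np.asarray(samples, dtype=float)
    # Coarse scan: one candidate per semitone from f_lo to f_hi
    semis = np.arange(0, 12 * np.log2(f_hi / f_lo) + 1)
    coarse = f_lo * 2.0 ** (semis / 12.0)
    best = max(coarse, key=lambda f: correlation_at(samples, fs, f))
    # Fine scan: +/- half a semitone around the winner, in ~10-cent steps
    cents = np.arange(-50, 51, 10)
    fine = best * 2.0 ** (cents / 1200.0)
    return max(fine, key=lambda f: correlation_at(samples, fs, f))
```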
The approach best for you depends entirely on what you're trying to use this for.
I'm not familiar with all the methods you mention, but what you choose should depend primarily on the nature of your input data. Are you analysing pure tones, or does your input source have multiple notes? Is speech a feature of your input? Are there any limitations on the length of time you have to sample the input? Are you able to trade off some accuracy for speed?
To some extent what you choose also depends on whether you would like to perform your calculations in time or in frequency space. Converting a time series to a frequency representation takes time, but in my experience tends to give better results.
Autocorrelation compares two signals in the time domain. A naive implementation is simple but relatively expensive to compute, as it requires pair-wise differencing between all points in the original and time-shifted signals, followed by differentiation to identify turning points in the autocorrelation function, and then selection of the minimum corresponding to the fundamental frequency. There are alternative methods. For example, Average Magnitude Differencing is a very cheap form of autocorrelation, but accuracy suffers. All autocorrelation techniques run the risk of octave errors, since peaks other than the fundamental exist in the function.
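As a point of reference, a minimal sketch of the Average Magnitude Difference idea (pick the lag with the smallest mean absolute difference; the search range and names are illustrative):

```python
import numpy as np

def amdf_pitch(samples, fs, f_lo=65.0, f_hi=1050.0):
    """Average Magnitude Difference Function on a 1-D array: the fundamental
    shows up as a deep minimum at a lag of one period.  Prone to octave
    errors, like all lag-based methods."""
    samples = np.asarray(samples, dtype=float)
    min_lag = int(fs / f_hi)
    max_lag = int(fs / f_lo)
    lags = np.arange(min_lag, max_lag + 1)
    amdf = np.array([np.mean(np.abs(samples[:-lag] - samples[lag:])) for lag in lags])
    return fs / lags[np.argmin(amdf)]
```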
Measuring zero-crossing points is simple and straightforward, but will run into problems if you have multiple waveforms present in the signal.
In frequency-space, techniques based on FFT may be efficient enough for your purposes. One example is the harmonic product spectrum technique, which compares the power spectrum of the signal with downsampled versions at each harmonic, and identifies the pitch by multiplying the spectra together to produce a clear peak.
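A compact sketch of the harmonic product spectrum (downsample the magnitude spectrum by 2, 3, ..., multiply, and take the strongest surviving bin; the choice of five harmonics is arbitrary):

```python
import numpy as np

def hps_pitch(samples, fs, n_harmonics=5):
    """Harmonic product spectrum: multiply the spectrum by its downsampled
    copies so that only the fundamental lines up across all harmonics."""
    windowed = np.asarray(samples, dtype=float) * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        decimated = spectrum[::h]                    # crude downsampling by h
        hps[:len(decimated)] *= decimated
    # Only bins where all harmonics contributed are meaningful; skip DC as well
    valid = len(spectrum) // n_harmonics
    peak = 1 + int(np.argmax(hps[1:valid]))
    return peak * fs / len(samples)                  # bin index -> frequency
```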
As ever, there is no substitute for testing and profiling several techniques, to empirically determine what will work best for your problem and constraints.
An answer like this can only scratch the surface of the topic; the pitch-detection literature is well worth further reading.
In my project danstuner, I took code from Audacity. It essentially takes an FFT, then finds the peak power by fitting a cubic curve to the FFT output and locating the peak of that curve. It works pretty well, although I had to guard against octave-jumping.
See Spectrum.cpp.
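Spectrum.cpp itself is not reproduced here, but the general idea of refining an FFT peak by fitting a curve through the bins around the maximum looks roughly like this (a three-point parabolic fit on log magnitude as a simpler stand-in for the cubic fit; any octave-jump guard would sit on top of this):

```python
import numpy as np

def interpolated_peak_freq(samples, fs):
    """FFT peak refined by a parabolic fit through the three bins around the
    maximum.  Assumes non-zero magnitudes around the peak (true for windowed
    real audio)."""
    windowed = np.asarray(samples, dtype=float) * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    k = int(np.argmax(spectrum[1:-1])) + 1           # skip the DC and Nyquist bins
    # Parabolic interpolation on log magnitude gives the sub-bin peak offset
    a, b, c = np.log(spectrum[k - 1 : k + 2])
    offset = 0.5 * (a - c) / (a - 2 * b + c)         # in bins, between -0.5 and 0.5
    return (k + offset) * fs / len(samples)
```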
Zero crossing won't work by itself, because a typical sound contains harmonics and therefore has many more zero crossings per cycle than the fundamental alone would produce.
Something I experimented with (as a home side project) was this:
However, I found that with input from my electronic keyboard, for some instrument sounds it picked up 2× the base frequency (the next octave up). This was a side project and I never got around to implementing a solution before moving on to other things. But I thought it had promise, since it should be much less CPU load than an FFT.