This is exactly what I'm doing here as my final-year project :) except that my project is about tracking the pitch of the human singing voice (and I don't have a robot to play the tune).
The quickest way I can think of is to use the BASS library. It contains a ready-to-use function that can give you FFT data from the default recording device. Take a look at the "livespec" code example that comes with BASS.
By the way, raw FFT data will not be enough to determine the fundamental frequency. You need an algorithm such as the Harmonic Product Spectrum to get the F0.
Another consideration is the audio source. If you are going to do an FFT and apply the Harmonic Product Spectrum to it, you will need to make sure the input has only one audio source. If it contains multiple sources, as in modern songs, there will be too many frequencies to consider.
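BASS is a C library, so here is a rough Python stand-in for what its "livespec" example does, not the BASS API itself: grab a block from the default recording device (using the sounddevice package, my choice) and take its FFT with NumPy.

    import numpy as np
    import sounddevice as sd

    FS = 44100     # sample rate in Hz
    N = 4096       # FFT size

    block = sd.rec(N, samplerate=FS, channels=1, dtype='float64')
    sd.wait()                                    # block until the recording finishes
    spectrum = np.abs(np.fft.rfft(block[:, 0]))  # magnitude spectrum
    freqs = np.fft.rfftfreq(N, d=1.0 / FS)       # frequency of each FFT bin

    # As noted above, the strongest bin is not necessarily the F0;
    # something like the Harmonic Product Spectrum is still needed.
    print(freqs[np.argmax(spectrum)], "Hz is the strongest bin")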
Harmonic Product Spectrum Theory (from http://cnx.org/content/m11714/latest/)
If the input signal is a musical note, then its spectrum should consist of a series of peaks, corresponding to the fundamental frequency with harmonic components at integer multiples of the fundamental frequency. Hence, when we compress the spectrum a number of times (downsampling) and compare it with the original spectrum, we can see that the strongest harmonic peaks line up. The first peak in the original spectrum coincides with the second peak in the spectrum compressed by a factor of two, which coincides with the third peak in the spectrum compressed by a factor of three. Hence, when the various spectra are multiplied together, the result forms a clear peak at the fundamental frequency.
Method
First, we divide the input signal into segments by applying a Hanning window, where the window size and hop size are given as inputs. For each window, we use the Short-Time Fourier Transform to convert the input signal from the time domain to the frequency domain. Once the input is in the frequency domain, we apply the Harmonic Product Spectrum technique to each window.
The HPS involves two steps: downsampling and multiplication. To downsample, we compress the spectrum twice in each window by resampling: the first time, we compress the original spectrum by a factor of two and, the second time, by a factor of three. Once this is completed, we multiply the three spectra together and find the frequency that corresponds to the peak (maximum value). This particular frequency represents the fundamental frequency of that window.
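Here is a minimal NumPy sketch of the windowing + HPS steps just described; the window size, hop size and number of compressed spectra are only example values:

    import numpy as np

    def hps_pitch(frame, fs, n_spectra=3):
        """Estimate the pitch of one frame with the Harmonic Product Spectrum."""
        windowed = frame * np.hanning(len(frame))
        spectrum = np.abs(np.fft.rfft(windowed))
        hps = spectrum.copy()
        # compress by factors 2..n_spectra and multiply with the original
        for k in range(2, n_spectra + 1):
            compressed = spectrum[::k]            # keep every k-th bin
            hps[:len(compressed)] *= compressed
        # only search the region that every compressed copy covers
        peak_bin = np.argmax(hps[:len(spectrum) // n_spectra])
        return peak_bin * fs / len(frame)         # bin index -> Hz

    # toy usage: a 440 Hz tone with a few harmonics, cut into windows
    fs, window, hop = 44100, 4096, 2048
    t = np.arange(fs) / fs
    signal = sum(np.sin(2 * np.pi * 440 * h * t) / h for h in range(1, 5))
    pitches = [hps_pitch(signal[i:i + window], fs)
               for i in range(0, len(signal) - window + 1, hop)]
    print(pitches[0])   # ~441 Hz: the true 440 Hz, quantised to the nearest FFT bin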
Limitations of the HPS method
Some nice features of this method include: it is computationally inexpensive, reasonably resistant to additive and multiplicative noise, and adjustable to different kinds of inputs. For instance, we could change the number of compressed spectra to use, and we could replace the spectral multiplication with a spectral addition. However, since human pitch perception is basically logarithmic, low pitches may be tracked less accurately than high pitches.
Another severe shortfall of the HPS method is that its resolution is only as good as the length of the FFT used to calculate the spectrum. If we perform a short and fast FFT, we are limited in the number of discrete frequencies we can consider. In order to gain a higher resolution in our output (and therefore see less graininess in our pitch output), we need to take a longer FFT, which requires more time.
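To put a number on that resolution limit (the sample rate and FFT length below are just example figures):

    fs, n_fft = 44100, 4096
    print(fs / n_fft)   # ~10.8 Hz between FFT bins
    # Near A1 (55 Hz) adjacent semitones are only ~3.3 Hz apart, so this FFT
    # cannot separate them; resolving low notes needs a longer FFT (and more delay).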
Just a comment: the fundamental harmonic may well be missing from a (harmonic) sound; this doesn't change the perceived pitch. As a limit case, if you take a square wave (say, a C# note) and completely suppress the first harmonic, the perceived note is still C#, in the same octave. In a way, our brain is able to compensate for the absence of some harmonics, even the first, when it guesses a note. Hence, to detect a pitch with frequency-domain techniques you should take into account all the harmonics (local maxima in the magnitude of the Fourier transform) and extract some sort of "greatest common divisor" of their frequencies. Pitch detection is not a trivial problem at all...
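To sketch that "greatest common divisor" idea (the candidate range, step and tolerance below are arbitrary choices, and a real detector would need proper peak picking first):

    import numpy as np

    def f0_from_peaks(peak_freqs, f0_min=50.0, f0_max=1000.0, step=0.5, tol=0.01):
        """Crude common divisor of spectral peak frequencies: the F0 whose
        integer multiples best line up with the observed peaks."""
        peaks = np.asarray(peak_freqs, dtype=float)
        candidates = np.arange(f0_min, f0_max, step)
        # mean distance (as a fraction of F0) of each peak from its nearest harmonic
        errs = np.array([np.mean(np.abs(peaks / f0 - np.round(peaks / f0)))
                         for f0 in candidates])
        # sub-harmonics (F0/2, F0/3, ...) fit just as well, so among the
        # near-best candidates take the largest, then refine against the peaks
        rough = candidates[errs <= errs.min() + tol].max()
        harmonics = np.round(peaks / rough)
        return float(np.mean(peaks / harmonics))

    # peaks at 880, 1320, 1760 Hz: the 440 Hz fundamental itself is absent,
    # yet the common divisor of the harmonics is still ~440 Hz
    print(f0_from_peaks([880.0, 1320.0, 1760.0]))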
DAFX has about 30 pages dedicated to pitch detection, with examples and Matlab code.
Try YAAPT pitch tracking, which detects the fundamental frequency in both the time and frequency domains. You can download the Matlab source code from the link and look for peaks in the FFT output using the spectral processing part.

There is also a Python package: http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html#
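If you go the Python route, the usage is roughly as below; the module and attribute names follow the package's published example, but treat them as assumptions and check the linked docs:

    # Illustrative pYAAPT usage (AMFM_decompy); the file name is a placeholder.
    import amfm_decompy.pYAAPT as pYAAPT
    import amfm_decompy.basic_tools as basic

    signal = basic.SignalObj('my_recording.wav')   # mono WAV file
    pitch = pYAAPT.yaapt(signal)                   # run the YAAPT tracker with defaults
    print(pitch.samp_values)                       # per-frame F0 in Hz (0 where unvoiced)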
Did you try Wikipedia's article on pitch detection? It contains a few references that may be of interest to you.

In addition, here's a list of DSP applications and libraries where you can poke around. The list only mentions Linux software packages, but many of them are cross-platform, and there's a lot of source code you can look at.
Just FYI, detecting the pitch of the notes in a monophonic recording is within reach of most DSP-savvy people. Detecting the pitches of all notes, including chords and stuff, is a lot harder.
Just a thought - but do you need to process a digital audio stream as input?
If not, consider using a symbolic representation of music (such as MIDI). The pitches of the notes will then be stated explicitly, and you can synthesize sounds (and movements) corresponding to the pitch, rhythm and many other musical parameters extremely easily.
If you need to analyse a digital audio stream (mp3, wav, live input, etc) bear in mind that while pitch detection of simple monophonic sounds is quite advanced, polyphonic pitch detection is an unsolved problem. In this case, you may find my answer to this question helpful.
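To illustrate the symbolic route: in a MIDI file the note numbers are already explicit, and converting a note number to a pitch is a one-line formula. The sketch below uses the mido package, which is just one possible choice (the file name is a placeholder):

    import mido

    def midi_note_to_hz(note):
        # Equal temperament, A4 (MIDI note 69) = 440 Hz
        return 440.0 * 2 ** ((note - 69) / 12)

    for msg in mido.MidiFile('melody.mid'):
        if msg.type == 'note_on' and msg.velocity > 0:
            print(msg.note, round(midi_note_to_hz(msg.note), 1), "Hz")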
For extracting the fundamental frequency of the melody from polyphonic music, you could try the MELODIA plug-in: http://mtg.upf.edu/technologies/melodia

Extracting the F0's of all the instruments in a song (multi-F0 tracking) or transcribing them into notes is an even harder task. Both melody extraction and music transcription are still open research problems, so regardless of the algorithm/tool you use, don't expect to obtain perfect results for either.
If you're trying to detect the notes of a polyphonic recording (multiple notes at the same time) good luck. That's a very tricky problem. I don't know of any way to listen to, say, a recording of a string quartet and have an algorithm separate the four voices. (Wavelets maybe?) If it's just one note at a time, there are several pitch tracking algorithms out there, many of them mentioned in other comments.
The algorithm you want to use will depend on the type of music you are listening to. If you want it to pick up people singing, there are a lot of good algorithms out there designed specifically for voice. (That's where most of the research is.) If you are trying to pick up specific instruments, you'll have to be a bit more creative. Voice algorithms can be simple because the range of the human singing voice is generally limited to about 100-2000 Hz (the speaking range is much narrower). The fundamental frequencies on a piano, however, go from about 27 Hz to 4200 Hz, so you're dealing with a wider range usually ignored by voice pitch detection algorithms.
The waveform of most instruments is going to be fairly complex, with lots of harmonics, so a simple approach like counting zero crossings or just taking the autocorrelation won't work. If you knew roughly what frequency range you were looking in, you could low-pass filter and then count zero crossings. I'd think you'd be better off, though, with a more complex algorithm such as the Harmonic Product Spectrum mentioned by another user, YAAPT ("Yet Another Algorithm for Pitch Tracking"), or something similar.
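A quick sketch of the "low-pass filter and then count zero crossings" idea; the cutoff assumes you already know the pitch sits well below it, and it only behaves on clean monophonic input:

    import numpy as np
    from scipy.signal import butter, filtfilt

    def zero_crossing_pitch(x, fs, cutoff=300.0):
        """Very rough pitch estimate: low-pass to isolate the fundamental,
        then count sign changes (two zero crossings per period)."""
        b, a = butter(4, cutoff / (fs / 2), btype='low')   # 4th-order Butterworth low-pass
        y = filtfilt(b, a, x)
        s = np.signbit(y)
        crossings = np.count_nonzero(s[1:] != s[:-1])
        return crossings / 2 / (len(x) / fs)               # crossings per second / 2 = Hz

    fs = 44100
    t = np.arange(fs) / fs
    tone = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 6))  # 220 Hz + harmonics
    print(zero_crossing_pitch(tone, fs))   # ~220 Hz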
One last problem: some instruments, the piano in particular, have the problems of missing fundamentals and inharmonicity. Missing fundamentals can be dealt with by the pitch tracking algorithms... in fact they have to be, since fundamentals are often cut out in electronic transmission... though you'll probably still get some octave errors. Inharmonicity, however, will give you problems if somebody plays a note in the bottom octaves of the piano. Normal pitch tracking algorithms aren't designed to deal with inharmonicity because the human voice is not significantly inharmonic.
You basically need a spectrum analyzer. You might be able to do an FFT on a recording of an analog input, but much depends on the resolution of the recording.
Autocorrelation - http://en.wikipedia.org/wiki/Autocorrelation
Zero-crossing - http://en.wikipedia.org/wiki/Zero_crossing (this method is used in cheap guitar tuners)
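A small sketch of the autocorrelation approach for a monophonic signal (the F0 search range is an arbitrary example):

    import numpy as np

    def autocorr_pitch(x, fs, f_min=50.0, f_max=1000.0):
        """Pitch via autocorrelation: the lag with the strongest self-similarity
        (within the allowed range) is one period of the waveform."""
        x = x - np.mean(x)
        corr = np.correlate(x, x, mode='full')[len(x) - 1:]   # lags 0 .. len(x)-1
        lag_min, lag_max = int(fs / f_max), int(fs / f_min)
        best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
        return fs / best_lag

    fs = 44100
    t = np.arange(2048) / fs
    tone = sum(np.sin(2 * np.pi * 330 * h * t) / h for h in range(1, 5))  # 330 Hz (E4) + harmonics
    print(autocorr_pitch(tone, fs))   # ~330 Hz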
What immediately comes to my mind:

I am not sure if that works for very polyphonic sounds - maybe googling for "FFT, analysis, melody etc." will return more info on possible problems.

Regards