It depends greatly on the musical content you want to work with - extracting the pitch of a monophonic recording (i.e. single instrument or voice) is not the same as extracting the pitch of a single instrument from a polyphonic mixture (e.g. extracting the pitch of the melody from a polyphonic recording).
For monophonic pitch extraction there are various algorithms you could try to implement, in both the time domain and the frequency domain. A couple of examples are YIN (time domain) and HPS, the Harmonic Product Spectrum (frequency domain); further details on both can be found on Wikipedia.
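To make the frequency-domain option concrete, here is a minimal NumPy sketch of the HPS idea (the function name and parameters are mine, and a real implementation would add peak interpolation and octave-error checks):

```python
import numpy as np

def hps_pitch(frame, sample_rate, num_harmonics=5, fmin=50.0):
    # Window the frame to reduce spectral leakage, then take the magnitude spectrum.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Multiply the spectrum by decimated copies of itself so the harmonics
    # of the true fundamental all line up on the fundamental's bin.
    hps = spectrum.copy()
    for h in range(2, num_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    # Ignore bins below fmin so DC and low-frequency rumble can't win the peak pick.
    min_bin = max(1, int(fmin * len(frame) / sample_rate))
    peak_bin = min_bin + int(np.argmax(hps[min_bin:len(spectrum) // num_harmonics]))
    return peak_bin * sample_rate / len(frame)
```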
However, neither will work well if you want to extract the melody from polyphonic material. Melody extraction from polyphonic music is still a research problem, and there isn't a simple set of steps you can follow. There are some tools provided by the research community that you can try out, though for non-commercial use only.
As a final note, when synthesizing your output I'd recommend synthesizing the continuous pitch curve that you extract. The easiest way to do this is to estimate the pitch every X ms (e.g. 10) and synthesize a sine wave that changes frequency every 10 ms while keeping the phase continuous. This will make your result sound a lot more natural, and you avoid the extra error involved in quantizing a continuous pitch curve into discrete notes (which is another problem in its own right).
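A minimal sketch of that resynthesis step, assuming one pitch estimate per 10 ms hop and f0 = 0 marking unvoiced frames (the function name and defaults are illustrative):

```python
import numpy as np

def synthesize_pitch_curve(f0_hz, hop_ms=10.0, sample_rate=44100):
    hop = int(sample_rate * hop_ms / 1000.0)
    out = np.zeros(len(f0_hz) * hop)
    phase = 0.0
    n = np.arange(hop)
    for i, f0 in enumerate(f0_hz):
        if f0 <= 0:
            continue  # unvoiced/silent frame: emit silence, keep the phase
        # Accumulate phase across frames so there is no click when the
        # frequency changes at a frame boundary.
        out[i * hop:(i + 1) * hop] = np.sin(phase + 2 * np.pi * f0 * n / sample_rate)
        phase = (phase + 2 * np.pi * f0 * hop / sample_rate) % (2 * np.pi)
    return out
```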
You probably don't want to be picking peaks from an FFT to calculate the pitch. You probably want to use autocorrelation instead. I wrote up a long answer to a very similar question here: Cepstral Analysis for pitch detection
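For illustration, a bare-bones autocorrelation pitch estimate might look like the sketch below (the linked answer explains why this works; a serious detector adds normalization, thresholding, and parabolic interpolation):

```python
import numpy as np

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=1000.0):
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # A lag of L samples corresponds to a pitch of sample_rate / L Hz,
    # so restrict the search to lags inside the plausible pitch range.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sample_rate / best_lag
```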
Your method might work for synthetic music whose notes are synchronized to fit your FFT frame timing and length, and which uses only note sounds whose complete spectrum is compatible with your HPS pitch estimator. None of that is true for common music.
For the more general case, automatic music transcription still seems to be a research problem, with no simple 5-step solution. Pitch is a human psycho-acoustic phenomenon: people will hear notes that may or may not be present in the local spectrum. The HPS pitch estimation algorithm is much more reliable than picking the largest FFT peak, but it can still fail for many kinds of musical sounds. Also, the FFT of any frame that crosses note boundaries or transients may contain no clear single pitch to estimate.
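A small self-contained demo of that psycho-acoustic point: the tone below is heard as 220 Hz even though the biggest FFT peak sits at 440 Hz, and an HPS-style product recovers the perceived pitch (all amplitudes are invented for illustration):

```python
import numpy as np

sr, n = 44100, 4096
t = np.arange(n) / sr
tone = (0.1 * np.sin(2 * np.pi * 220 * t)      # weak fundamental
        + 1.0 * np.sin(2 * np.pi * 440 * t)    # strong 2nd harmonic
        + 0.8 * np.sin(2 * np.pi * 660 * t))   # strong 3rd harmonic
spec = np.abs(np.fft.rfft(tone * np.hanning(n)))

print(np.argmax(spec) * sr / n)  # naive FFT peak: ~440 Hz, the wrong octave

# HPS-style product: multiply by 2x- and 3x-decimated copies so all three
# partials align only on the 220 Hz bin.
hps = spec.copy()
for h in (2, 3):
    d = spec[::h]
    hps[:len(d)] *= d
min_bin = 1
peak_bin = min_bin + int(np.argmax(hps[min_bin:len(spec) // 3]))
print(peak_bin * sr / n)         # ~220 Hz, the pitch a listener hears
```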
Your approach will not work for any general musical example, for the following reasons:
1. Music by its very nature is dynamic: every sound present in music is modulated by distinct periods of silence, attack, sustain, decay, and silence again, otherwise known as the envelope of the sound.
2. Musical instrument notes and human vocal notes cannot be properly synthesized by a single tone. These notes must be synthesized from a fundamental tone plus many harmonics.
3. However, it is not sufficient to synthesize only the fundamental tone and the harmonics of an instrumental or vocal note; one must also synthesize the envelope of the note, as described in 1 above.
4. Furthermore, to synthesize a melodic passage, whether instrumental or vocal, one must synthesize items 1-3 above for every note in the passage, and one must also synthesize the timing of every note relative to the beginning of the passage (a toy sketch of points 1-4 follows this list).
5. Analytically extracting individual instruments or human voices from a final mix recording is itself a very difficult problem, and your approach doesn't address it, so it cannot properly address issues 1-4.
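Here is a toy sketch of points 1-4: one note built from a fundamental plus harmonics, shaped by a crude linear ADSR envelope. All amplitudes and timings are invented for illustration, not measured from any real instrument:

```python
import numpy as np

def synth_note(f0, dur_s, sample_rate=44100,
               harmonic_amps=(1.0, 0.5, 0.33, 0.25),  # made-up partial levels
               adsr=(0.02, 0.05, 0.7, 0.1)):          # attack s, decay s, sustain level, release s
    n = int(dur_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Points 2-3: fundamental plus harmonics.
    tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))
    # Point 1: piecewise-linear envelope (silence-attack-decay-sustain-release).
    a_s, d_s, s_level, r_s = adsr
    na, nd, nr = (int(x * sample_rate) for x in (a_s, d_s, r_s))
    env = np.full(n, s_level)
    if na: env[:na] = np.linspace(0.0, 1.0, na)
    if nd: env[na:na + nd] = np.linspace(1.0, s_level, nd)
    if nr: env[-nr:] = np.linspace(s_level, 0.0, nr)
    return tone * env

# Point 4: a passage is notes placed at their own onset times, e.g. mixing
# synth_note(440, 0.5) starting at t=0 with synth_note(494, 0.5) at t=0.5.
```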
In short, any approach that attempts to extract a near perfect musical transcription from the final mix of a musical recording, by using strict analytical methods, is at worst almost certainly doomed to failure, and at best falls in the realm of advanced research.
How to proceed from this impasse depends on the purpose of the work, something the OP didn't mention.
Will this work be used in a commercial product, or is it a hobby project?
If it's a commercial work, various further approaches are warranted (costly or very costly ones), but the details of those approaches depend on the goals of the work.
As a closing note, your synthesis sounds like random beeps due to the following:
Your fundamental tone detector is tied to the timing of your rolling FFT frames, which in effect generates a probably fake fundamental tone at the start-time of each and every rolling FFT frame.
Why are the detected fundamental tones probably fake? Because you're arbitrarily clipping the musical sample into (FFT) frames, and are therefore probably truncating many concurrently sounding notes somewhere mid-note, thereby distorting the spectral signatures of the notes (a small demo of this follows below).
You're not trying to synthesize the envelopes of the detected notes, nor can you, because there's no way to obtain envelope information based on your analysis.
Therefore, the synthesized result is probably a series of pure sine chirps, spaced in time by the rolling FFT frame's delta-t. Each chirp may be of a different frequency, with a different envelope magnitude, and with envelopes that are probably rectangular in shape.
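A tiny demo of the frame-boundary issue described above, with two invented notes deliberately split across one analysis frame:

```python
import numpy as np

# First half of the frame is an A4 (440 Hz), second half a C5 (~523 Hz).
# The frame's spectrum smears both notes together, so any single-pitch
# estimate for this frame is essentially arbitrary.
sr, n = 44100, 2048
t = np.arange(n // 2) / sr
frame = np.concatenate([np.sin(2 * np.pi * 440.0 * t),
                        np.sin(2 * np.pi * 523.25 * t)])
spec = np.abs(np.fft.rfft(frame * np.hanning(n)))
print(np.argmax(spec) * sr / n)  # lands near 440 or 523; neither note
                                 # dominates, and neither is "the" pitch
```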
To see the complex nature of musical notes, take a look at these references:
Musical instrument spectra to 102.4 KHz
Musical instrument note spectra and their time-domain envelopes
Observe in particular the many pure tones that make up each note, and the complex shape of each note's time-domain envelope. The variable timing of multiple notes relative to each other is another essential aspect of music, as is polyphony (multiple voices sounding concurrently).
All of these elements of music conspire to make a strict analytical approach to autonomous musical transcription extremely challenging.