It depends greatly on the musical content you want to work with - extracting the pitch of a monophonic recording (i.e. single instrument or voice) is not the same as extracting the pitch of a single instrument from a polyphonic mixture (e.g. extracting the pitch of the melody from a polyphonic recording).
For monophonic pitch extraction there are various algorithms you could try to implement, in both the time domain and the frequency domain. A couple of examples are YIN (time domain) and HPS, the Harmonic Product Spectrum (frequency domain); further details on both can be found on Wikipedia.
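To make the frequency-domain option concrete, here is a minimal NumPy sketch of the HPS idea (the function name and parameters are mine, and a real implementation would add peak interpolation and octave-error checks):

```python
import numpy as np

def hps_pitch(frame, sample_rate, num_harmonics=5, fmin=50.0):
    # Window the frame to reduce spectral leakage, then take the magnitude spectrum.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Multiply the spectrum by decimated copies of itself so the harmonics
    # of the true fundamental all line up on the fundamental's bin.
    hps = spectrum.copy()
    for h in range(2, num_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    # Ignore bins below fmin so DC and low-frequency rumble can't win the peak pick.
    min_bin = max(1, int(fmin * len(frame) / sample_rate))
    peak_bin = min_bin + int(np.argmax(hps[min_bin:len(spectrum) // num_harmonics]))
    return peak_bin * sample_rate / len(frame)
```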
However, neither will work well if you want to extract the melody from polyphonic material. Melody extraction from polyphonic music is still a research problem, and there isn't a simple set of steps you can follow. There are some tools provided by the research community that you can try out, though for non-commercial use only.
As a final note, when synthesizing your output I'd recommend synthesizing the continuous pitch curve that you extract. The easiest way to do this is to estimate the pitch every X ms (e.g. 10) and synthesize a sine wave that changes frequency every 10 ms while keeping the phase continuous. This will make your result sound a lot more natural, and you avoid the extra error involved in quantizing a continuous pitch curve into discrete notes (which is another problem in its own right).
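A minimal sketch of that resynthesis step, assuming one pitch estimate per 10 ms hop and f0 = 0 marking unvoiced frames (the function name and defaults are illustrative):

```python
import numpy as np

def synthesize_pitch_curve(f0_hz, hop_ms=10.0, sample_rate=44100):
    hop = int(sample_rate * hop_ms / 1000.0)
    out = np.zeros(len(f0_hz) * hop)
    phase = 0.0
    n = np.arange(hop)
    for i, f0 in enumerate(f0_hz):
        if f0 <= 0:
            continue  # unvoiced/silent frame: emit silence, keep the phase
        # Accumulate phase across frames so there is no click when the
        # frequency changes at a frame boundary.
        out[i * hop:(i + 1) * hop] = np.sin(phase + 2 * np.pi * f0 * n / sample_rate)
        phase = (phase + 2 * np.pi * f0 * hop / sample_rate) % (2 * np.pi)
    return out
```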
You probably don't want to be picking peaks from an FFT to calculate the pitch. You probably want to use autocorrelation instead. I wrote up a long answer to a very similar question here: Cepstral Analysis for pitch detection
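For illustration, a bare-bones autocorrelation pitch estimate might look like the sketch below (the linked answer explains why this works; a serious detector adds normalization, thresholding, and parabolic interpolation):

```python
import numpy as np

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=1000.0):
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # A lag of L samples corresponds to a pitch of sample_rate / L Hz,
    # so restrict the search to lags inside the plausible pitch range.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(ac) - 1)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sample_rate / best_lag
```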
Your method might work for synthetic music whose notes are synchronized to fit your FFT frame timing and length, and which uses only note sounds whose complete spectrum is compatible with your HPS pitch estimator. None of that is true for common music.
For the more general case, automatic music transcription still seems to be a research problem, with no simple 5-step solution. Pitch is a human psycho-acoustic phenomenon: people will hear notes that may or may not be present in the local spectrum. The HPS pitch estimation algorithm is much more reliable than picking the largest FFT peak, but it can still fail for many kinds of musical sounds. Also, the FFT of any frame that crosses note boundaries or transients may contain no clear single pitch to estimate.
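A small self-contained demo of that psycho-acoustic point: the tone below is heard as 220 Hz even though the biggest FFT peak sits at 440 Hz, and an HPS-style product recovers the perceived pitch (all amplitudes are invented for illustration):

```python
import numpy as np

sr, n = 44100, 4096
t = np.arange(n) / sr
tone = (0.1 * np.sin(2 * np.pi * 220 * t)      # weak fundamental
        + 1.0 * np.sin(2 * np.pi * 440 * t)    # strong 2nd harmonic
        + 0.8 * np.sin(2 * np.pi * 660 * t))   # strong 3rd harmonic
spec = np.abs(np.fft.rfft(tone * np.hanning(n)))

print(np.argmax(spec) * sr / n)  # naive FFT peak: ~440 Hz, the wrong octave

# HPS-style product: multiply by 2x- and 3x-decimated copies so all three
# partials align only on the 220 Hz bin.
hps = spec.copy()
for h in (2, 3):
    d = spec[::h]
    hps[:len(d)] *= d
min_bin = 1
peak_bin = min_bin + int(np.argmax(hps[min_bin:len(spec) // 3]))
print(peak_bin * sr / n)         # ~220 Hz, the pitch a listener hears
```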
Your approach will not work for any general musical example, for the following reasons:
1. Music by its very nature is dynamic: every sound present in music is modulated by distinct periods of silence, attack, sustain, decay, and silence again, otherwise known as the envelope of the sound.
2. Musical instrument notes and human vocal notes cannot be properly synthesized by a single tone. These notes must be synthesized from a fundamental tone plus many harmonics.
3. However, it is not sufficient to synthesize only the fundamental tone and the harmonics of an instrumental or vocal note; one must also synthesize the envelope of the note, as described in 1 above.
4. Furthermore, to synthesize a melodic passage, whether instrumental or vocal, one must synthesize items 1-3 above for every note in the passage, and one must also synthesize the timing of every note relative to the beginning of the passage (a toy sketch of points 1-4 follows this list).
5. Analytically extracting individual instruments or human voices from a final mix recording is itself a very difficult problem, and your approach doesn't address it, so it cannot properly address issues 1-4.
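Here is a toy sketch of points 1-4: one note built from a fundamental plus harmonics, shaped by a crude linear ADSR envelope. All amplitudes and timings are invented for illustration, not measured from any real instrument:

```python
import numpy as np

def synth_note(f0, dur_s, sample_rate=44100,
               harmonic_amps=(1.0, 0.5, 0.33, 0.25),  # made-up partial levels
               adsr=(0.02, 0.05, 0.7, 0.1)):          # attack s, decay s, sustain level, release s
    n = int(dur_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Points 2-3: fundamental plus harmonics.
    tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))
    # Point 1: piecewise-linear envelope (silence-attack-decay-sustain-release).
    a_s, d_s, s_level, r_s = adsr
    na, nd, nr = (int(x * sample_rate) for x in (a_s, d_s, r_s))
    env = np.full(n, s_level)
    if na: env[:na] = np.linspace(0.0, 1.0, na)
    if nd: env[na:na + nd] = np.linspace(1.0, s_level, nd)
    if nr: env[-nr:] = np.linspace(s_level, 0.0, nr)
    return tone * env

# Point 4: a passage is notes placed at their own onset times, e.g. mixing
# synth_note(440, 0.5) starting at t=0 with synth_note(494, 0.5) at t=0.5.
```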
In short, any approach that attempts to extract a near perfect musical transcription from the final mix of a musical recording, by using strict analytical methods, is at worst almost certainly doomed to failure, and at best falls in the realm of advanced research.
How to proceed from this impasse depends on the purpose of the work, something the OP didn't mention.
Will this work be used in a commercial product, or is it a hobby project?
If it's a commercial work, various further approaches are warranted (costly or very costly ones), but the details of those approaches depend on the goals of the work.
As a closing note, your synthesis sounds like random beeps due to the following:
Your fundamental tone detector is tied to the timing of your rolling FFT frames, which in effect generates a probably fake fundamental tone at the start-time of each and every rolling FFT frame.
Why are the detected fundamental tones probably fake? Because you're arbitrarily clipping the musical sample into (FFT) frames, and are therefore probably truncating many concurrently sounding notes somewhere mid-note, thereby distorting the spectral signatures of the notes (a small demo of this follows below).
You're not trying to synthesize the envelopes of the detected notes, nor can you, because there's no way to obtain envelope information based on your analysis.
Therefore, the synthesized result is probably a series of pure sine chirps, spaced in time by the rolling FFT frame's delta-t. Each chirp may be of a different frequency, with a different envelope magnitude, and with envelopes that are probably rectangular in shape.
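A tiny demo of the frame-boundary issue described above, with two invented notes deliberately split across one analysis frame:

```python
import numpy as np

# First half of the frame is an A4 (440 Hz), second half a C5 (~523 Hz).
# The frame's spectrum smears both notes together, so any single-pitch
# estimate for this frame is essentially arbitrary.
sr, n = 44100, 2048
t = np.arange(n // 2) / sr
frame = np.concatenate([np.sin(2 * np.pi * 440.0 * t),
                        np.sin(2 * np.pi * 523.25 * t)])
spec = np.abs(np.fft.rfft(frame * np.hanning(n)))
print(np.argmax(spec) * sr / n)  # lands near 440 or 523; neither note
                                 # dominates, and neither is "the" pitch
```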
To see the complex nature of musical notes, take a look at these references:
Musical instrument spectra to 102.4 KHz
Musical instrument note spectra and their time-domain envelopes
Observe in particular the many pure tones that make up each note, and the complex shape of each note's time-domain envelope. The variable timing of multiple notes relative to each other is another essential aspect of music, as is polyphony (multiple voices sounding concurrently).
All of these elements of music conspire to make a strict analytical approach to autonomous musical transcription extremely challenging.