FFT 算法:什么输入/输出? (回复:实时音高检测)
我正在尝试从音频流中提取音高数据。据我所知,FFT 似乎是最好使用的算法。
有人可以帮助我理解这个 FFT 算法的作用,而不是直接深入数学吗?
请不要说“FFT 从原始信号中提取频率数据”之类的明显内容。我需要更详细的信息。
我传入什么,传出什么?
一旦我清楚地理解了接口,这将有助于我理解实现。
我认为我需要传入一个音频缓冲区,我需要告诉它每次计算要使用多少字节(比如该缓冲区中的最新 1024 字节)。也许我需要指定我希望它检测的音高范围。现在它要回传什么?频率仓数组?这些是什么?
(编辑:)我找到了一个可以使用的 C++ 算法(如果我只能理解它的话)
Performous 从麦克风中提取音调。而且代码是开源的。以下是该算法的编码人员对该算法功能的描述。
- PCM 输入(带缓冲)
- FFT(一次 1024 个样本,之后从缓冲区前面删除 200 个样本)
- 重新分配方法(相对于之前的 FFT 为 200 个样本)
- 峰值过滤(这部分可以做得更好,甚至可以做得更好)省略)
- 将峰值组合成谐波组(我们将组合称为音调)
- 音调的时间过滤(更新之前检测到的音调集,而不是简单地使用新检测到的音调)
- 选择最佳的声音音调(频率限制,权重,可以也使用谐波阵列,但我不认为我们这样做)
但是有人可以帮助我理解它是如何工作的吗?从 FFT 发送到重新分配方法的是什么?
I am attempting to extract pitch data from an audio stream. From what I can see, it looks as though FFT is the best algorithm to use.
Rather than digging straight into the math, could someone help me understand what this FFT algorithm does?
Please don't say something obvious like 'FFT extracts frequency data from a raw signal.' I need the next level of detail.
What do I pass in, and what do I get out?
Once I understand the interface clearly, this will help me to understand the implementation.
I take it I need to pass in an audio buffer, I need to tell it how many bytes to use for each computation (say the most recent 1024 bytes from this buffer). and maybe I need to specify the range of pitches I want it to detect. Now it is going to pass back what? An array of frequency bins? What are these?
(Edit:) I have found a C++ algorithm to use (if I can only understand it)
Performous extracts pitch from the microphone. Also the code is open source. Here is a description of what the algorithm does, from the guy that coded it.
- PCM input (with buffering)
- FFT (1024 samples at a time, remove 200 samples from front of the buffer afterwards)
- Reassignment method (against the previous FFT that was 200 samples earlier)
- Filtering of peaks (this part could be done much better or even left out)
- Combining peaks into sets of harmonics (we call the combination a tone)
- Temporal filtering of tones (update the set of tones detected earlier instead of simply using the newly detected ones)
- Pick the best vocal tone (frequency limits, weighting, could use the harmonic array also but I don't think we do)
But could someone help me understand how this works? What is it that is getting sent from the FFT to the Reassignment method?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
FFT 只是该过程中的一个组成部分,它可能不是音调检测的最佳方法。阅读音高检测并决定您要首先使用哪种算法(这取决于您到底想要测量什么音高 - 语音、单一乐器、其他类型的声音等。在进入低音之前先做好这一点级别细节,例如 FFT(一些但不是所有音高检测算法在内部使用 FFT)
已经有许多类似的问题,例如 使用 FFT 进行实时音高检测 和 使用 FFT 进行小号的音调检测,维基百科上有很好的概述材料 a> 等 - 阅读这些内容,然后决定是否仍要推出自己的基于 FFT 的解决方案,或者使用适合您的特定应用程序的现有库。
The FFT is just one building block in the process, and it may not be the best approach for pitch detection. Read up on pitch detection and decide which algo you want to use first (this will depend on what exactly you are trying to measure the pitch of - speech, single musical instrument, other types of sound, etc. Get this right before getting into low level details such as the FFT (some, but not all pitch detection algorithms use the FFT internally).
There are numerous similar questions on SO already, e.g. Real-time pitch detection using FFT and Pitch detection using FFT for trumpet, and there is good overview material on Wikipedia etc - read these and then decide whether you still want to roll your own FFT-based solution or perhaps use an existing library which is suitable for your particular application.
这里有一个选择的因素。最简单的实现方法是输入 (2^n 个样本) 复数,然后输出 2^n 个复数,所以也许您应该从这里开始。
在 DCT(离散余弦变换)的特殊情况下,通常输入的是 2^n 个样本(通常是浮点数),输出的是 2^n 个值,通常也是浮点数。 DCT 是一种 FFT,但仅采用实数值,并根据余弦分析函数。
定义一个结构来处理复杂值是明智的(但通常会被跳过)。传统上,FFT 是就地完成的,但如果不这样做,它也能正常工作。
实例化一个包含 FFT 工作缓冲区的类(如果您不想就地进行 FFT)并将其重新用于多个 FFT 可能很有用。
There is an element of choice here. The most straightforward to implement is to do (2^n samples in) complex numbers in, and 2^n complex numbers out, so maybe you should start with that.
In the special case of a DCT(discrete cosine transform), typically what goes in is 2^n samples (often floats), and out go 2^n values, often floats too. DCT is an FFT but that takes only the real values, and analyses the function in terms of cosines.
It is smart (but commonly skipped) to define a struct to handle the complex values. Traditionally FFT's are done in-place, but it works fine if you don't.
It can be useful to instantiate a class that contains a work buffer for the FFT (if you don't want to do the FFT in-place), and reuse that for several FFTs.
输入 N 个 PCM 样本(纯实复数)。结果是频域的 N 个 bin(每个 bin 对应于采样率的 1/N 切片)。每个 bin 都是一个复数。这些值通常应以极坐标格式(绝对值和参数)处理,而不是实部和虚部。绝对值表示箱中心频率附近的声音量,而参数表示相位(正弦波在哪个位置传播)。
大多数情况下,编码器仅使用幅度(绝对值)并丢弃相位角(参数)。
In goes N samples of PCM (purely real complex numbers). Out comes N bins of frequency domain (each bin corresponding to a 1/N slice of sample rate). Each bin is a complex number. Rather than real and imaginary parts, these values should generally be handled in polar format (absolute value and argument). The absolute value tells the amount of sound near the bin center frequency while the argument tells the phase (at which position the sine wave is travelling).
Most often coders only use the magnitude (absolute value) and throw away the phase angle (argument).