Fundamental frequency + voice detection in C#
I'm trying to detect voice through input from the microphone in real time.
I already receive the input, run an FFT on it, and have the result in dB. I have the frequency domain, the time domain, and a spectrogram.
How can I get the fundamental frequency?
If I get the fundamental frequency, can I say that when it falls between certain values, what we are hearing is voice?
Is there any other way to do this with what I already have?
Thanks in advance
Comments (2)
There are many different algorithms for frequency estimation, and the right one to use depends on what you're doing. What kinds of input do you expect? What do you want to do with that input? What kind of processing power do you have?
Detecting the fundamental frequency isn't going to help you identify whether a specific person is talking, if that's what you're trying to do. The frequency of your voice changes constantly. You'd have to make a "fingerprint" of the person's formants, etc.
Simply finding the peak of the FFT isn't going to give you good results for voice. Look into cepstral analysis.
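As a rough illustration of what cepstral analysis looks like in practice, here is a minimal C# sketch. It assumes you already have a full-length linear magnitude spectrum from your FFT (all N bins, not just the first half). The class name `CepstrumPitch`, the 80 to 400 Hz search range, and the `1e-12` offset are assumptions made for this example, not something from the thread. It evaluates the real cepstrum directly with a cosine sum, which is slow but keeps the example free of any FFT library.

```csharp
using System;

public static class CepstrumPitch
{
    // Estimates the fundamental frequency (Hz) from a full-length linear
    // magnitude spectrum (magnitudes.Length == FFT size, bins 0..N-1).
    // Returns 0 if the search range is empty.
    // minHz/maxHz are assumed defaults for speech, not values from the thread.
    public static double Estimate(double[] magnitudes, double sampleRate,
                                  double minHz = 80.0, double maxHz = 400.0)
    {
        int n = magnitudes.Length;

        // Log-magnitude spectrum; the small offset avoids log(0).
        var logMag = new double[n];
        for (int k = 0; k < n; k++)
            logMag[k] = Math.Log(magnitudes[k] + 1e-12);

        // Quefrency range (lag in samples) that corresponds to [minHz, maxHz].
        int qMin = Math.Max(1, (int)Math.Floor(sampleRate / maxHz));
        int qMax = Math.Min(n / 2, (int)Math.Ceiling(sampleRate / minHz));

        int bestQ = 0;
        double bestValue = double.NegativeInfinity;
        for (int q = qMin; q <= qMax; q++)
        {
            // Real cepstrum at quefrency q: inverse DFT of the log spectrum.
            // The log spectrum of a real signal is symmetric, so only the
            // cosine terms contribute.
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += logMag[k] * Math.Cos(2.0 * Math.PI * k * q / n);

            double c = sum / n;
            if (c > bestValue) { bestValue = c; bestQ = q; }
        }

        // A peak at lag q samples corresponds to a pitch of sampleRate / q.
        return bestQ > 0 ? sampleRate / bestQ : 0.0;
    }
}
```

Since dB values are just a scaled log magnitude, the peak position would be the same if you fed your dB spectrum in directly instead of taking the logarithm again. For real-time use you would probably replace the inner cosine sum with an inverse FFT from whatever FFT code you already have.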
Take the highest peak on the spectrogram that's within the range for voice (say, 400 Hz to 10 kHz). That should give you the fundamental frequency.
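A minimal sketch of that peak pick in C#, assuming one spectrum value per bin for bins 0 to fftSize/2. The class and method names are made up for this example. The band limits are left as parameters on purpose: typical adult speaking pitch is usually quoted as roughly 85 to 255 Hz, so you may want to try a lower band than the one suggested here.

```csharp
using System;

public static class SpectrumPeak
{
    // Returns the frequency (Hz) of the strongest bin between minHz and maxHz.
    // spectrum holds one value per bin for bins 0..fftSize/2 (magnitude or dB;
    // only the ordering matters). Returns -1 if no bin falls inside the band.
    public static double StrongestFrequency(double[] spectrum, double sampleRate,
                                            int fftSize, double minHz, double maxHz)
    {
        double binWidth = sampleRate / fftSize;   // Hz covered by one bin
        int first = Math.Max(0, (int)Math.Ceiling(minHz / binWidth));
        int last = Math.Min(spectrum.Length - 1, (int)Math.Floor(maxHz / binWidth));

        int best = -1;
        for (int k = first; k <= last; k++)
            if (best < 0 || spectrum[k] > spectrum[best]) best = k;

        return best < 0 ? -1.0 : best * binWidth;
    }
}
```

The result is only as precise as the bin width (sampleRate / fftSize), so a 1024-point FFT at 44.1 kHz resolves about 43 Hz per bin.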
Alternatively, you may need to integrate a histogram of frequencies. This is because some words start with or contain sibilants ("s" sounds) and fricatives ("f" and "th" sounds), which have fairly high frequencies and a broad spectrum. You don't want to miss the start of speech because it started with something other than a vowel.
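One way to read "integrate" here is to sum the power over the whole band instead of looking at a single peak, so that broadband sounds such as sibilants still register. The sketch below works on a spectrum in dB with one value per bin for bins 0 to fftSize/2. The names are hypothetical, and the threshold you compare the result against is something you would have to tune for your own microphone.

```csharp
using System;

public static class BandEnergy
{
    // Sums the power of all bins between minHz and maxHz from a spectrum
    // given in dB and returns the total in dB. Returns -Infinity when the
    // band contains no energy or no bins.
    public static double BandPowerDb(double[] spectrumDb, double sampleRate,
                                     int fftSize, double minHz, double maxHz)
    {
        double binWidth = sampleRate / fftSize;
        double power = 0.0;
        for (int k = 0; k < spectrumDb.Length; k++)
        {
            double hz = k * binWidth;
            if (hz < minHz || hz > maxHz) continue;
            power += Math.Pow(10.0, spectrumDb[k] / 10.0);   // dB -> linear power
        }
        return 10.0 * Math.Log10(power);
    }
}
```

A frame would then count as candidate speech when `BandPowerDb(...)` exceeds a level you pick, for example some margin above the value measured while the room is quiet.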
Another factor is what else you would pick up besides voice. Is there a lot of background noise? What kind? If there isn't any, then just the presence of sound is enough. If, for example, there's music, then you have a whole different challenge. If you're trying to distinguish between voice and some other sounds, then I'd be tempted to try a neural network approach, since the problem is likely to need that level of complexity.
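For the quiet-room case, where the presence of sound is enough, a simple level check on the time-domain samples may do. This sketch assumes frames of float samples normalized to the range -1 to 1. The class name and the -40 dB default threshold are assumptions for illustration only.

```csharp
using System;

public static class LevelGate
{
    // RMS level of one frame in dB relative to full scale.
    // Returns -Infinity for an empty or completely silent frame.
    public static double RmsDb(float[] frame)
    {
        if (frame.Length == 0) return double.NegativeInfinity;

        double sumSquares = 0.0;
        foreach (float s in frame) sumSquares += (double)s * s;

        double rms = Math.Sqrt(sumSquares / frame.Length);
        return 20.0 * Math.Log10(rms);
    }

    // True when the frame is louder than the (assumed) threshold.
    public static bool IsSoundPresent(float[] frame, double thresholdDb = -40.0)
        => RmsDb(frame) > thresholdDb;
}
```

In practice you would probably require several consecutive loud frames before reporting speech, so that a single click or pop does not trigger it.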