How can I compare two voice samples on iOS?
First of all, I'd like to state that my question is not, per se, about the "classic" definition of voice recognition.
What we are trying to do is somewhat different, in the sense of:
- User records his command
- Later, when the user speaks the pre-recorded command, a certain action occurs.
For example, I record a voice command for calling my mom, so I click on her and say "Mom".
Then when I use the program and say "Mom", it will automatically call her.
How would I perform the comparison of a spoken command to a saved voice sample?
EDIT:
We have no need for any "text-to-speech" abilities, solely a comparison of sound signals.
Obviously we're looking for some sort of off-the-shelf product or framework.
5 Answers
One way this is done for music recognition is to take a time sequence of frequency spectra (time-windowed FFTs, i.e. an STFT) for the two sounds in question, map the locations of the frequency peaks over the time axis, and cross-correlate the two 2D time-frequency peak maps for a match. This is far more robust than just cross-correlating the two raw sound samples, as the peaks change far less than all the spectral "cruft" between the spectral peaks. This method works better if the speaking rate and pitch of the two utterances haven't changed too much.
In iOS 4.x, you can use the Accelerate framework for the FFTs, and perhaps for the 2D cross-correlations as well.
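Below is a minimal pure-Swift sketch of that idea, not taken from the answer itself: it builds a crude peak map using a naive O(n²) DFT for readability (in a real app you would substitute the Accelerate/vDSP FFT mentioned above) and scores two maps by sliding one over the other along the time axis. The function names and the frame/hop/peak-count values are illustrative assumptions.

```swift
import Foundation

// Magnitude spectrum of one frame via a naive O(n^2) DFT, kept simple for
// illustration; swap in vDSP's FFT from Accelerate for real use.
func magnitudeSpectrum(_ frame: [Float]) -> [Float] {
    let n = frame.count
    return (0..<n / 2).map { k in
        var re: Float = 0, im: Float = 0
        for (i, x) in frame.enumerated() {
            let phase = -2 * Float.pi * Float(k) * Float(i) / Float(n)
            re += x * cos(phase)
            im += x * sin(phase)
        }
        return sqrt(re * re + im * im)
    }
}

// Time-frequency "peak map": for each windowed frame, mark the strongest
// frequency bins with 1 and everything else with 0.
func peakMap(_ samples: [Float], frameSize: Int = 512, hop: Int = 256,
             peaksPerFrame: Int = 3) -> [[Float]] {
    var map: [[Float]] = []
    var start = 0
    while start + frameSize <= samples.count {
        let spectrum = magnitudeSpectrum(Array(samples[start..<start + frameSize]))
        let topBins = spectrum.enumerated()
            .sorted { $0.element > $1.element }
            .prefix(peaksPerFrame)
            .map { $0.offset }
        var row = [Float](repeating: 0, count: spectrum.count)
        for bin in topBins { row[bin] = 1 }
        map.append(row)
        start += hop
    }
    return map
}

// Slide one peak map over the other along the time axis and return the best
// normalized overlap -- a rough similarity score in [0, 1].
func peakMapSimilarity(_ a: [[Float]], _ b: [[Float]]) -> Float {
    guard !a.isEmpty, !b.isEmpty else { return 0 }
    let peaksPerRow = max(b[0].filter { $0 > 0 }.count, 1)
    var best: Float = 0
    for lag in -(b.count - 1)..<a.count {
        var overlap: Float = 0
        for t in 0..<b.count {
            let ta = t + lag
            guard ta >= 0, ta < a.count else { continue }
            for (pa, pb) in zip(a[ta], b[t]) { overlap += pa * pb }
        }
        best = max(best, overlap)
    }
    return best / Float(min(a.count, b.count) * peaksPerRow)
}
```

You would compare `peakMapSimilarity(peakMap(recorded), peakMap(spoken))` against a threshold you tune empirically for your users and recording conditions.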
Try using a third-party library, like OpenEars, for iOS applications. You could have users record a voice sample and save the recognized text, or just let them type in the text to recognize.
I think you'd have to perform some sort of cross-correlation to determine how similar these two signals are (assuming it's the same user speaking, of course). I'm just typing this answer out to see if it helps, but I'd wait for a better answer from someone else. My signal processing skills are close to zero.
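As a rough illustration of that suggestion (not part of the original answer), here is a minimal Swift sketch of a normalized cross-correlation taken over all lags; bear in mind, as the spectral-peak answer above notes, that correlating raw waveforms is fragile to changes in timing and pitch. The function names are illustrative.

```swift
import Foundation

// Normalized cross-correlation of two signals at a single lag.
// Values near 1 mean the overlapping parts line up closely.
func normalizedCrossCorrelation(_ x: [Float], _ y: [Float], lag: Int) -> Float {
    var sum: Float = 0, energyX: Float = 0, energyY: Float = 0
    for i in 0..<y.count {
        let j = i + lag
        guard j >= 0, j < x.count else { continue }
        sum += x[j] * y[i]
        energyX += x[j] * x[j]
        energyY += y[i] * y[i]
    }
    let denom = sqrt(energyX * energyY)
    return denom > 0 ? sum / denom : 0
}

// Crude similarity score: the best correlation over every possible alignment.
func bestCorrelation(_ x: [Float], _ y: [Float]) -> Float {
    guard !x.isEmpty, !y.isEmpty else { return 0 }
    return (-(y.count - 1)..<x.count)
        .map { normalizedCrossCorrelation(x, y, lag: $0) }
        .max() ?? 0
}
```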
I'm not sure whether your question is about the DSP or about how to do it on the iPhone. If it's the latter, I would start with the SpeakHere sample project that Apple provides. That way you already have a working interface for recording the voice to a file, which will save you a lot of trouble.
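If you'd rather wire up the recording yourself instead of adapting SpeakHere, a minimal AVAudioRecorder sketch looks roughly like the following; the file name, format settings, and class name are illustrative assumptions, and you still need the usual microphone permission.

```swift
import AVFoundation

// Records a short voice command to a temporary file, which is roughly what
// the SpeakHere sample wraps in its UI. Requires microphone permission
// (NSMicrophoneUsageDescription plus a runtime permission request).
final class CommandRecorder {
    private var recorder: AVAudioRecorder?

    func startRecording() throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .default, options: [])
        try session.setActive(true)

        let url = FileManager.default.temporaryDirectory
            .appendingPathComponent("command.caf")
        let settings: [String: Any] = [
            AVFormatIDKey: kAudioFormatLinearPCM,   // uncompressed PCM
            AVSampleRateKey: 16_000,                // 16 kHz is plenty for speech
            AVNumberOfChannelsKey: 1,
            AVLinearPCMBitDepthKey: 16
        ]
        recorder = try AVAudioRecorder(url: url, settings: settings)
        _ = recorder?.record()
    }

    // Stops recording and returns the URL of the captured file.
    func stopRecording() -> URL? {
        recorder?.stop()
        return recorder?.url
    }
}
```

Call `startRecording()` when the user taps record and `stopRecording()` when they finish, then feed the resulting file into whichever comparison approach you choose.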
I'm using ViSQOL for this purpose. The docs say it works best with a short sample, ideally 5-10 seconds. You also need to prepare the files in terms of sample rate, and they need to be .wav files. You can easily convert your files to the desired format with the ffmpeg library.
https://github.com/google/visqol
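For the conversion step, something along these lines should work (the 16 kHz mono target here is an assumption based on ViSQOL's speech mode, and the file names are placeholders; check the repo's README for the rates it actually expects): `ffmpeg -i command.m4a -ar 16000 -ac 1 command.wav`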