"Voice trigger" detection
I have a voice application that would be much improved by the ability to use a "trigger word" to start recording audio. I don't need a full speech-to-text engine, just the ability to detect the trigger word reliably and efficiently.
I am wondering whether any specialized speech engines support this specific use case, or whether there are libraries or methods for developing such a single-purpose detection engine. Ideally it would work in noisy environments, and it is acceptable for it to be trained on a single user's voice.
Pointers to research papers or topics would also be appreciated, so that I know what to ask for.
Comments (5)
A colleague of mine on the Red5 project created a similar demo that used trigger words to run a search against an image repository. Saying "cat" caused an image of a cat to appear within about a second. The client application was written in Flash, and the back end ran on Red5 using the free Sphinx library. You could certainly do what you want with Sphinx without much effort.
Sphinx project: http://cmusphinx.sourceforge.net/sphinx4/
Okay, I could be completely off, but using a full-featured speech recognition library may be overkill for your use case.
If you can live with something simpler that is still audio-driven, consider this:
Detecting a hand clap is very simple. A hand clap has high energy across the whole audio band, so detecting it is much cheaper computationally than full-blown speech recognition.
In a nutshell, you record the audio, run a short-time FFT on the data, and detect the case where 80% of the available frequency bins have high energy. Requiring only 80% of the bins, rather than all of them, leaves room for phasing issues caused by a simple recording room or microphone setup. Then adjust the threshold to taste and you're done.
Doing the same with speech recognition is possible as well, but you will burn tons of CPU cycles.
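A minimal sketch of the clap detector described above, in Python with only the standard library. The function names, the frame size of 64 samples, the hop of 32 samples, and the per-bin power threshold of 0.5 are all assumptions made for illustration; a real application would use an FFT library and tune the threshold against its own microphone.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power for bins 1 .. N/2 - 1 (DC and Nyquist are skipped)."""
    n = len(frame)
    powers = []
    for k in range(1, n // 2):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        powers.append(abs(acc) ** 2)
    return powers


def is_clap_frame(frame, bin_threshold, min_fraction=0.8):
    """True if at least min_fraction of the bins exceed bin_threshold."""
    powers = power_spectrum(frame)
    if not powers:
        return False  # frame too short to have any usable bins
    loud = sum(1 for p in powers if p > bin_threshold)
    return loud >= min_fraction * len(powers)


def detect_claps(samples, frame_size=64, hop=32,
                 bin_threshold=0.5, min_fraction=0.8):
    """Return the start index of every frame that looks like a clap."""
    hits = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        if is_clap_frame(frame, bin_threshold, min_fraction):
            hits.append(start)
    return hits
```

A single-sample impulse stands in for a clap here because its energy is spread evenly over all bins, whereas a pure tone puts its energy into one bin and is rejected. The naive DFT is quadratic in the frame size, so it is only meant to show the logic, not to run on live audio.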
Which OS? I wonder, for example, whether the speech functionality in Windows Vista would help you. Recognizing a single word seems like the simplest possible problem for any speech analyzer.
A question was asked just a few days ago about speech recognition possibilities on Linux. What you are asking for is a subset of that, so I assume some of those answers contain useful information. The article linked in joeforker's answer was very interesting.
I have a voice recording win32 app. I use an OCX to manage recording and playback.
I know it is not exactly the solution you are asking for, but you might want to consider a foot pedal. It is simple to program and would serve much like a spoken word to begin and stop recording. Check these: www.pedalpower.com
Hope it helps,
Reinaldo.