"Voice trigger" detection

Posted 2024-07-21 20:50:11

I have a voice application that would be much improved if it could use a "trigger word" to start recording audio. I don't need a full speech-to-text engine, just the ability to detect the trigger word reliably and efficiently.

I am wondering whether any specialized speech engines support this specific use case, or whether there are libraries or methods for developing such a single-purpose detection engine. Ideally I'd like it to work in noisy environments, but it can be trained for a single user's voice.

Pointers to research papers / topics would also be appreciated so I know what to ask for.

Comments (5)

入怼 2024-07-28 20:50:11

A colleague of mine on the Red5 project created a similar demo using trigger words to cause a search to be run against an image repository. Saying "cat" caused an image of a cat to appear within about a second. The client application was written in Flash and the back-end ran on Red5 using the free Sphinx library. You could certainly do what you want with Sphinx without much effort.

Sphinx project: http://cmusphinx.sourceforge.net/sphinx4/

不必在意 2024-07-28 20:50:11

Okay, I could be completely off, but using a full-featured speech recognition library may be overkill for your use case.

If you can live with something simpler but still audio-driven, consider this:

Detecting a hand clap is very simple. A hand clap has high energy across the whole audio band. Detecting it is simple and computationally much cheaper than full-blown speech recognition.

In a nutshell, you record the audio, run a short-time FFT on the data, and detect the case where you have high energy in 80% of the available frequency bins. Requiring only 80% allows for phasing issues caused by a simple recording-room/microphone setup. Then adjust the threshold to taste and you're done.

Doing the same with speech-recognition is possible as well, but you will burn tons of CPU cycles.
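
The clap-detection steps described above can be sketched roughly as follows. This is only an illustration in plain Python, not a tested detector: the function names (fft, is_clap_frame, detect_claps), the frame size of 256 samples, the Hann window, and the per-bin power threshold are all assumptions chosen for the example, and the threshold in particular would need tuning against real recordings.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def is_clap_frame(frame, bin_threshold, min_fraction=0.8):
    """True if at least min_fraction of the frequency bins are 'loud'."""
    n = len(frame)
    # Hann window to limit spectral leakage from narrowband sounds.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    spectrum = fft(windowed)
    # Keep the positive-frequency bins only and skip the DC bin.
    power = [abs(c) ** 2 / n for c in spectrum[1:n // 2]]
    loud = sum(1 for p in power if p > bin_threshold)
    return loud >= min_fraction * len(power)


def detect_claps(samples, frame_size=256, bin_threshold=1.0, min_fraction=0.8):
    """Return the start index of every frame flagged as a broadband burst."""
    hits = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if is_clap_frame(frame, bin_threshold, min_fraction):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Synthetic check: a steady 440 Hz tone (narrowband) with one sharp
    # impulse (broadband) added in the middle of the fifth frame.
    sample_rate = 8000
    signal = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(2048)]
    signal[1152] += 32.0
    print(detect_claps(signal))
```

In the synthetic check, the steady tone puts its energy into only a handful of bins and should not be flagged, while the impulse spreads energy over nearly all bins. A real clap is less ideal than a single-sample impulse, which is why the 80% fraction and the threshold are left as parameters.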

黒涩兲箜 2024-07-28 20:50:11

What O/S? I wonder, for example, whether the speech functionality in Windows Vista would help you. Recognising a single word seems like the simplest possible problem for any speech analyzer.

几度春秋 2024-07-28 20:50:11

Someone asked a question just a few days ago about speech recognition possibilities on Linux. What you are asking for is a subset of that, so I assume some of those answers could contain useful information. The article linked in joeforker's answer was very interesting.

把昨日还给我 2024-07-28 20:50:11

I have a voice-recording Win32 app. I use an OCX to manage recording/playback.

I know it is not exactly the solution you are asking for, but you might want to consider a foot pedal. It is simple to program and would serve much like a spoken word to begin/stop recording. Check these: www.pedalpower.com

Hope it helps,

Reinaldo.
