Simple speech recognition methods
Yes, I'm aware that speech recognition is fairly complicated (as an understatement). What I'm looking for is a method for distinguishing between maybe 20-30 phrases. An ability to split words (discrete speech is fine) would be nice, but isn't required. The software will be user-dependent (i.e. for use by me). I'm not looking for existing software, but for a good way of going about doing this myself. I've looked into various existing methods, and it seems like splitting the sound into phonemes, while common, is somewhat excessive for my needs.
For some context, I'm just looking for a way to control some aspects of my computer with a few simple voice commands. I'm aware that Windows already has speech recognition software, but I'd like to attempt this myself as a learning exercise. Commands would be simple, like "Open Google" or "Mute". What I had in mind (not sure if this is a good idea) is that some commands would be compound. So "Mute" would just be "Mute", whereas the "Open" command could be recognized on its own and then have its suffix (Google, Photoshop, etc.) recognized by another network/model/whatever. But I'm not sure whether looking for prefixes/word breaks in this way would produce better results than just dealing with a larger number of individual commands.
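Roughly, the structure I'm picturing is something like the sketch below. The two classifier functions are just placeholders for whatever model ends up being used, and splitting the audio at the prefix/suffix boundary is exactly the word-break question I'm unsure about:

```python
# Placeholder sketch of the compound-command idea; neither classifier exists yet.

SIMPLE_COMMANDS = {"mute"}   # standalone commands
PREFIX_COMMANDS = {"open"}   # commands followed by a suffix ("google", "photoshop", ...)

def classify_command(audio):
    """Hypothetical model #1: distinguishes the small set of command words."""
    raise NotImplementedError

def classify_suffix(audio):
    """Hypothetical model #2: distinguishes the suffix targets."""
    raise NotImplementedError

def handle_utterance(audio):
    command = classify_command(audio)
    if command in PREFIX_COMMANDS:
        # In practice the audio would need to be split at the word break here,
        # so the suffix model only sees the part after the command word.
        target = classify_suffix(audio)
        return command, target       # e.g. ("open", "google")
    return command, None             # e.g. ("mute", None)
```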
I've been looking into perceptrons, Hopfield networks (though they're somewhat obsolete, from what I understand) and HMMs, and while I understand the ideas behind these (I've implemented ANNs before), I don't really know which is best suited to this task. I'm assuming that learning vector quantization (LVQ) models would also be appropriate, but I can't really find much literature to this end. Any guidance/resources would be greatly appreciated.
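For what it's worth, my rough understanding of the basic LVQ1 scheme (operating on one fixed-length feature vector per utterance; the feature extraction itself is hand-waved here) is something like this sketch:

```python
import numpy as np

def train_lvq1(samples, labels, prototypes, proto_labels,
               learning_rate=0.05, epochs=20):
    """LVQ1: for each training vector, pull the nearest prototype toward it
    if the labels match, and push it away if they don't."""
    prototypes = prototypes.copy()
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            distances = np.linalg.norm(prototypes - x, axis=1)
            nearest = int(np.argmin(distances))
            sign = 1.0 if proto_labels[nearest] == y else -1.0
            prototypes[nearest] += sign * learning_rate * (x - prototypes[nearest])
    return prototypes

def classify(x, prototypes, proto_labels):
    """Assign the label of the nearest prototype vector."""
    nearest = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    return proto_labels[nearest]
```

So each phrase would just get a handful of prototype vectors, which seems manageable for 20-30 phrases.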
Comments (2)
There are some open source projects in speech recognition:
Both have decoder, training, and language model toolkits. Everything needed to build a complete and robust speech recognizer.
Voxforge has acoustic and language models for both open source speech recognition toolkits.
Some time ago, I read a white paper about a limited vocabulary system which used a simple recognition process. The system divided each utterance into a small number of bins (6 in time and 4 in magnitude, if I remember correctly, for 24 total), and all it did was count the number of sampled audio measurements falling in each bin. A fuzzy logic rule base then interpreted each utterance's 24 bin counts and generated an interpretation.
I imagine that (for some applications) a simple matching process might work just as well, in which the 24 bin counts of the current utterance are simply matched against those of each of your stored prototypes, and the one with the least overall difference is the winner.
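As a rough illustration of what I mean (the exact binning the paper used may have differed, so treat the details as assumptions), the counting and prototype matching could look something like this:

```python
import numpy as np

def bin_counts(samples, time_bins=6, mag_bins=4):
    """Count how many samples of an utterance fall into each
    (time segment, magnitude range) cell, giving 6 * 4 = 24 counts."""
    magnitudes = np.abs(np.asarray(samples, dtype=float))
    peak = magnitudes.max() or 1.0           # avoid division by zero on silence
    counts = np.zeros((time_bins, mag_bins), dtype=int)
    for t, segment in enumerate(np.array_split(magnitudes, time_bins)):
        # Map each sample's magnitude onto one of `mag_bins` equal ranges.
        idx = np.minimum((segment / peak * mag_bins).astype(int), mag_bins - 1)
        for m in range(mag_bins):
            counts[t, m] = int(np.sum(idx == m))
    return counts.ravel()

def recognize(utterance, prototypes):
    """Return the label of the stored prototype whose 24 bin counts have
    the smallest total absolute difference from the current utterance."""
    counts = bin_counts(utterance)
    return min(prototypes,
               key=lambda label: np.abs(prototypes[label] - counts).sum())
```

Here `prototypes` is just a dictionary mapping each phrase to its stored 24 counts, recorded (or averaged over a few recordings) ahead of time.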