Microsoft.Speech general language grammar
If we use the speech recognition feature built into Windows 7, we can see that it is pretty good at guessing what we have dictated. It works not only with a limited set of commands but with any spoken words.
On the other hand, when I program against the classes in the Microsoft.Speech namespace (I have Microsoft Speech Server Runtime 10.2 installed), I always have to define a limited grammar to use.
Is there a way to take a dictated audio file and transcribe it to text without specifying a custom grammar in Microsoft.Speech?
Comments (1)
My understanding is that the desktop operating systems come with a dictation grammar. The server recognizers, however, do not include one, because they were primarily intended for telephony use, where users give short commands to an IVR system. For more background, this question may be helpful: "What is the best option for transcribing speech-to-text in an ASP.NET web app?"
Remember that a desktop recognizer is used by a single user at a time, so it can be trained to improve recognition for that user. Server recognizers are designed to handle many users simultaneously and cannot be trained. Perhaps accurate dictation is too difficult without training? (Or perhaps Microsoft doesn't want to give away all of their best technology?)
I've also read (but haven't checked) that the desktop recognizers support higher-quality audio (higher bit rate and sample size), while the server recognizers are limited to telephony-quality audio. Perhaps accurate transcription requires the higher-quality audio.
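To see which side of that audio-quality line a recording falls on, you can inspect its WAV header. The following is only a minimal sketch in Python using the standard `wave` module. The helper names `describe_wav` and `make_test_tone` are hypothetical, and the 8 kHz and 16 kHz sample rates are my assumptions for typical telephony and wideband audio, not figures taken from the Microsoft documentation.

```python
import io
import math
import struct
import wave


def describe_wav(source):
    """Return sample rate, bits per sample, channels and PCM bit rate of a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # getsampwidth() reports bytes per sample
        channels = wav.getnchannels()
    return {
        "sample_rate_hz": rate,
        "bits_per_sample": bits,
        "channels": channels,
        # Uncompressed PCM bit rate = samples/s * bits/sample * channels
        "bit_rate_bps": rate * bits * channels,
    }


def make_test_tone(sample_rate, seconds=0.1, freq=440.0):
    """Build a mono 16-bit PCM sine tone in memory (a stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        frame_count = int(sample_rate * seconds)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / sample_rate)))
            for i in range(frame_count)
        )
        wav.writeframes(frames)
    buf.seek(0)
    return buf


# Assumed sample rates: 8 kHz for telephony-style audio, 16 kHz for wideband audio.
telephony = describe_wav(make_test_tone(8000))
desktop = describe_wav(make_test_tone(16000))
print(telephony)
print(desktop)
```

For a real dictated recording, pass its path to `describe_wav` instead of the generated tone. At the same sample size, doubling the sample rate doubles the uncompressed bit rate, which is the kind of difference the paragraph above is describing.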