How to set up a speech recognition server?

Posted on 2024-12-27 17:28:09

How can I implement speech recognition on the server side (please don't suggest HTML5's x-webkit-speech, JavaScript, etc.)? The program will take an audio file as input and should produce a text transcription of it with sufficient accuracy. What options can I use?

I have tried implementing Sphinx4 with the Voxforge model, but the accuracy is very poor (there may also be some problem in my configuration; I am still trying to learn it). In one post I read that when we use <input name="speech" id="speech" type="text" x-webkit-speech /> the input is sent to an external server, and that server then does the recognition and sends the data back to the browser.

How can I set up that server? Any existing open source server would also be useful if it can recognize English sentences with a minimal error rate.

归属感 2025-01-03 17:28:09

What type of application are you implementing? Is the purpose of the application to transcribe user spoken input into text or is it meant to just understand simple commands? Systems like Sphinx4 use a statistical model for transcription of speech. You will not get as good recognition with these types of systems as you would with an automated speech recognition (ASR) system that uses grammars to restrict the search space for the ASR to get better recognition. Systems that use statistical models require a lot of tuning and trial runs to get decent recognition.
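As a toy illustration of why a grammar helps, the Python sketch below enumerates every phrase a small command grammar allows and snaps a noisy hypothesis to the closest one. The grammar, the names `ACTIONS`, `DEVICES`, `allowed_phrases`, and `constrain` are all hypothetical, and a real grammar-based ASR applies the constraint during decoding rather than as a post-processing step like this.

```python
import difflib
from itertools import product

# Hypothetical command grammar: <action> the <device>
ACTIONS = ["turn on", "turn off", "dim"]
DEVICES = ["light", "fan", "heater"]


def allowed_phrases():
    """Enumerate every phrase the toy grammar can produce (9 in total)."""
    return ["%s the %s" % (action, device)
            for action, device in product(ACTIONS, DEVICES)]


def constrain(hypothesis, cutoff=0.6):
    """Return the allowed phrase closest to a noisy hypothesis, or None.

    The search space is only the grammar's phrases, so a slightly wrong
    hypothesis can still resolve to a valid command.
    """
    matches = difflib.get_close_matches(
        hypothesis.lower(), allowed_phrases(), n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With only nine valid outputs, a slightly misheard command can still land on the intended phrase, while input that resembles nothing in the grammar is rejected instead of being transcribed as arbitrary text.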

Sphinx4 is the only open source ASR that I am aware of. There are a number of commercial products and services, with Nuance being the biggest in the market. Some of the commercial offerings have the option of having humans transcribe the message when recognition rates are low.

Google has an unofficial API that it uses internally for services like Google Voice, and I believe it is the same one used by the WebKit feature you reference. Google Voice takes voicemail messages, transcribes them, and emails the text to you. Google Voice is considered state of the art for transcription, but if you have a Voice account you will see that the transcribed messages are not that great. Here is a link to a blog article on using the unofficial Google Speech API.

谁把谁当真 2025-01-03 17:28:09

You have several problems to solve:
1. How to capture audio in a client.
2. How to transfer the audio to a server.
3. How to perform recognition.
4. How to transfer back the recognition result and confidence score.
5. What you are going to do with the recognition result and confidence score (your application).

For the first, you can use Google's approach, where the user clicks a microphone icon and records for some time. Or you can use the iPhone Siri approach, where voice activity detection (VAD) decides when to record audio.
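To make the VAD idea concrete, here is a minimal energy-based sketch in Python. It is not how Siri works; it simply marks a frame as voiced when its RMS amplitude crosses a threshold. The input is assumed to be 16-bit little-endian mono PCM, and the function names, the threshold of 500, and the hangover of 10 frames are illustrative assumptions that would need tuning on real recordings.

```python
import math
import struct


def frame_rms(frame):
    """Root-mean-square amplitude of a frame of 16-bit little-endian PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def detect_speech(pcm, sample_rate=16000, frame_ms=20,
                  threshold=500.0, hangover_frames=10):
    """Return (start_sec, end_sec) of the first voiced region, or None.

    A frame counts as voiced when its RMS is at or above `threshold`.
    Recording stops after `hangover_frames` consecutive quiet frames.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    start = None
    last_voiced = None
    silent_run = 0
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        index = offset // frame_bytes
        if frame_rms(pcm[offset:offset + frame_bytes]) >= threshold:
            if start is None:
                start = index
            last_voiced = index
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                break
    if start is None:
        return None
    seconds_per_frame = frame_ms / 1000.0
    return (start * seconds_per_frame, (last_voiced + 1) * seconds_per_frame)
```

A client could run this over the microphone buffer and upload only the slice between the returned start and end times, which keeps silence out of the transfer.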

Second, it is basically a TCP/IP file transfer problem. It is also possible to follow the Apple/Google approach and compress the audio file using FLAC or Speex.

Third, this is the really hard part. You need much better acoustic models than the ones you can get from Voxforge. This is especially true for continuous, context-free speech recognition like Siri. For commands, Voxforge is fine.

Fourth, it is another file transfer problem.
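The two transfer steps (audio up, result and confidence back) can be sketched with nothing but the Python standard library. This is only a skeleton under stated assumptions: the `/transcribe` path, the JSON field names, and the `recognize` function are hypothetical, and `recognize` is a placeholder where a real engine such as Sphinx4 or Julius would be called.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def recognize(audio):
    """Hypothetical placeholder: call a real ASR engine here.

    Returns the shape of answer a client might expect; the text and
    confidence are dummy values, not a recognition result.
    """
    return {"text": "", "confidence": 0.0, "bytes_received": len(audio)}


class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/transcribe":
            self.send_error(404, "unknown endpoint")
            return
        # Read exactly the number of bytes the client said it would send.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps(recognize(audio)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), TranscribeHandler)
```

Running `make_server().serve_forever()` and posting raw audio bytes to `http://127.0.0.1:8000/transcribe` returns the JSON reply. `HTTPServer` handles one request at a time, so this does nothing for the scaling problem.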

Fifth, it is your application.

The hard part is the speech recognition itself. Another problem may be how to scale this to thousands of users.
You can use Julius speech recognition as a speech client to capture audio. We can chat more about this problem privately.

巨坚强 2025-01-03 17:28:09

In Chrome, that server is a proprietary Google server. You can't set up your own version. People have reverse engineered the calls to the server (see http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/ for an example), but this is not a good idea for a production or commercial application, since Google may change the API or limit access to it at any time.

Here is an old answer to a different question, but it may be helpful - https://stackoverflow.com/a/6351055/90236
