Acoustic model training with the SAPI 5.3 Speech API

Posted 2024-07-08


Using Microsoft's SAPI 5.3 Speech API on Vista, how do you programmatically do acoustic model training of a RecoProfile? More concretely, if you have a text file, and an audio file of a user speaking that text, what sequence of SAPI calls would you make to train the user's profile using that text and audio?

Update:

More information about this problem I still haven't solved:
You call ISpRecognizer2.SetTrainingState( TRUE, TRUE ) at "the beginning" and ISpRecognizer2.SetTrainingState( FALSE, TRUE ) at "the end." But it is still unclear just when those actions have to happen relative to other actions.

For example, you have to make various calls to set up a grammar with the text that matches your audio, and other calls to hook up the audio, and other calls to various objects to say "you're good to go now." But what are the interdependencies -- what has to happen before what else? And if you're using an audio file instead of the system microphone for input, does that make the relative timing less forgiving, because the recognizer isn't going to keep sitting there listening until the speaker gets it right?

Answer (绳情, 2024-07-15):

Implementing SAPI training is relatively hard, and the documentation doesn’t really tell you what you need to know.

ISpRecognizer2::SetTrainingState switches the recognizer into or out of training mode.

When you go into training mode, all that really happens is that the recognizer gives the user a lot more leeway about recognitions. So if you’re trying to recognize a phrase, the engine will be a lot less strict about the recognition.

The engine doesn’t really do any adaptation until you leave training mode, and you have set the fAdaptFromTrainingData flag.
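As a minimal sketch of those two calls (assuming `pRecognizer` is an already-initialized in-proc `ISpRecognizer`; error handling elided, not verified):

```cpp
// Sketch only: toggling training mode on an existing recognizer.
CComPtr<ISpRecognizer2> pRecognizer2;
HRESULT hr = pRecognizer->QueryInterface(&pRecognizer2);

// Enter training mode: the engine becomes much more lenient about matches.
hr = pRecognizer2->SetTrainingState(TRUE, TRUE);

// ... run the training recognitions and write the labeled audio files ...

// Leave training mode; passing fAdaptFromTrainingData = TRUE is what
// actually triggers adaptation from the stored training audio.
hr = pRecognizer2->SetTrainingState(FALSE, TRUE);
```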

When the engine adapts, it scans the training audio stored under the profile data. It's the training code's responsibility to put new audio files where the engine can find them for adaptation.

These files also have to be labeled, so that the engine knows what was said.

So how do you do this? You need to use three lesser-known SAPI APIs. In particular, you need to get the profile token using ISpRecognizer::GetObjectToken, and use ISpObjectToken::GetStorageFileName to properly locate the file.

Finally, you also need to use ISpTranscript to generate properly labeled audio files.
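Putting those three APIs together, the file-labeling step might look roughly like this. This is a sketch, not verified code: the `L"TrainingAudio"` value name, the `SPTraining-%d.wav` file specifier, and the audio format are assumptions, and error handling is elided. (`ISpRecognizer::GetRecoProfile` is used here to fetch the profile token.)

```cpp
// Sketch: create a labeled training-audio file under the profile's storage.
// Assumes pRecognizer (ISpRecognizer) is initialized and pszRecoText holds
// the recognized text.
HRESULT hr = S_OK;

CComPtr<ISpObjectToken> pProfileToken;
pRecognizer->GetRecoProfile(&pProfileToken);          // the RecoProfile token

// Ask SAPI for a file path under the profile's storage; the %d in the
// specifier lets SAPI generate a unique file name.
LPWSTR pszFilePath = NULL;
pProfileToken->GetStorageFileName(CLSID_SpInprocRecognizer,
                                  L"TrainingAudio",      // value name (assumed)
                                  L"SPTraining-%d.wav",  // specifier (assumed)
                                  CSIDL_FLAG_CREATE | CSIDL_LOCAL_APPDATA,
                                  &pszFilePath);

// Bind a new SpStream to that file.
CComPtr<ISpStream> pStream;
pStream.CoCreateInstance(CLSID_SpStream);
CSpStreamFormat fmt(SPSF_16kHz16BitMono, &hr);
pStream->BindToFile(pszFilePath, SPFM_CREATE_ALWAYS,
                    &fmt.FormatId(), fmt.WaveFormatExPtr(), 0);

// ... copy the retained recognition audio into pStream here ...

// Label the file so the engine knows what was said.
CComPtr<ISpTranscript> pTranscript;
pStream->QueryInterface(&pTranscript);
pTranscript->AppendTranscript(pszRecoText);

pStream->Close();
CoTaskMemFree(pszFilePath);
```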

To put it all together, you need to do the following (pseudo-code):

Create an inproc recognizer & bind the appropriate audio input.

Ensure that you’re retaining the audio for your recognitions; you’ll need it later.

Create a grammar containing the text to train.

Set the grammar’s state to pause the recognizer when a recognition occurs. (This helps with training from an audio file, as well.)

When a recognition occurs:

Get the recognized text and the retained audio.

Create a stream object using CoCreateInstance(CLSID_SpStream).

Create a training audio file using ISpRecognizer::GetObjectToken and ISpObjectToken::GetStorageFileName, and bind it to the stream (using ISpStream::BindToFile).

Copy the retained audio into the stream object.

QI the stream object for the ISpTranscript interface, and use ISpTranscript::AppendTranscript to add the recognized text to the stream.

Update the grammar for the next utterance, resume the recognizer, and repeat until you’re out of training text.
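The steps above can be tied together in rough SAPI C++ pseudo-code. This is a sketch under stated assumptions, not a tested implementation: the input file name is illustrative, the grammar-loading details are elided, and error handling is omitted throughout.

```cpp
// Sketch of the full training loop (sapi.h / sphelper.h, ATL smart pointers).
CoInitialize(NULL);
{
    // 1. In-proc recognizer bound to the audio file of the user's speech.
    CComPtr<ISpRecognizer> pReco;
    pReco.CoCreateInstance(CLSID_SpInprocRecognizer);
    CComPtr<ISpStream> pInput;
    SPBindToFile(L"user_reading.wav", SPFM_OPEN_READONLY, &pInput);  // path assumed
    pReco->SetInput(pInput, TRUE);

    CComPtr<ISpRecoContext> pCtx;
    pReco->CreateRecoContext(&pCtx);

    // 2. Retain the audio of each recognition; it becomes the training audio.
    pCtx->SetAudioOptions(SPAO_RETAIN_AUDIO, NULL, NULL);
    pCtx->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

    // Enter training mode before recognizing.
    CComPtr<ISpRecognizer2> pReco2;
    pReco->QueryInterface(&pReco2);
    pReco2->SetTrainingState(TRUE, TRUE);

    // 3-4. Grammar holding the current training sentence; auto-pause the
    // recognizer whenever a recognition fires.
    CComPtr<ISpRecoGrammar> pGrammar;
    pCtx->CreateGrammar(1, &pGrammar);
    // ... load a rule containing the sentence to train (details elided) ...
    pGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE_WITH_AUTO_PAUSE);

    bool fMoreText = true;                 // placeholder loop condition
    while (fMoreText) {
        // 5. Wait for the recognition event.
        pCtx->WaitForNotifyEvent(INFINITE);
        CSpEvent evt;
        if (evt.GetFrom(pCtx) == S_OK && evt.eEventId == SPEI_RECOGNITION) {
            ISpRecoResult *pResult = evt.RecoResult();

            // Recognized text and retained audio.
            LPWSTR pszText = NULL;
            pResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                             TRUE, &pszText, NULL);
            CComPtr<ISpStreamFormat> pRetained;
            pResult->GetAudio(0, 0, &pRetained);

            // ... bind a training file under the profile storage, copy
            //     pRetained into it, and label it with
            //     ISpTranscript::AppendTranscript(pszText) ...
            CoTaskMemFree(pszText);
        }
        // Next sentence: reload the grammar, then resume the recognizer.
        // ... update the grammar with the next utterance, or clear fMoreText ...
        pCtx->Resume(0);
        fMoreText = false;                 // placeholder so the sketch ends
    }

    // Leaving training mode with fAdaptFromTrainingData = TRUE triggers
    // the actual adaptation pass.
    pReco2->SetTrainingState(FALSE, TRUE);
}
CoUninitialize();
```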
