What is the right way to "time" phonemes in SAPI TTS (C#)? (SpVoice.Phoneme() -> StreamPosition)
I have the following problem in my application. I am writing an app in which someone types text, SAPI TTS converts it to speech, and I then work with the output WAV.
What I need is information about the phonemes (where in the output WAV each phoneme is, how long the voice takes to say it, etc.).
I used SpVoice.Phoneme() and added a handler for phonemes, so now I can get the duration and so on. But SpVoice.Phoneme() also has a StreamPosition attribute, and I have no idea what it means.
From MSDN:
StreamPosition
The character position in the output stream at which the phoneme begins.
I don't understand whether that means a byte position in the output WAV (the byte at which the phoneme starts), a time in milliseconds in the output WAV, or something else.
For example, for the text:
This is high. This is low. This is fast. This is slow.
I get these StreamPosition values:
Position:0
Position:120
Position:2562
....
Position:143798
Position:147874
Position:151950
The output WAV file is 5.377098 seconds long, and the last phoneme, "ow", is spoken at about 4.734 s.
The output WAV file is 237 568 bytes. So the StreamPosition value 147874 is probably not the byte at which the phoneme begins. The same goes for timing in milliseconds: the WAV is about 5.3 s long, but 151950 ms would be 151.95 s, so that is ruled out too.
So what is StreamPosition? What does its value mean?
I really need to capture the exact time at which each phoneme begins. I tried it with DateTime.Now.Ticks/10000: when the user clicks the button that starts the TTS, I save that value, and when the handler catches a phoneme, I read the value again and compute currTime-startTime. But this method is not very exact; there is always some deviation. Does SpVoice.Phoneme() offer some way to get the exact time at which a phoneme began?
If not, is there a better way to get a more precise time in ms?
Sorry for my English, and many thanks for all answers and advice.
Comments (2)
OK, I will answer my own question. My bachelor's professor sent me some C++ code he wrote. I spent the last two days reading it, and now I see how stupid I was.
So here is the answer.
The StreamPosition attribute really is the byte position in the output stream (probably the WAV).
If you want to know the millisecond position in the output stream, you have to convert that byte offset using the stream's format.
So you need to find information about the output stream, such as bitsPerSample and SamplesPerSec, and from that you get the timing in milliseconds.
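The code from the original answer is not shown here, so the following is only a minimal sketch of the arithmetic the answer describes, not the professor's code. It is written in Python purely for illustration (the same formula ports directly to C#), the function name and parameters are hypothetical, and it assumes uncompressed PCM audio where StreamPosition counts bytes from the start of the audio data.

```python
def stream_position_to_ms(stream_position, samples_per_sec, bits_per_sample, channels=1):
    """Convert a byte offset in a PCM stream to a time in milliseconds.

    Assumes uncompressed PCM and that stream_position is counted in bytes
    from the start of the audio data (an assumption, see the text above).
    """
    bytes_per_sample = bits_per_sample // 8
    bytes_per_second = samples_per_sec * channels * bytes_per_sample
    return stream_position * 1000.0 / bytes_per_second


if __name__ == "__main__":
    # The last three positions from the question. The format used here
    # (16 kHz, 16-bit, mono) is an assumption, not a value from the thread.
    for position in (143798, 147874, 151950):
        ms = stream_position_to_ms(position, samples_per_sec=16000, bits_per_sample=16)
        print(f"Position:{position} -> {ms:.1f} ms")
```

Under that assumed format the last position, 151950, comes out near 4.75 s, which is close to the roughly 4.734 s reported in the question. That is only a plausibility check; the real values of SamplesPerSec and bitsPerSample should be read from the format of the stream the voice is actually writing to.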
1) I am not sure how you save the output to the WAV file, but a file size of 237 568 bytes is larger than expected (if the sampling rate is 16 kHz), since a 5.377098-second WAV file should be
5.377098 * 16000 * 2 = 172067 bytes + header (44 bytes).
So I think your WAV file contains the phoneme events as well.
2) TTS takes time to generate its output, so you can't time it that way. I suggest you:
2.1) Record the phoneme events, as you may already have done in 1).
2.2) Time relative to another event, such as stream start (I am not sure of the exact name).
There is a sample in the Windows SDK:
C:\Program Files\Microsoft SDKs\Windows\v7.1\Samples\winui\speech\ttsapplication
But the code is not in C#.
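The size estimate in point 1 assumes 16 kHz, 16-bit mono audio. One way to check that assumption instead of guessing is to read the format from the WAV header and compare it with the estimate. The sketch below does this with Python's standard `wave` module; it is for illustration only, the function names are hypothetical, and it only handles plain PCM WAV files.

```python
import wave


def wav_summary(path):
    """Read the format and size facts from a PCM WAV file's header."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_width = wf.getsampwidth()   # bytes per sample
        sample_rate = wf.getframerate()    # samples per second
        frames = wf.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "bits_per_sample": sample_width * 8,
        "bytes_per_second": sample_rate * channels * sample_width,
        "data_bytes": frames * channels * sample_width,
        "duration_s": frames / sample_rate,
    }


def expected_data_bytes(duration_s, sample_rate, channels=1, bits_per_sample=16):
    """Audio data size in bytes for a given duration, excluding the header."""
    return round(duration_s * sample_rate * channels * bits_per_sample / 8)


if __name__ == "__main__":
    # Same estimate as in point 1: 5.377098 * 16000 * 2, rounded to whole bytes.
    print(expected_data_bytes(5.377098, 16000))
```

For the numbers in the question, (237568 - 44) / 5.377098 is about 44 170 bytes per second, which is close to the 44 100 bytes per second of 22.05 kHz 16-bit mono audio. That is a guess from arithmetic alone and nothing in the thread confirms it, so the header of the actual file is the thing to trust.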