Where to start with speech synthesis
You guys may be familiar with Google's TTS engine: here.
I have a basic understanding of how something like that is able to analyze the input and pick out different syllables/parts of speech, but where would I start if I wanted to create a "voice" for a TTS system?
Comments (1)
That's a question that I spent nearly a semester in college learning the answer to, and a year (or more) of classes beforehand to learn the underlying signal processing required to understand the process. Whole classes are devoted to speech synthesis, and whole curriculums to signal processing.
One can think of the human vocal tract as a filter, and the glottis as an impulse generator—that is, speech is actually the result of an impulse train filtered by the vocal tract, mouth, and nasal cavity.
For every phoneme, the "filter" will be different, so you will need a library of phonemes to generate "filters" for. Theoretically, inverse filtering could be used on a library of phoneme sound clips to find "filter" coefficients. The Levinson-Durbin recursion is often used to find LPC coefficients.
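To make the LPC step a bit more concrete, here is a minimal sketch of the Levinson-Durbin recursion in plain Python. The function names `autocorrelation` and `levinson_durbin`, and the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, are assumptions chosen for illustration. A real analysis would also window each frame of the phoneme clip first, which is left out here.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the (unnormalized) autocorrelation of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, err) where a[0] == 1.0 and a[1..order] are the coefficients
    of the prediction-error filter A(z), and err is the residual energy.
    The vocal tract "filter" is then modelled as 1 / A(z).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

In use, something like `a, err = levinson_durbin(autocorrelation(frame, 10), 10)` would give a 10th-order filter for one frame. The order of 10 is only an example value; the right order depends on the sample rate.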
A glottal pulse train must be created. A simple way to do this is to convolve a pulse train with a positive half-sine wave.
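As a rough illustration of that step, the sketch below builds an impulse train at the pitch period and convolves it with a positive half-sine. The sample rate, the pitch `f0`, and the `open_fraction` parameter (how much of each period the pulse occupies) are assumed values for the example, not numbers from any reference.

```python
import math


def impulse_train(n_samples, period):
    """One unit impulse every `period` samples, zeros elsewhere."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def half_sine(length):
    """Positive half of a sine wave, sampled at `length` points."""
    return [math.sin(math.pi * i / length) for i in range(length)]


def convolve(x, h):
    """Direct-form linear convolution of two lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip the zeros between impulses
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def glottal_pulse_train(n_samples, sample_rate=16000, f0=100.0, open_fraction=0.4):
    """Approximate glottal excitation: impulse train convolved with a half-sine."""
    period = int(round(sample_rate / f0))
    pulse = half_sine(max(1, int(period * open_fraction)))
    return convolve(impulse_train(n_samples, period), pulse)[:n_samples]
```

Changing `f0` changes the perceived pitch, since it sets the spacing between pulses.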
Finally, filter the glottal pulse train with the "filter" coefficients associated with the phoneme you wish to create.
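The filtering step could look roughly like the sketch below, which runs an excitation signal through an all-pole filter 1 / A(z). The names `all_pole_filter` and `resonator_coefficients` are hypothetical. The two-pole resonator is only a stand-in for a single formant so the example has coefficients to work with; in practice the coefficients would come from the LPC analysis of a phoneme clip.

```python
import math


def all_pole_filter(excitation, a, gain=1.0):
    """Apply y[n] = gain * x[n] - sum(a[k] * y[n - k] for k = 1..p).

    `a` follows the convention a[0] == 1.0 (coefficients of A(z)).
    """
    y = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """A(z) for a single two-pole resonance (stand-in for real LPC output)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle
    return [1.0, -2.0 * r * math.cos(theta), r * r]


# Example: a bare impulse train at 100 Hz shaped by one resonance near 500 Hz.
excitation = [1.0 if i % 160 == 0 else 0.0 for i in range(1600)]
voiced = all_pole_filter(excitation, resonator_coefficients(500.0, 80.0, 16000))
```

Because the pole radius stays below 1, the filter is stable and each pulse rings and decays before the next one arrives.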
But that's only for voiced speech. In order to generate unvoiced speech, a simple solution is to filter a random noise signal with "filter" coefficients associated with unvoiced speech phonemes.
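For the unvoiced case, a minimal sketch is to swap the pulse train for Gaussian white noise and reuse the same kind of all-pole filter. The function names, the fixed seed, and the `gain` value here are assumptions for illustration only.

```python
import random


def white_noise(n_samples, seed=0):
    """Gaussian white noise; seeded so the output is repeatable."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize_unvoiced(a, n_samples, gain=0.1, seed=0):
    """Filter white noise through 1 / A(z), with a[0] == 1.0."""
    noise = white_noise(n_samples, seed)
    y = []
    for n, x in enumerate(noise):
        acc = gain * x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y
```

The only difference from the voiced case is the excitation, so the same "filter" coefficients format works for both.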
One layer of abstraction above that, create a list of the phonemes needed, synthesize each one, and concatenate the results. Simple as pie!
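The concatenation step can be sketched as below, assuming each phoneme has already been synthesized into a list of samples. The `crossfade` option is an addition of this example, not part of the description above: plain concatenation (`crossfade=0`) can click at the joins, and a short linear crossfade is one common way to soften that.

```python
def concatenate(segments, crossfade=0):
    """Join synthesized phoneme segments into one signal.

    With crossfade=0 this is plain concatenation. With crossfade=n, the last
    n samples of the running output are linearly blended with the first n
    samples of the next segment.
    """
    out = []
    for seg in segments:
        n = min(crossfade, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment
            out[start + i] = out[start + i] * (1.0 - w) + seg[i] * w
        out.extend(seg[n:])
    return out
```

For example, `concatenate([seg_h, seg_e, seg_l, seg_o], crossfade=80)` would join four hypothetical segments with a 5 ms overlap at 16 kHz.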
UPDATE:
A friend pointed out Festival, a "black box" to input text and get speech out: http://festvox.org/festival/