Creating a new voice for a text-to-speech engine is a complex process. It is not just a matter of getting a voice artist to record audio and building a voice directly from the recordings. A lot of work goes into it: segmenting the audio into phonemes, building the voice data, building the dictionary, and getting the prosody and the audio joining/synthesis rules right.
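To give a rough idea of what two of those steps involve, here is a minimal Python sketch of a pronunciation dictionary and of cutting a recording into per-phoneme units. Everything in it is a toy assumption for illustration: the `LEXICON` entries, the phoneme symbols, the `Segment` type, and the function names are hypothetical and do not come from any real engine, and the alignment times would in practice come from manual labelling or an alignment tool.

```python
from dataclasses import dataclass

# Hypothetical pronunciation dictionary: word -> phoneme symbols.
# The symbols are illustrative only, not a real engine's phoneme set.
LEXICON = {
    "hello": ["h", "@", "l", "oU"],
    "world": ["w", "3:", "l", "d"],
}


@dataclass
class Segment:
    """One labelled stretch of a recording (times in seconds)."""
    phoneme: str
    start: float
    end: float


def text_to_phonemes(text, lexicon=LEXICON):
    """Look up each word; a real engine also needs rules for unknown words."""
    phonemes = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes


def cut_units(samples, sample_rate, segments):
    """Slice the recorded samples into units, grouped by phoneme label."""
    units = {}
    for seg in segments:
        first = round(seg.start * sample_rate)
        last = round(seg.end * sample_rate)
        units.setdefault(seg.phoneme, []).append(samples[first:last])
    return units
```

Even in this toy form, the voice is only as good as the dictionary coverage and the accuracy of the segment boundaries, and the sketch does not touch prosody or the joining of units at all.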
With an engine like the Microsoft Text-to-Speech engine, you also face the problem that the voice format is proprietary, so you cannot create new voices in that format. You are also limited by the capabilities of the engine.
Your best bet at the moment is one of the following:
switching to the eSpeak text-to-speech engine and using espeakedit to create your own voice (contacting the developer for help with this) -- this engine uses a synthesis method that makes its voices sound similar to Microsoft's and to the voice Stephen Hawking uses, but they are very clear and the pronunciation is on the whole very good;
using a different text-to-speech engine, like Cepstral, that uses voice recordings (these tend to sound more human-like, but I have found that the prosody is not very good, which ruins the resulting audio);
using the service from Cepstral to create a voice specific to your needs (which is likely to be expensive).
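If you want to hear how eSpeak sounds before committing to the first option, something like the following Python sketch can drive its command-line tool. It assumes the `espeak` executable is installed and on your PATH and that an `en` voice is available; the helper names `espeak_command` and `speak` are made up for this example. The `-v` (voice), `-s` (speed in words per minute) and `-w` (write a WAV file) options are the standard eSpeak command-line options.

```python
import shutil
import subprocess


def espeak_command(text, voice="en", wav_path=None, speed=None):
    """Build the argument list for the espeak command-line tool."""
    cmd = ["espeak", "-v", voice]
    if speed is not None:
        cmd += ["-s", str(speed)]   # speaking rate in words per minute
    if wav_path is not None:
        cmd += ["-w", wav_path]     # write a WAV file instead of playing audio
    cmd.append(text)
    return cmd


def speak(text, **options):
    """Run espeak; raises if it is not installed or exits with an error."""
    if shutil.which("espeak") is None:
        raise RuntimeError("espeak not found on PATH")
    subprocess.run(espeak_command(text, **options), check=True)
```

For example, `speak("Hello world", wav_path="hello.wav")` should write the synthesized audio to `hello.wav`, which makes it easy to compare the output against other engines.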
I am looking at using the audio data from librivox.org to generate text-to-speech voices. It is likely to be 3-4 years, though, before I have anything close to functional.