High-quality, emotional, fluent and variable Text-to-Speech engine?
After looking at some services/tools, I've come to a conclusion: most Text-to-Speech tools have voices that are too techy and robotic, in other words, bad quality.
On top of that, it looks like they come with "hard-coded" voice templates, which limits variety and customization. Some tools allow you to set the reading speed and pitch, but that's not enough.
My guess about the problem behind the emotional aspect: it's hard to judge emotions from plain text, even more so if it's just a sentence or two. Plus, the good ol' PC is a machine, and machines don't have emotions, but that's a different story.
The thing that bothers me the most is quality. For example, there are tools out there that tend to cut off the apex of words, resulting in these techy voices. It feels like there's a problem with sentence construction or something. And while people are working on such tools, I wonder what keeps them from working a little more to improve them... cutting off the apex is no small deal! Plus, keep in mind that good, quality Text-to-Speech software is worth, well... A LOT, which makes for a pretty profitable product.
Oh, under fluency I'm also including questions, exclamations and so on. (Possibly those do not belong under fluency, but I'm not a native English speaker, so please excuse me if that's the case.)
A list of tools I've looked into:
Quite impressive, but still have room for improvement (++)
- Loquendo: lacks voice variety, has some minor apex/fluency problems (depending on the sentence), and too much coughing and excuses in the examples!
- Nuance Vocalizer: still lacks variety, but some of the provided voices are worthy.
Could just as well cooperate to pool resources rather than work on different but almost equal products (--)
- eSpeak: one of the best robots out there, hence the program logo(?!)
- Natural Reader (dumb autoplay!!): well, it has some fluency, but that techy feeling still kicks in.
- iSpeech: good laugh when setting the voice to Japanese with English text. I bet Japanese guys aren't very happy about it.
- Cepstral + Enhanced Voices: the enhanced voices give the same good ol' crappy result, so, except for ~5 more voices, nothing has been enhanced.
- AT&T: decent fluency, but has problems with sentence endings and too much robo!
- LumenVox TTS: looks like it comes from a background with lots of speech tools, but still results in robotic voices.
- And some more...
In case I've missed something worth a look, please share. It can be free, commercial, super expensive... as long as it works, I'm interested!
And the question(s):
- What do you think are the main issues behind the quality, fluency and variety of those voices? Since the emotional aspect is hard to judge, I don't mind if you skip it, but if you have an idea or two, please share your thoughts.
- How is text transformed into speech? Like, what algorithms are used behind these tools? Maybe a fresh theory or two could come in handy.
- Are those actually different engines/drivers, or just different voice patterns for the same driver/engine?
- Is it just me, or has the quality of Text2Speech tools not changed much (or at all) over the years since the first ones? I have to admit that this old-school Apple tool provides better results than some of the year 2000+ tools, at least when comparing its video with the tools I've looked into.
Comments (3)
I don't know if you're looking for an open solution, but if you have a Mac, you should check out OS X advanced speech markup and the "Repeat After Me" phrase building tool. It's really powerful. The Alex voice built into Mac OS X 10.5 and later is more advanced than the other voices.
On a Mac, highlight the following text, control-click, and go to Speech > Start Speaking:
http://www.mattmontag.com/personal/mac-os-x-speech-synthesis-markup
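To give a feel for what that markup looks like, here is a minimal Python sketch that prefixes a sentence with Apple's embedded speech commands (`[[rate]]`, `[[pbas]]` for baseline pitch, `[[slnc]]` for a silence in milliseconds) and hands it to the macOS `say` command-line tool. The helper names `mark_up` and `speak` are made up for this illustration, the numeric values are arbitrary, and it assumes the Alex voice is installed; treat it as a starting point, not a reference.

```python
import shutil
import subprocess


def mark_up(text, rate=None, pitch=None, pause_ms=None):
    """Prefix text with Apple embedded speech commands (hypothetical helper).

    pause_ms -> [[slnc N]]  silence in milliseconds before the text
    rate     -> [[rate N]]  speaking rate in words per minute
    pitch    -> [[pbas N]]  baseline pitch
    """
    commands = []
    if pause_ms is not None:
        commands.append(f"[[slnc {pause_ms}]]")
    if rate is not None:
        commands.append(f"[[rate {rate}]]")
    if pitch is not None:
        commands.append(f"[[pbas {pitch}]]")
    return " ".join(commands + [text])


def speak(marked_up_text, voice="Alex"):
    """Speak via the macOS `say` tool; return False if it is missing or fails."""
    if shutil.which("say") is None:
        return False  # not on macOS, nothing to do
    result = subprocess.run(["say", "-v", voice, marked_up_text])
    return result.returncode == 0


if __name__ == "__main__":
    # Only builds and prints the marked-up string; call speak(line) on a Mac.
    line = mark_up("Is this any less robotic?", rate=170, pitch=55, pause_ms=300)
    print(line)
```

On a Mac, `speak(line)` should read the sentence with the adjusted rate and pitch; elsewhere it simply returns `False`.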
The TTS used by Google Translate is quite good for short phrases, though likely to produce an unnatural intonation contour for anything complicated. Still, at the word level, it's impressive.
There is a small code example here
And there's Ivona. They might make slightly more articulation errors than e.g. Google Translate, but they do somewhat better on rhythm and intonation. Check out their 'Raveena' voice; it's one of their best yet.
I know that this is an old question, but I just saw the demo of "Watson" from IBM, and it's pretty impressive! They support several languages, and you can control tone, pauses, intonation and some other variables.
You should go and take a look if you are still looking for this, or if any other person is looking for a good TTS.
Disclaimer: I don't work for IBM or anything related to this product, I just found it impressive!
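Controls like these are commonly expressed in SSML, the W3C speech markup that several engines accept. As a rough sketch (not taken from the Watson demo), the Python below builds an SSML string with a `<prosody>` element for rate and pitch and `<break>` elements for pauses. The function name `build_ssml` and the default values are assumptions for illustration, the namespace and `xml:lang` attributes a strict parser may require are left out, and which tags and values a given engine honors should be checked in its documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences, rate="medium", pitch="medium", pause_ms=400):
    """Wrap sentences in <prosody> and put a <break> pause between them.

    Hypothetical helper: returns the SSML document as a string.
    """
    speak = ET.Element("speak", {"version": "1.0"})
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    for index, sentence in enumerate(sentences):
        s = ET.SubElement(prosody, "s")
        s.text = sentence  # ElementTree escapes &, < and > on output
        if index < len(sentences) - 1:
            # Pause only between sentences, not after the last one.
            ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml(["Hello there.", "How are you?"], rate="slow", pitch="high"))
```

The resulting string would then be sent as the input text to whichever SSML-capable service you are trying out.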