API to break speech into phonemes / synthesize new speech from a given voice sample?
You know those movies where the tech geeks record someone's voice, and their software breaks it into phonemes? They can then type in any phrase and make it seem as if the target is saying it.
Does that software exist as an API? I don't even know what to Google.
Comments (7)
There is no such software. Breaking arbitrary speech into its constituent phonemes is only a partially solved problem: speech-to-text software is still imperfect, as is text-to-speech.
The idea is to reproduce the timbre of the target's voice. Even if you were able to segment the audio perfectly, reordering the phonemes would produce audio with unnatural cadence and intonation, not to mention splicing artifacts. At that point you're getting into smoothing, time-scaling, and pitch correction, all of which are possible and well understood in theory but work poorly on real-world data, especially when the audio sample in question is as short as a single phoneme, and even more so when the timbre needs to be preserved.
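To make the splicing point concrete, here is a minimal sketch of joining two segments with a linear crossfade, which is about the simplest form of the smoothing mentioned above. Everything in it is an assumption for illustration: the helper names `sine` and `splice` are hypothetical, the two test tones merely stand in for recorded phoneme segments, and the 8000 Hz rate and 80-sample overlap are arbitrary.

```python
import math

def sine(freq_hz, n_samples, rate=8000):
    """Generate a test tone standing in for one recorded phoneme segment."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]

def splice(a, b, overlap):
    """Join segments a and b, linearly crossfading over `overlap` samples."""
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the segments")
    start = len(a) - overlap
    head = a[:start]
    # Fade a out while fading b in across the overlapping region.
    fade = [
        a[start + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + fade + b[overlap:]

seg_a = sine(220, 400)   # hypothetical "phoneme" 1
seg_b = sine(330, 400)   # hypothetical "phoneme" 2

hard_cut = seg_a + seg_b                   # abrupt junction between segments
smooth = splice(seg_a, seg_b, overlap=80)  # junction blended over 80 samples

print(len(hard_cut), len(smooth))
```

A crossfade only hides the discontinuity at the junction. It does nothing for the cadence, intonation, or allophone problems described in this answer, which is why naive concatenation still sounds unnatural.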
These problems are compounded on the phonetic side by allophonic variation in sounds based on accent and surrounding phonemes; in order to faithfully produce even a low-quality approximation of the audio, you'd need a detailed understanding of the target's language, accent, and speech patterns.
Furthermore, your ultimate problem is one of social engineering, and people are not easy to fool when it comes to the voices of people they know. Even with a large corpus of input data, at best you could get a short low-quality sample, hardly enough for a conversation.
So while it's certainly possible, it's difficult; even if it existed, it wouldn't always be good enough.
SRI International (the research institute where Siri originated) has an SDK called EduSpeak, which takes audio input and breaks it down into individual phonemes. I know this because I sat through a demo of the product about a week ago. During the demo, the presenter showed us an application that was created using the SDK. The application gave the presenter a few lines of text to read. After he read the text, the application displayed a bar chart where each bar represented a phoneme from his speech. The height of each bar represented a score of how well that phoneme was pronounced (the presenter was not a native English speaker, so he received lower scores on certain phonemes than on others). The presenter could also click on an individual bar to play back only that phoneme from the original audio.
So yes, software exists that divides audio up by phoneme, and it does a very good job of it. Now, whether or not those phonemes can be re-assembled into speech is an open question. If we end up getting a trial version of the SDK, I'll try it out and let you know.
If your aim is to mimic someone else's voice, then another approach is to convert your own voice (instead of assembling phonemes). It is (unsurprisingly) called voice conversion, e.g. http://www.busim.ee.boun.edu.tr/~speech/projects/Voice_Conversion.htm
The technology is called "voice synthesis" and "voice recognition".
The Java API for this is the Java Speech API (JSAPI).
Apple has an API for this: Apple speech.
Microsoft has several; one is discussed here: Vista speech.
Lyrebird is a start-up that is working on this very problem. Given samples of a person's voice and some written text, it can synthesize a spoken version of that written text in the voice of the person in the samples.
You can get interesting voice warping effects with a formant-aware pitch shift. Adobe Audition has a pretty good implementation. Antares produces some interesting vocal effects VST plugins.
These techniques use some form of linear predictive coding (LPC) to treat the voice as a source-filter model. LPC works on speech signals by estimating the resonances of the vocal tract (the formants), reversing their effect with an inverse filter, and then coding the resulting residual signal. The residual signal is ideally an impulse train that represents the glottal pulses. This allows pitch and formants to be scaled independently, which leads to a much better gender conversion result than simple pitch shifting.
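The analysis step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Audition or the Antares plugins are implemented: it uses the textbook autocorrelation method with the Levinson-Durbin recursion, the function names are hypothetical, and the "voice" is a synthetic signal (an impulse train at an assumed 100 Hz pitch fed through a single assumed 700 Hz resonance at an 8000 Hz sample rate).

```python
import math

def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return inverse-filter coefficients a (a[0] == 1) and the prediction error."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1 - k * k
    return a, err

def inverse_filter(x, a):
    """Residual e[n] = sum_k a[k] * x[n - k]: removes the estimated resonance."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

# Synthetic source-filter signal (all values are assumptions for the demo).
rate, period, n = 8000, 80, 800
radius, theta = 0.9, 2 * math.pi * 700 / rate
true_a = [1.0, -2 * radius * math.cos(theta), radius ** 2]

source = [1.0 if i % period == 0 else 0.0 for i in range(n)]  # "glottal" pulses
signal = []
for i in range(n):
    y = source[i]
    if i >= 1:
        y -= true_a[1] * signal[i - 1]
    if i >= 2:
        y -= true_a[2] * signal[i - 2]
    signal.append(y)                                          # "vocal tract" output

est_a, _ = levinson_durbin(autocorr(signal, 2), order=2)
residual = inverse_filter(signal, est_a)

print("true coefficients:     ", true_a)
print("estimated coefficients:", est_a)
```

On this idealized input the residual comes back as something close to the original pulse train, which is the separation that lets pitch (pulse spacing) and formants (filter coefficients) be changed independently. Real speech needs short windowed frames and a higher model order, and its residual is far less clean.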
I dunno about a commercially available solution, but the concept isn't entirely out of the range of possibility. For example, the University of Delaware has fairly decent software for doing just that.
http://www.modeltalker.com