Are there any signal-processing algorithms that can reverse-engineer how the human vocal system produces sound waves?

Posted 2024-11-08 19:00:53


Given a long audio tape with 3 speakers on it, how can we get information on how their mouths open and close? We have an audio recording with more than one speaker. The sound is clear and does not require noise reduction. We want to create an animation with speaking 3D heads. In general, we want to derive the mouth movements from the sound data.

In fact, our 3D heads already move somehow via some default animations. For example, we have a prepared animation for the 'O' sound for each person; what we need is this information: at which millisecond did which person produce which sound?

So it is like speech-to-text, but for individual sounds, and for more than one person on a single recording.

[Image: head diagram marking the points D5, D6, D9]

In general (in the perfect case) we want to obtain signals describing the movements of the D9, D6, and D5 point pairs, from more than one speaker, in English of course.

Are there any papers with algorithms, or open-source libraries?

So far I have found some libraries:

http://freespeech.sourceforge.net/
http://cmusphinx.sourceforge.net/

but I have not used any of them yet...
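
For reference, here is roughly the kind of per-phoneme timestamp output I am after. This is only an untested sketch assuming the pocketsphinx Python bindings and the stock en-us phoneme model (the model and file paths are placeholders, and the exact API differs between binding versions):

```python
# Untested sketch: phoneme-level timestamps with pocketsphinx.
# Assumes 16 kHz, 16-bit mono raw PCM input; model paths are placeholders.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                    # acoustic model
config.set_string('-allphone', 'model/en-us-phone.lm.bin')  # phoneme LM
config.set_float('-lw', 2.0)
config.set_float('-beam', 1e-20)

decoder = Decoder(config)
decoder.start_utt()
with open('speech.raw', 'rb') as f:
    while True:
        buf = f.read(2048)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

# Frames are 10 ms each by default, so frame index * 10 gives milliseconds.
for seg in decoder.seg():
    print(seg.word, seg.start_frame * 10, seg.end_frame * 10)
```

Note this alone says nothing about which *person* spoke; the speaker attribution would still have to come from somewhere else.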


Comments (4)

抚笙 2024-11-15 19:00:53


Interesting problem!! The first thing that came to my mind was to use motion detection to identify any movements in regions D5, D6, and D9. Extend D5, D6, D9 into rectangles and use one of the approaches mentioned here to detect motion within those regions.

Of course you have to first identify a person's face and the regions D5, D6, D9 in a frame before you can start monitoring any motion.

You can use a speech recognition library to detect phonemes in the audio stream alongside the motion, try to map motion features (like region, intensity, and frequency) to phonemes, and build a probabilistic model that maps mouth motions to phonemes. A rough sketch of the motion-detection half is given below.
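
Purely as an illustration (not a full solution), here is a frame-differencing sketch with OpenCV, using the lower third of a Haar-detected face box as a crude stand-in for the mouth rectangle; a real system would locate the D5/D6/D9 landmarks properly, and the video path is a placeholder:

```python
# Illustrative sketch: frame differencing inside a mouth region (OpenCV).
# The mouth ROI is crudely approximated as the lower third of the detected
# face rectangle; the input video path is a placeholder.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

cap = cv2.VideoCapture('speakers.mp4')  # placeholder input
prev_roi = None
frame_idx = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        # Lower third of the face box as a stand-in mouth rectangle.
        roi = cv2.resize(gray[y + 2 * h // 3:y + h, x:x + w], (64, 32))
        if prev_roi is not None:
            # Mean absolute frame difference = crude "mouth motion" signal.
            print(frame_idx, cv2.absdiff(roi, prev_roi).mean())
        prev_roi = roi
        break  # one face per frame in this sketch
    frame_idx += 1

cap.release()
```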

Really interesting problem!! I wish I were currently working on something this interesting :).

Hope I mentioned something useful in here.

柠檬色的秋千 2024-11-15 19:00:53


This is an instance of the "cocktail party problem" or its generalization, "blind signal separation".

Unfortunately, while good algorithms exist if you have N microphones recording N speakers, performance of blind algorithms with fewer microphones than sources is quite bad. So those are not much help.
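
As a toy illustration of the determined (N-by-N) case where it does work, here is a sketch with scikit-learn's FastICA on two synthetic sources and two mixtures (not something from the original question):

```python
# Toy illustration: ICA unmixing when #microphones == #sources.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 440 * t)          # source 1: 440 Hz tone
s2 = np.sign(np.sin(2 * np.pi * 97 * t))  # source 2: 97 Hz square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.6], [0.4, 1.0]])    # unknown mixing matrix
X = S @ A.T                               # two "microphone" recordings

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)              # recovered sources, up to
                                          # permutation and scaling
```

With one recording and three speakers, as in the question, this determined setup simply does not apply.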

There is no particularly robust method I know of (there certainly wasn't as of five years ago) to separate speakers, even with extra data. You may be able to train a classifier on human-annotated spectrograms of the speech so that it can pick out who is who, and then possibly use speaker-independent voice recognition to try to figure out what is said, and then use the 3D speaking models used for high-end video games or movie special effects. But it won't work well.
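
The classifier idea, reduced to a toy sketch (synthetic audio and a made-up one-label-per-second annotation format, purely to show the shape of the pipeline):

```python
# Toy sketch: classify "who is speaking" from labelled spectrogram slices.
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC

fs = 16000
audio = np.random.randn(fs * 10)                   # placeholder for real audio
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])  # speaker id per second

f, t, Sxx = spectrogram(audio, fs, nperseg=512)
X = np.log(Sxx + 1e-10).T                  # one feature row per time slice
y = labels[np.minimum(t.astype(int), len(labels) - 1)]

clf = SVC().fit(X, y)                      # train on the annotated slices
print(clf.predict(X[:5]))                  # predicted speaker per slice
```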

You would be better off hiring three actors to listen to the tape and then each recite the part of one of the speakers while you video them. You will get a much more realistic appearance with much less time, effort, and money. If you want a variety of 3D characters, put markers on the actors' faces and capture their positions, then use those as control points on your 3D models.

So要识趣 2024-11-15 19:00:53


I think that you are looking for what is known as "Blind Signal Separation". An academic paper surveying this is:

Blind signal separation: statistical principles (pdf)

Jean-François Cardoso, C.N.R.S. and E.N.S.T.

Abstract— Blind signal separation (BSS) and independent component analysis (ICA) are emerging techniques of array processing and data analysis, aiming at recovering unobserved signals or ‘sources’ from observed mixtures (typically, the output of an array of sensors), exploiting only the assumption of mutual independence between the signals. The weakness of the assumptions makes it a powerful approach but requires to venture beyond familiar second order statistics. The objective of this paper is to review some of the approaches that have been recently developed to address this exciting problem, to show how they stem from basic principles and how they relate to each other.
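
In standard textbook notation (not quoted from the paper), the model BSS/ICA works with is:

```latex
% Standard BSS/ICA mixing model: observed mixtures x(t), unknown mixing
% matrix A, unknown independent sources s(t).
\[
  x(t) = A\,s(t), \qquad \hat{s}(t) = W\,x(t), \quad W \approx A^{-1},
\]
% where W is estimated using only the assumption that the components of
% s(t) are mutually statistically independent.
```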

I have no idea how practical what you are trying to do is, or how much work it might take, if practical.

粉红×色少女 2024-11-15 19:00:53


Some work that came out of the University of Edinburgh about 15 years ago (probably the basis of the voice recognition we have) is applicable. They were able to automatically turn any intelligible English speech (without the program being trained) into a set of about 40 symbols, one for each distinct sound we use. That capability, combined with waveform-signature analysis to identify the human of interest, is "all" you need.

This is an engineering problem for sure. But not a programming problem suitable for Stack Overflow. I look forward to the day it is though. :-)
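
To tie that to the animation goal: once you have those ~40 symbols with timestamps, the usual move is a many-to-one table from phonemes to mouth shapes. A purely hypothetical fragment using ARPAbet-style symbols (the viseme class names are made up):

```python
# Hypothetical fragment: collapse ARPAbet-style phonemes into a few
# viseme (mouth-shape) classes for driving the 3D head animation.
PHONEME_TO_VISEME = {
    'AA': 'open',   'AE': 'open',   'AH': 'open',
    'OW': 'round',  'UW': 'round',  'AO': 'round',
    'M':  'closed', 'B':  'closed', 'P':  'closed',
    'F':  'teeth',  'V':  'teeth',
    'IY': 'wide',   'EH': 'wide',
    'SIL': 'rest',  # silence
}

def viseme_track(phoneme_segments):
    """Map (phoneme, start_ms, end_ms) tuples to viseme keyframes."""
    return [(PHONEME_TO_VISEME.get(p, 'rest'), start, end)
            for p, start, end in phoneme_segments]
```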
