I plan to write conversation analysis software, which will recognize the individual speakers and their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
How would I go about recognizing individual speakers, so I can record each speaker's features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that for training, each speaker can record a minute's worth of data before actual analysis.
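For the autocorrelation part, here is a minimal sketch of per-frame pitch estimation, assuming a mono float signal and a known sample rate; the 50-400 Hz search range is an illustrative choice, not a requirement.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame via autocorrelation."""
    frame = frame - np.mean(frame)           # remove DC offset
    corr = np.correlate(frame, frame, mode="full")
    corr = corr[len(corr) // 2:]             # keep non-negative lags only
    lag_min = int(sample_rate / fmax)        # shortest period to search
    lag_max = int(sample_rate / fmin)        # longest period to search
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / lag                 # pitch estimate in Hz
```

As a sanity check, a 200 Hz sine wave sampled at 16 kHz should come back as roughly 200 Hz from any sufficiently long frame.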
Comments (2)
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification, aka speaker recognition - see e.g. http://en.wikipedia.org/wiki/Speaker_identification
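To make the "feature space plus database" idea concrete, here is a sketch using one common choice (mine, not something this answer specifies): MFCC features with a single Gaussian model per enrolled speaker, which fits the one-minute-per-speaker training assumption from the question. It assumes the librosa and scipy packages; a GMM per speaker is the classic refinement.

```python
import numpy as np
import librosa
from scipy.stats import multivariate_normal

def mfcc_frames(signal, sample_rate):
    # Rows are frames, columns are MFCC coefficients.
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13).T

def enroll(signal, sample_rate):
    """Fit a single Gaussian to a speaker's training audio."""
    feats = mfcc_frames(signal, sample_rate)
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def identify(signal, sample_rate, speaker_models):
    """Return the enrolled speaker whose model best explains the segment."""
    feats = mfcc_frames(signal, sample_rate)
    scores = {
        name: multivariate_normal.logpdf(feats, mean, cov).sum()
        for name, (mean, cov) in speaker_models.items()
    }
    return max(scores, key=scores.get)
```

Usage would be to call enroll() once per speaker on their minute of training audio, store the results in a dict keyed by speaker name, and then call identify() on each non-overlapping speech segment.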
If you are still working on this... are you using speech recognition on the sound input? Microsoft SAPI, for example, provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and that the relative positions and relative amplitudes of the formants are (relatively!) independent of overall volume.) Phoneme duration in context might also be a useful feature. The energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.
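A rough illustration of the vowel power-spectrum idea, assuming you already have a vowel segment (e.g. located via SAPI phoneme positions) as a NumPy array; the peak picking here is a crude stand-in for proper formant estimation, which is usually done with LPC.

```python
import numpy as np
from scipy.signal import find_peaks

def rough_formants(vowel_segment, sample_rate, n_peaks=3):
    """Return candidate formant frequencies (Hz) from a vowel's power spectrum."""
    windowed = vowel_segment * np.hanning(len(vowel_segment))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / sample_rate)
    log_spec = np.log(spectrum + 1e-10)
    # Light smoothing so peak picking finds spectral envelope bumps,
    # not individual pitch harmonics.
    smooth = np.convolve(log_spec, np.ones(9) / 9, mode="same")
    min_spacing = max(1, int(200.0 / (freqs[1] - freqs[0])))  # >=200 Hz apart
    peaks, _ = find_peaks(smooth, distance=min_spacing)
    strongest = peaks[np.argsort(smooth[peaks])[::-1][:n_peaks]]
    return np.sort(freqs[strongest])
```

The returned peak frequencies (and their relative amplitudes, if you also keep smooth[strongest]) could serve as per-speaker features for the vowels, along the lines the comment suggests.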