Audio analysis to detect human voice, gender, age and emotion: any prior open-source work done?

Posted 2024-10-18 08:03:56

Is there prior open-source work in the field of audio analysis to detect a human voice (say, in spite of some background noise), determine the speaker's gender, and possibly determine the number of speakers, the age of the speaker(s), and the emotion of the speakers?

My hunch is that speech recognition software like CMU Sphinx could be a good place to start, but if there's something better, it'd be great.

Comments (3)

意中人 2024-10-25 08:03:56

I'm a graduate student doing speech recognition research. These are open research problems, and, unfortunately, I'm not aware of open-source packages that can do these things out of the box.

If you have some background in implementing signal-processing or machine-learning algorithms, you could try looking up academic papers using some of these search terms:

  • gender identification (sometimes called gender recognition): predicting the gender of the speaker from the speech utterance
  • age identification: predicting the age of the speaker
  • speaker identification: predicting, from a set of possible speakers, the most likely speaker in a speech utterance
  • speaker verification: accepting or rejecting an utterance as belonging to a speaker (imagine a "voiceprint"-type authorization system)
  • speaker diarization: taking an audio file with multiple speakers and labeling which segments of speech belong to which speaker
  • emotion recognition: predicting the speaker's emotion from a speech utterance (a very new area of research).

According to its FAQ (http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html#speaker_identification), CMU Sphinx, which is probably the leading open-source speech recognizer out there, does not support speaker identification; I'm doubtful that it has any of the other capabilities described above.

Some academic researchers post their code online, and/or might be willing to share it with you. A search of Google Scholar reveals many people who've written Master's or PhD theses using Sphinx, so that could be a good place to start.

Lastly, if you know a little signal processing, you could try to implement a very crude gender-recognition algorithm without getting into the speech recognizer itself. Basically, male and female voices differ in their fundamental frequency: according to Wikipedia (http://en.wikipedia.org/wiki/Voice_frequency), male voices fall between 85 and 180 Hz, while female voices fall between 165 and 255 Hz. You could use something like sox to determine the frequency spectrum of an utterance (using the fast Fourier transform) and classify the speech as "male" or "female" depending on some summary statistic like the average frequency (see http://classicalconvert.com/tag/sox/). To make this work robustly (i.e. across many speakers, microphones, or recording environments), there are plenty of further refinements you could add. I can't predict how much time and effort it would take to reach 70% accuracy, since that depends on the nature of your task; my sense is that 90%+ would definitely be very hard.
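
To make the idea concrete, here is a minimal Python sketch of such a crude classifier. It is an illustration under stated assumptions, not a tested recipe: it estimates the fundamental frequency with autocorrelation (computed via the FFT) rather than from sox output, the function names `estimate_f0` and `crude_gender_guess` are made up for this example, the 75-300 Hz search band is an assumed range, and the 172.5 Hz threshold is simply the midpoint of the overlap between the two ranges quoted above. It also assumes the input is a single short, voiced, reasonably clean frame.

```python
import numpy as np


def estimate_f0(signal, sample_rate, fmin=75.0, fmax=300.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Uses the autocorrelation method; fmin/fmax are assumed search bounds.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()  # remove any DC offset
    n = len(x)

    # Autocorrelation via the FFT, zero-padded to avoid circular wrap-around.
    spectrum = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(spectrum * np.conj(spectrum))[:n]

    # Only consider lags that correspond to plausible voice pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag


def crude_gender_guess(f0_hz, threshold_hz=172.5):
    """Label a pitch as "male" or "female" with a single assumed threshold."""
    return "male" if f0_hz < threshold_hz else "female"
```

On real recordings you would run `estimate_f0` on many short frames, discard silent or unvoiced ones, and classify on a summary statistic such as the median pitch; a single threshold will misclassify many speakers, since the two ranges overlap.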

Good luck!

游魂 2024-10-25 08:03:56

It can be kind of difficult to extract low-level information such as pitch and power using CMU Sphinx 4 (though the older version might have the capability). I would suggest you use Praat. You can write scripts to extract the pitch tier and each of the formants in a speaker's voice. Honestly, the Praat scripting language is horrific, but it quickly does many things that would otherwise take a long time. Many Praat scripts are posted online, too. See http://www.fon.hum.uva.nl/praat/.

愿得七秒忆 2024-10-25 08:03:56

For your speech/non-speech classification and diarization question (determine number of speakers and when they are speaking): there is an open-source toolkit that can do this (automatically, so there will be mistakes in the output of course). Have a look at this post:

Stack Overflow question on diarization
