Audio analysis to detect human voice, gender, age and emotion: has any prior open-source work been done?

Posted on 2024-10-18 08:03:56

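Is there any prior open-source work in the field of "audio analysis" that can detect a human voice (say, in spite of some background noise), determine the gender of the speaker, and possibly determine the number of speakers, the age of the speaker(s), and the emotion of the speaker(s)?

My hunch is that speech recognition software like CMU Sphinx might be a good place to start, but if there is something better, that would be great.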

意中人 2024-10-25 08:03:56

I'm a graduate student doing speech recognition research. These are open research problems, and, unfortunately, I'm not aware of open-source packages that can do these things out of the box.

If you have some background in implementing signal-processing or machine-learning algorithms, you could try looking up academic papers using some of these search terms:

gender identification (sometimes called gender recognition): predicting the gender of the speaker from the speech utterance
age identification: predicting the age of the speaker
speaker identification: predicting, from a set of possible speakers, the most likely speaker in a speech utterance
speaker verification: accepting or rejecting an utterance as belonging to a speaker (imagine a "voiceprint"-type authorization system)
speaker diarization: taking an audio file with multiple speakers and labeling which segments of speech belong to which speaker
emotion recognition: predicting the speaker's emotion from a speech utterance (a very new area of research).

According to its FAQ (http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html#speaker_identification), CMU Sphinx, which is probably the leading open-source speech recognizer out there, does not support speaker identification; I'm doubtful that it has any of the other capabilities described above.

Some academic researchers post their code online, and/or might be willing to share it with you. A search of Google Scholar reveals many people who've written Master's or PhD theses using Sphinx, so that could be a good place to start.

Lastly, if you know a little bit of signal processing, you could try to implement a very crude gender-recognition algorithm without getting into the speech recognizer itself. Basically, male and female voices differ in their fundamental frequency: according to Wikipedia (http://en.wikipedia.org/wiki/Voice_frequency), male voices fall between 85 and 180 Hz, while female voices fall between 165 and 255 Hz. The two ranges overlap, so a single cut-off cannot separate them perfectly. You could use something like sox to determine the frequency spectrum of an utterance (via the fast Fourier transform) and classify speech as "male" or "female" depending on some summary statistic like the average frequency (see http://classicalconvert.com/tag/sox/). There are plenty of things you can do to make this work robustly (i.e. across many speakers, microphones, or recording environments). I can't predict how much time and effort would be required to get 70% accuracy, since that depends on the nature of your task; my sense is that 90%+ would definitely be very hard.
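To make the crude approach a bit more concrete, here is a minimal sketch in plain Python. It is not the sox/FFT route mentioned above: it swaps in a simple autocorrelation pitch estimate, which is another common way to get at the fundamental frequency. The function names, the 60-400 Hz search range, the synthetic test tone, and the 172.5 Hz cut-off (the midpoint of the overlap between the two ranges quoted from Wikipedia) are all assumptions made for illustration, not values from any library or paper.

```python
import math

# Assumed cut-off: midpoint of the 165-180 Hz overlap between the quoted ranges.
GENDER_THRESHOLD_HZ = 172.5


def synth_voice(f0, sample_rate, n_samples):
    """Hypothetical stand-in for a voiced frame: a fundamental plus two weaker harmonics."""
    partials = ((1, 1.0), (2, 0.5), (3, 0.25))  # (harmonic number, amplitude)
    return [
        sum(amp * math.sin(2 * math.pi * f0 * k * i / sample_rate)
            for k, amp in partials)
        for i in range(n_samples)
    ]


def estimate_f0(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame via autocorrelation.

    Returns None when no lag in the search range has positive correlation
    (for example, for a silent frame).
    """
    n = len(samples)
    mean = sum(samples) / n
    x = [s - mean for s in samples]  # remove any DC offset

    # Only consider lags that correspond to plausible voice pitches.
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)

    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(x[i] * x[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    if best_lag is None:
        return None
    return sample_rate / best_lag


def classify_gender(f0_hz, threshold_hz=GENDER_THRESHOLD_HZ):
    """Very crude rule: below the threshold is 'male', otherwise 'female'."""
    if f0_hz is None:
        return "unknown"
    return "male" if f0_hz < threshold_hz else "female"


if __name__ == "__main__":
    sample_rate = 8000
    for true_f0 in (125.0, 200.0):
        frame = synth_voice(true_f0, sample_rate, 2048)
        f0 = estimate_f0(frame, sample_rate)
        print(true_f0, f0, classify_gender(f0))
```

On real recordings you would run this frame by frame over voiced segments only and classify on a summary statistic such as the median pitch; clean synthetic tones are a much easier case than noisy speech, so treat the sketch as a starting point rather than a working classifier.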

Good luck!

游魂 2024-10-25 08:03:56

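Using CMU Sphinx 4 to extract low-level information such as pitch and power might be a bit difficult (though older versions may have that functionality). I'd recommend that you use Praat instead. You can write scripts to extract the pitch tier and each of the formants in a speaker's voice. Honestly, the Praat scripting language is horrible, but it quickly gets many things done that would otherwise take a long time. Many Praat scripts are posted online as well. See http://www.fon.hum.uva.nl/praat/.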
