What's a good way to extract the speech parts of arbitrary audio files?
I have a set of audio files uploaded by users, and there's no telling what they contain.
I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.
(I'm targeting a Linux environment, and developing on a Mac)
I've found SoX, which looks promising, and it has a 'vad' effect (Voice Activity Detection). However, this appears to find only the first instance of speech and strip the audio up to that point, so it's close, but not quite right.
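For example, the usual idiom only trims silence at the ends of the file rather than splitting it at every pause:

    # Trim leading silence, then reverse, trim again, and reverse back
    # to also drop trailing silence -- nothing in between gets split.
    sox input.wav output.wav vad reverse vad reverse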
I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of SoX's 'vad'.
Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?
EnergyDetector
For Voice Activity Detection, I have been using the EnergyDetector program from the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, which is based on the ALIZE library.
It works with feature files, not audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter and use that parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:
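(A reconstructed invocation based on the SPro documentation; the exact flags may differ with your SPro version:)

    # 19 cepstral coefficients (-p 19), log-energy (-e),
    # delta (-D) and acceleration (-A) coefficients, 16-bit PCM input
    sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm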
It will extract 19 MFCCs + the log-energy coefficient + first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.
You will then run EnergyDetector in this way:
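(Again a sketch, assuming the standard LIA_RAL command-line conventions; cfg/EnergyDetector.cfg is the configuration file discussed below, and the feature name is given without its extension:)

    EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output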
If you use the configuration file that you find at the end of this answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.

As a reference, I attach my EnergyDetector configuration file:
CMU Sphinx
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> the 'vader' element.
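A rough sketch of such a pipeline (assuming the pocketsphinx GStreamer plugin is installed; element and property names may vary between plugin versions):

    # The vader element gates the stream into speech-only segments;
    # here it just runs over a file and discards the output.
    gst-launch-0.10 filesrc location=input.wav ! decodebin ! audioconvert \
        ! audioresample ! vader name=vad auto-threshold=true ! fakesink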
Other VADs
I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
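A minimal sketch of that API (this just classifies one hand-built frame of silence; example.py shows how to glue consecutive speech frames into output files):

    import webrtcvad

    # Aggressiveness mode 0-3; higher values are more likely to
    # classify a borderline frame as non-speech.
    vad = webrtcvad.Vad(3)

    # webrtcvad expects 16-bit mono PCM at 8000, 16000, 32000 or
    # 48000 Hz, in frames of exactly 10, 20 or 30 ms.
    sample_rate = 16000
    frame_duration_ms = 10
    nbytes = 2 * sample_rate * frame_duration_ms // 1000  # 2 bytes/sample
    frame = b"\x00" * nbytes  # 10 ms of silence

    print("Contains speech: %s" % vad.is_speech(frame, sample_rate))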
pyAudioAnalysis has silence-removal functionality.

In this library, silence removal can be as simple as this:
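(A sketch against the API at the commit referenced below; later releases renamed these to read_audio_file() and silence_removal(), and the file name here is a placeholder:)

    from pyAudioAnalysis import audioBasicIO as aIO
    from pyAudioAnalysis import audioSegmentation as aS

    # "recording.wav" is a placeholder path
    [Fs, x] = aIO.readAudioFile("recording.wav")

    # Returns a list of [start, end] pairs (in seconds) for the detected
    # active segments, using 20 ms short-term windows; smoothWindow and
    # Weight tune how aggressively silence is removed.
    segments = aS.silenceRemoval(x, Fs, 0.020, 0.020,
                                 smoothWindow=1.0, Weight=0.3, plot=False)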
silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, 10% of the highest-energy frames along with 10% of the lowest ones are used. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.

Reference paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610
SPro and HTK are the toolkits you need. You can also see their implementation using the documentation of the ALIZE toolkit:
http://alize.univ-avignon.fr/doc.html