What is a good way to extract the speech portions from an arbitrary audio file?

Posted 2024-10-28 07:01:45

I have a set of audio files that are uploaded by users, and there is no knowing what they contain.

I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.

(I'm targeting a Linux environment, and developing on a Mac)

I've found Sox, which looks promising, and it has a 'vad' mode (Voice Activity Detection). However this appears to find the first instance of speech and strips audio until that point, so it's close, but not quite right.
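
For illustration, the invocations I'm describing are roughly these (filenames are placeholders); the first strips everything before the first detected speech, and the commonly suggested vad/reverse chain only trims leading and trailing silence rather than splitting the file at every pause:

sox input.wav trimmed-start.wav vad
sox input.wav trimmed-both.wav vad reverse vad reverse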

I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.

Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?

Comments (4)

烟花肆意 2024-11-04 07:01:45

EnergyDetector

For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (was LIA_RAL) speaker recognition toolkit, based on the ALIZE library.

It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) together with the log-energy parameter, and I use this parameter for the VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:

sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm

It will extract 19 MFCCs, the log-energy coefficient, and the first- and second-order delta coefficients. The energy coefficient is the 19th; you specify that in the EnergyDetector configuration file (the featureServerMask parameter).

You will then run EnergyDetector in this way:

EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output 

If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
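
If you want to drive this from Python, a minimal sketch could look like the following; the directory layout and the feature basename "output" are just assumptions that match the commands above, not anything prescribed by SPro or MISTRAL:

import os
import subprocess

# EnergyDetector reads features from prm/ and writes label files to lbl/
# (per the configuration below), so create both directories up front.
os.makedirs("prm", exist_ok=True)
os.makedirs("lbl", exist_ok=True)

# Extract 19 MFCCs + log-energy + first/second order deltas, as above.
subprocess.run(["sfbcep", "-F", "PCM16", "-p", "19", "-e", "-D", "-A",
                "input.wav", "prm/output.prm"], check=True)

# Run the energy-based VAD; the argument is the feature basename.
subprocess.run(["EnergyDetector", "--config", "cfg/EnergyDetector.cfg",
                "--inputFeatureFilename", "output"], check=True)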

As a reference, I attach my EnergyDetector configuration file:

*** EnergyDetector Config File
***

loadFeatureFileExtension        .prm
minLLK                          -200
maxLLK                          1000
bigEndian                       false
loadFeatureFileFormat           SPRO4
saveFeatureFileFormat           SPRO4
saveFeatureFileSPro3DataKind    FBCEPSTRA
featureServerBufferSize         ALL_FEATURES
featureServerMemAlloc           50000000
featureFilesPath                prm/
mixtureFilesPath                gmm/
lstPath                         lst/
labelOutputFrames               speech
labelSelectedFrames             all
addDefaultLabel                 true
defaultLabel                    all
saveLabelFileExtension          .lbl
labelFilesPath                  lbl/    
frameLength                     0.01
segmentalMode                   file
nbTrainIt                       8       
varianceFlooring                0.0001
varianceCeiling                 1.5     
alpha                           0.25
mixtureDistribCount             3
featureServerMask               19      
vectSize                        1
baggedFrameProbabilityInit      0.1
thresholdMode                   weight

CMU Sphinx

The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.

A very recent addition is GStreamer support, which means you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> the 'vader' element.

Other VADs

I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.

前事休说 2024-11-04 07:01:45

webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.

It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.

The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:

import webrtcvad

vad = webrtcvad.Vad()
# sample must be 16-bit mono PCM audio data at 8, 16 or 32 kHz,
# and exactly 10, 20, or 30 milliseconds long.
print(vad.is_speech(sample, sample_rate))
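
If you need to do the framing yourself, here is a minimal sketch of my own (not part of the package; the file name, aggressiveness and frame length are arbitrary choices) that reads a mono, 16-bit PCM .wav file with the standard wave module and classifies it frame by frame:

import wave
import webrtcvad

vad = webrtcvad.Vad(2)     # aggressiveness from 0 (least) to 3 (most aggressive)
frame_ms = 30              # webrtcvad accepts 10, 20 or 30 ms frames

with wave.open("input.wav", "rb") as wf:
    # The file is assumed to be mono 16-bit PCM at a supported rate
    # (e.g. 8000, 16000 or 32000 Hz).
    sample_rate = wf.getframerate()
    samples_per_frame = int(sample_rate * frame_ms / 1000)
    while True:
        frame = wf.readframes(samples_per_frame)
        if len(frame) < samples_per_frame * 2:   # 2 bytes per 16-bit sample
            break
        print(vad.is_speech(frame, sample_rate))
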
蔚蓝源自深海 2024-11-04 07:01:45

pyAudioAnalysis has silence-removal functionality.

In this library, silence removal can be as simple as that:

from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Fs is the sampling rate, x the signal as a numpy array
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# 20 ms frame size and step; returns the detected speech segments
segments = aS.silenceRemoval(x,
                             Fs,
                             0.020,
                             0.020,
                             smoothWindow=1.0,
                             Weight=0.3,
                             plot=True)
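
If you then want one audio file per detected segment (which is what the question asks for), a small follow-up sketch, assuming segments is a list of [start, end] times in seconds, could be:

import numpy as np
from scipy.io import wavfile

for i, (seg_start, seg_end) in enumerate(segments):
    # Slice the original signal and write the segment as its own WAV file.
    chunk = x[int(seg_start * Fs):int(seg_end * Fs)]
    wavfile.write("segment_{:02d}.wav".format(i), Fs, np.asarray(chunk, dtype=np.int16))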

silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. To this end, the 10% highest-energy frames and the 10% lowest-energy frames are used. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
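
As a rough illustration of that idea (my own sketch of the approach described above, not the library's actual code), one could pseudo-label frames by energy and fit a probabilistic SVM like this:

import numpy as np
from sklearn.svm import SVC

def speech_probabilities(frame_energies):
    # Pseudo-label the 10% lowest- and 10% highest-energy frames, fit an SVM
    # with probabilistic output, then score every frame.
    e = np.asarray(frame_energies, dtype=float).reshape(-1, 1)
    n = max(1, int(0.10 * len(e)))
    order = np.argsort(e.ravel())
    low, high = order[:n], order[-n:]               # lowest / highest energy frames
    X = np.vstack([e[low], e[high]])
    y = np.concatenate([np.zeros(n), np.ones(n)])   # 0 = non-speech, 1 = speech
    clf = SVC(probability=True).fit(X, y)
    return clf.predict_proba(e)[:, 1]               # P(speech) per frame; threshold dynamically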

Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610

我不在是我 2024-11-04 07:01:45

SPro and HTK are the toolkits you need. You can also see their implementation using the documentation of the Alize toolkit.

http://alize.univ-avignon.fr/doc.html
