What's a good way to extract the speech parts of arbitrary audio files?
I have a set of audio files uploaded by users, and there's no telling what they contain.
I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.
(I'm targeting a Linux environment, and developing on a Mac)
I've found SoX, which looks promising, and it has a 'vad' effect (Voice Activity Detection). However, this appears to find only the first instance of speech and strip the audio up to that point, so it's close, but not quite right.
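For example, the usual idiom only trims silence at the ends of the file rather than splitting it at every pause:

    # Trim leading silence, then reverse, trim again, and reverse back
    # to also drop trailing silence -- nothing in between gets split.
    sox input.wav output.wav vad reverse vad reverse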
I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of SoX's 'vad'.
Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?
EnergyDetector
For Voice Activity Detection, I have been using the EnergyDetector program from the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, which is based on the ALIZE library.
It works with feature files, not audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter and use that parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:
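(A reconstructed invocation based on the SPro documentation; the exact flags may differ with your SPro version:)

    # 19 cepstral coefficients (-p 19), log-energy (-e),
    # delta (-D) and acceleration (-A) coefficients, 16-bit PCM input
    sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm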
It will extract 19 MFCCs + the log-energy coefficient + first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.
You will then run EnergyDetector in this way:
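(Again a sketch, assuming the standard LIA_RAL command-line conventions; cfg/EnergyDetector.cfg is the configuration file discussed below, and the feature name is given without its extension:)

    EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output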
If you use the configuration file that you find at the end of this answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.

As a reference, I attach my EnergyDetector configuration file:
CMU Sphinx
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> the 'vader' element.
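A rough sketch of such a pipeline (assuming the pocketsphinx GStreamer plugin is installed; element and property names may vary between plugin versions):

    # The vader element gates the stream into speech-only segments;
    # here it just runs over a file and discards the output.
    gst-launch-0.10 filesrc location=input.wav ! decodebin ! audioconvert \
        ! audioresample ! vader name=vad auto-threshold=true ! fakesink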
Other VADs
I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
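A minimal sketch of that API (this just classifies one hand-built frame of silence; example.py shows how to glue consecutive speech frames into output files):

    import webrtcvad

    # Aggressiveness mode 0-3; higher values are more likely to
    # classify a borderline frame as non-speech.
    vad = webrtcvad.Vad(3)

    # webrtcvad expects 16-bit mono PCM at 8000, 16000, 32000 or
    # 48000 Hz, in frames of exactly 10, 20 or 30 ms.
    sample_rate = 16000
    frame_duration_ms = 10
    nbytes = 2 * sample_rate * frame_duration_ms // 1000  # 2 bytes/sample
    frame = b"\x00" * nbytes  # 10 ms of silence

    print("Contains speech: %s" % vad.is_speech(frame, sample_rate))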
pyAudioAnalysis has silence-removal functionality.

In this library, silence removal can be as simple as this:
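(A sketch against the API at the commit referenced below; later releases renamed these to read_audio_file() and silence_removal(), and the file name here is a placeholder:)

    from pyAudioAnalysis import audioBasicIO as aIO
    from pyAudioAnalysis import audioSegmentation as aS

    # "recording.wav" is a placeholder path
    [Fs, x] = aIO.readAudioFile("recording.wav")

    # Returns a list of [start, end] pairs (in seconds) for the detected
    # active segments, using 20 ms short-term windows; smoothWindow and
    # Weight tune how aggressively silence is removed.
    segments = aS.silenceRemoval(x, Fs, 0.020, 0.020,
                                 smoothWindow=1.0, Weight=0.3, plot=False)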
silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, 10% of the highest-energy frames along with 10% of the lowest ones are used. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.

Reference paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610
SPro and HTK are the toolkits you need. You can also see their implementation using the documentation of the ALIZE toolkit:
http://alize.univ-avignon.fr/doc.html