Converting from one MFCC type to another - HTK

Posted on 2024-11-27 05:22:07

I am working with the HTK toolkit on a word spotting task and have a classic training and testing data mismatch. The training data consisted of only "clean" (recorded over a mic) data. The data was converted to MFCC_E_D_A parameters which were then modelled by HMMs (phone-level). My test data has been recorded over landline and mobile phone channels (inviting distortions and the like). Using the MFCC_E_D_A parameters with HVite results in incorrect output. I want to make use of cepstral mean normalization with MFCC_E_D_A_Z parameters but it would not be of much use since the HMMs are not modelled with this data. My questions are as follows:

  1. Is there any way to convert MFCC_E_D_A_Z into MFCC_E_D_A? The pipeline would then be: input -> MFCC_E_D_A_Z -> MFCC_E_D_A -> HMM log-likelihood computation.
  2. Is there any way to convert the existing HMMs which model MFCC_E_D_A parameters into MFCC_E_D_A_Z?

If there is a way to do (1) from above, what would the config file for HCopy look like? I wrote the following HCopy config file for conversion:

SOURCEFORMAT = MFCC_E_D_A_Z
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T

This does not work. How can I improve this?

臻嫒无言 2024-12-04 05:22:07

You need to understand that telephone recordings cover a different frequency range because the channel band-limits them. Usually only frequencies from 200 to 3500 Hz are present. A wideband acoustic model is trained on the range from 100 to 6800 Hz. It will not decode telephone speech reliably because telephone speech lacks the frequencies from 3500 to 6800 Hz that the model expects. This is not related to the feature type, mean normalization, or distortion; you simply can't do that.

You need to train your original model on audio converted to 8 kHz, or at least modify the filterbank parameters to match the telephone frequency range.
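
As a rough illustration of the second option, the question's HCopy config could be adapted along the following lines. This is only a sketch under assumptions: LOFREQ and HIFREQ are the HTK parameters that set the filterbank band edges, the 200 and 3500 values are taken from the range mentioned above, SOURCEFORMAT = WAV assumes the recordings are WAV files, and SOURCEKIND = WAVEFORM assumes the features are computed from audio rather than from existing parameter files. The same config would have to be used to re-extract the training features and retrain the HMMs, so that training and test data share the same band.

```
# Sketch only: band-limited MFCC extraction from waveform input.
# SOURCEFORMAT and the band edges below are assumptions, not tested values.
SOURCEKIND = WAVEFORM
SOURCEFORMAT = WAV
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
# Restrict the mel filterbank to the telephone band (in Hz)
LOFREQ = 200
HIFREQ = 3500
```

Note that SOURCEFORMAT names a file format (for example HTK or WAV), while the parameter kind of the input goes in SOURCEKIND. Whether 26 channels is still a sensible choice for the narrower band is something to check experimentally.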
