Explanation of Vector Quantization in Speech Processing
I'm having trouble determining from this research paper exactly how I can reproduce the Standard Vector Quantization algorithm to determine the language of an unidentified speech input, based on a training set of data. Here's some basic info:
Abstract info
Language recognition (e.g. Japanese, English, German, etc.) using acoustic features is an important yet difficult problem for current speech technology. ... The speech database used in this paper contains 20 languages: 16 sentences uttered twice by 4 males and 4 females. The duration of each sentence is about 8 seconds. The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language is characterized by its own VQ codebook.
Recognition Algorithms
The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language, k, is characterized by its own VQ codebook. In the recognition stage, input speech is quantized by each language's codebook and the accumulated quantization distortion, d_k, is calculated. The language which has the minimal distortion is recognized. For calculating the VQ distortion, several LPC spectral distortion measures are applied ... in this case, the WLR (weighted likelihood ratio) distance.
Standard VQ Algorithm:
A codebook for each language is generated using training sentences. The accumulated distance for the input vectors in a sentence, ![alt text][4], is defined as: [![alt text][5]][5]
The distance d can be any distance which corresponds to the acoustic features, and it must be the same as the one used for codebook generation. Each language is characterized by its VQ codebook.
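For reference, the accumulated distortion in standard VQ is usually written in the form below. This is a sketch under that assumption, not the paper's exact notation: x_i is the i-th feature vector of the input sentence, N the number of such vectors, M the codebook size, and v_j^{(k)} the j-th entry of the codebook for language k (all symbol names are assumed here).

```latex
d_k = \sum_{i=1}^{N} \min_{1 \le j \le M} d\left(x_i, v^{(k)}_j\right),
\qquad
\hat{k} = \arg\min_{k} d_k
```

The second expression restates the recognition rule from the quoted text: the language with the minimal accumulated distortion is the one recognized.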
My question is, how exactly do I do this? I have a set of 50 sentences in English. In MATLAB, I can easily calculate the WLR for any given signal. But how do I formulate a codebook, given that I must use the WLR for "codebook generation" for English? I'm also curious as to how to compare a VQ codebook of size 16 (which was found to be the best size) to a given input signal. If anyone could help distill this paper down for me, I'd appreciate it greatly.
Thanks!
Comments (1)
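The second question (comparing a codebook to a given signal) is the easier one: for each codebook entry V_k_j you must calculate the distance d to each vector of the input signal. The j with the smallest distance d corresponds to the best-fitting codebook entry. Accumulating these minimum distances over the whole input gives d_k, and the language with the smallest d_k is the one recognized. As the distance function you can use WLR.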
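The comparison step might look roughly like the following. This is a minimal sketch in plain Python rather than MATLAB, and it makes several assumptions: the input has already been turned into a list of feature vectors (`frames`), each language has a ready-made codebook, and `squared_euclidean` is only a stand-in for your own WLR function. The names `accumulated_distortion` and `recognize` and the toy numbers are made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def squared_euclidean(x: Vector, v: Vector) -> float:
    """Stand-in distance (assumption); replace with your WLR implementation."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def accumulated_distortion(frames: List[Vector], codebook: List[Vector],
                           dist: Distance = squared_euclidean) -> float:
    """Sum, over all input vectors, of the distance to the nearest codebook entry."""
    return sum(min(dist(x, v) for v in codebook) for x in frames)


def recognize(frames: List[Vector], codebooks: Dict[str, List[Vector]],
              dist: Distance = squared_euclidean) -> str:
    """Return the language whose codebook yields the smallest accumulated distortion."""
    return min(codebooks,
               key=lambda lang: accumulated_distortion(frames, codebooks[lang], dist))


# Toy 2-D data with invented values, only to show the call pattern.
codebooks = {
    "english": [(0.0, 0.0), (1.0, 1.0)],
    "german": [(5.0, 5.0), (6.0, 6.0)],
}
frames = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.1)]
print(recognize(frames, codebooks))
```

Whatever distance you pass as `dist` should be the same one used when the codebooks were trained, as the paper excerpt in the question requires.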
Building the codebook (training) is a bit more complicated. You must divide your sentences into vectors of length N (16) and then use some clustering algorithm (like k-means) to cluster these vectors. Then find the mean of every cluster; these means will be the codebook entries. This is the first thing that comes to mind.
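The k-means idea could be sketched as below, again in plain Python and under stated assumptions. The codebook size defaults to 16 because that is the size the question reports as best; the feature-vector length is left to the caller. `squared_euclidean` is a placeholder for WLR, and `train_codebook` is a hypothetical name. Note that the arithmetic mean is the exact centroid only for squared Euclidean distance, so with WLR it may be only an approximation.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_euclidean(x: Vector, v: Vector) -> float:
    """Stand-in distance (assumption); replace with your WLR implementation."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def train_codebook(vectors: List[Vector], size: int = 16,
                   dist: Callable[[Vector, Vector], float] = squared_euclidean,
                   iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Plain k-means: assign each vector to its nearest entry, then re-average."""
    rng = random.Random(seed)
    # Initialise with `size` training vectors picked at random
    # (raises ValueError if there are fewer vectors than entries).
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(size)]
        for x in vectors:
            nearest = min(range(size), key=lambda j: dist(x, codebook[j]))
            clusters[nearest].append(x)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous entry
                dim = len(members[0])
                codebook[j] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


# Toy data: two well-separated groups of 2-D points (invented values).
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(train_codebook(training, size=2)))
```

For the real task, `vectors` would hold the feature vectors extracted from all 50 English training sentences, and the same procedure would be repeated per language.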
Another algorithm (which I believe will be better) can be found here.
Also, two simple training algorithms are described on Wikipedia.
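Assuming this refers to the simplest training loop in Wikipedia's vector quantization article (pick a sample at random, move the nearest codebook entry a small fraction of the way towards it, repeat), a rough sketch could look like this. The function name `online_vq`, the `rate` and `steps` values, and the use of squared Euclidean distance in place of WLR are all assumptions made for illustration.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def online_vq(vectors: List[Vector], size: int, steps: int = 1000,
              rate: float = 0.05, seed: int = 0) -> List[List[float]]:
    """Online VQ training: nudge the nearest entry towards one random sample per step."""
    rng = random.Random(seed)
    # Initialise with `size` training vectors picked at random.
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(steps):
        x = rng.choice(vectors)
        # Nearest entry under squared Euclidean distance (stand-in for WLR).
        nearest = min(range(size),
                      key=lambda j: sum((a - c) ** 2 for a, c in zip(x, codebook[j])))
        # Move that entry a fraction `rate` of the way towards the sample.
        codebook[nearest] = [c + rate * (a - c) for a, c in zip(x, codebook[nearest])]
    return codebook


# Toy data with invented values.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(online_vq(samples, size=2))
```

Compared with batch k-means, this variant never stores cluster memberships, at the cost of a result that depends on the step size and the random order of samples.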