Non-speech noise or sound recognition software?

Posted 2024-09-30 11:49:10

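I'm working on some software for children and would like it to respond to a number of non-speech sounds: for example clapping, barking, whistling, fart noises, etc.

I've used CMU Sphinx and the Windows Speech API in the past. As far as I can tell, however, neither of them supports non-speech noises, and in fact I believe they actively filter them out.

In general I'm asking "how do I get this functionality?", but I suspect it may help to break it down into three questions, which are my guesses at what to search for next:

1) Is there a way to use one of the major speech recognition engines to recognize non-word sounds by changing the acoustic model or the pronunciation lexicon?
2) (or) Is there already an existing library that does non-word noise recognition?
3) (or) I have a little familiarity with Hidden Markov Models and the underlying technology of speech recognition from college, but no good estimate of how difficult it would be to create a very small noise/sound recognizer from scratch (say, fewer than 20 noises to be recognized). If 1) and 2) fail, any estimate of how long it would take to roll my own?

Thanks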

仅冇旳回忆 2024-10-07 11:49:10

Yes, you can use speech recognition software such as CMU Sphinx to recognize non-speech sounds. To do so, you need to create your own acoustic and language models and define a lexicon restricted to your task. To train the acoustic model, however, you need enough training data in which the sounds of interest are annotated.

In short, the steps are as follows:

First, prepare the resources for training: the lexicon (dictionary), the phoneme set, and so on. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. In your case, though, you need to redefine the phoneme set and the lexicon. Namely, you should model the fillers as real words (so without the surrounding ++), and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest is to use a single model for all speech phonemes. Your lexicon will then look like this:

CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
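
As a rough sketch of this step, the script below writes the three small text files that the CMU Sphinx training tutorial keeps in a database's `etc/` directory: a dictionary, a phone list, and a filler dictionary. The database name `sounds`, the helper name `write_resources`, and the exact file names are assumptions made for illustration, so check them against the tutorial linked above before relying on them.

```python
from pathlib import Path

# Labels from the lexicon above; each one is both a "word" and its own single "phone".
LABELS = ["CLAP", "BARK", "WHISTLE", "FART", "SPEECH"]


def write_resources(etc_dir, db_name="sounds", labels=LABELS):
    """Write a dictionary, phone list and filler dictionary for a label-only task.

    The <db_name>.dic / .phone / .filler naming is an assumption based on the
    CMU Sphinx training tutorial; adjust it to your own setup.
    """
    etc = Path(etc_dir)
    etc.mkdir(parents=True, exist_ok=True)

    # Dictionary: one entry per sound class, written as "WORD PHONE".
    dic = "".join(f"{label} {label}\n" for label in labels)
    (etc / f"{db_name}.dic").write_text(dic, encoding="utf-8")

    # Phone list: every phone used by either dictionary, plus SIL for silence.
    phones = sorted(set(labels) | {"SIL"})
    (etc / f"{db_name}.phone").write_text("\n".join(phones) + "\n", encoding="utf-8")

    # Filler dictionary: only silence is left, since the noises are now real words.
    filler = "<s> SIL\n</s> SIL\n<sil> SIL\n"
    (etc / f"{db_name}.filler").write_text(filler, encoding="utf-8")
    return etc
```

Calling `write_resources("sounds_db/etc")` would create the three files under a hypothetical `sounds_db/etc` directory.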

Second, prepare labeled training data, similar to VoxForge, except that the text annotations must contain only labels from your lexicon. Non-speech sounds must of course be labeled correctly as well. The open question is where to get a large enough amount of such data, but I guess it should be possible.
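
To make the labeling requirement concrete, here is a minimal sketch that turns a set of annotations into the file-ID list and transcription text in the style shown in the CMU Sphinx training tutorial, rejecting any label that is not in the lexicon. The function name `build_training_lists` and the utterance IDs in the usage note are hypothetical.

```python
ALLOWED = {"CLAP", "BARK", "WHISTLE", "FART", "SPEECH"}


def build_training_lists(annotations, allowed=ALLOWED):
    """Turn {utterance_id: [labels...]} into (fileids_text, transcription_text).

    Raises ValueError if an annotation uses a label that is missing from the lexicon.
    """
    fileid_lines, transcript_lines = [], []
    for utt_id in sorted(annotations):
        labels = annotations[utt_id]
        unknown = [label for label in labels if label not in allowed]
        if unknown:
            raise ValueError(f"{utt_id}: labels not in lexicon: {unknown}")
        fileid_lines.append(utt_id)
        # The transcript refers to the utterance by the last path component of its ID.
        name = utt_id.rsplit("/", 1)[-1]
        transcript_lines.append(f"<s> {' '.join(labels)} </s> ({name})")
    return "\n".join(fileid_lines) + "\n", "\n".join(transcript_lines) + "\n"
```

For example, `build_training_lists({"rec/a01": ["CLAP", "CLAP"]})` yields a transcription line `<s> CLAP CLAP </s> (a01)`.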

With that in place, you can train your model. The task is simpler than speech recognition; for instance, you don't need triphones, monophones are enough.

Assuming every sound class (speech included) has the same prior probability, the simplest language model is a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):

#JSGF V1.0;
/**
 * Loop grammar: any sequence of one or more sound-class labels
 */
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
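
If the label set is still changing, the grammar can be generated from the same list as the lexicon. The following is only a small sketch; the helper name `loop_grammar` is an assumption, and the output mirrors the rule shown above.

```python
def loop_grammar(labels, name="foo"):
    """Return a JSGF grammar that accepts one or more labels in any order."""
    if not labels:
        raise ValueError("at least one label is required")
    alternatives = " | ".join(labels)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <{name}> = ({alternatives})+ ;\n"
    )
```

Calling `loop_grammar(["CLAP", "BARK", "WHISTLE", "FART", "SPEECH"])` reproduces the `public <foo>` rule from the grammar above.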

This is the most basic way to use an ASR toolkit for your task. It can be improved further by fine-tuning the HMM configuration, using statistical language models, and modeling phonemes at a finer grain (e.g. distinguishing vowels and consonants instead of using a single SPEECH model); which of these helps depends on the nature of your training data.
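
As an illustration of what a statistical language model over sound labels could look like, the sketch below estimates bigram probabilities P(next | previous) from labeled sequences with additive smoothing. It only shows the estimation idea: in practice a language-modeling toolkit would build the model, and the function name `bigram_probs` and the smoothing constant `alpha` are assumptions.

```python
from collections import Counter


def bigram_probs(sequences, labels, alpha=1.0):
    """Estimate P(next | prev) over `labels` with additive (add-alpha) smoothing."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    vocab = len(labels)
    return {
        prev: {
            nxt: (pair_counts[(prev, nxt)] + alpha) / (prev_counts[prev] + alpha * vocab)
            for nxt in labels
        }
        for prev in labels
    }
```

With equal priors replaced by such estimates, frequent transitions (for example repeated claps) become more likely than rare ones during decoding.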

Outside the speech recognition framework, you can build a simple static classifier that analyzes the input frame by frame. Convolutional neural networks operating on spectrograms perform quite well on this task.
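
To show the frame-by-frame idea without any dependencies, here is a deliberately simple stand-in: it is not a CNN on spectrograms but a nearest-centroid classifier over two per-frame features (zero-crossing rate and RMS energy), followed by a majority vote. The 16 kHz sample rate, the frame size, and the synthetic "whistle" (a pure tone) and "clap" (white noise) signals are all assumptions for illustration; real recordings would need richer features and real training data.

```python
import math
import random

FRAME = 400  # samples per frame (25 ms at an assumed 16 kHz sample rate)


def frame_features(frame):
    """Return (zero-crossing rate, RMS energy) for one frame of samples."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return crossings / (len(frame) - 1), rms


def frames(signal, size=FRAME):
    """Split a signal into non-overlapping frames, dropping the ragged tail."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def train_centroids(examples):
    """examples: {label: signal}. Returns {label: mean feature vector}."""
    centroids = {}
    for label, signal in examples.items():
        feats = [frame_features(f) for f in frames(signal)]
        centroids[label] = tuple(sum(col) / len(feats) for col in zip(*feats))
    return centroids


def classify(signal, centroids):
    """Label each frame by its nearest centroid, then take a majority vote."""
    votes = {}
    for f in frames(signal):
        feat = frame_features(f)
        best = min(centroids, key=lambda lab: math.dist(feat, centroids[lab]))
        votes[best] = votes.get(best, 0) + 1
    if not votes:
        raise ValueError("signal is shorter than one frame")
    return max(votes, key=votes.get)


def tone(freq, n=8000, rate=16000, amp=0.5):
    """Synthetic whistle-like signal: a pure sine tone (illustrative only)."""
    return [amp * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


def noise(n=8000, seed=0):
    """Synthetic clap-like signal: uniform white noise (illustrative only)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]
```

A usage example would train with `train_centroids({"WHISTLE": tone(2000), "CLAP": noise(seed=1)})` and then call `classify(...)` on a new signal; swapping the two hand-made features for spectrogram patches and the centroids for a trained network gives the CNN variant mentioned above.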
