I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.
I've used CMU Sphinx and the Windows Speech API in the past; however, as far as I can tell, neither of these has any support for non-speech noises, and in fact I believe they actively filter them out.
In general, I'm looking for "how do I get this functionality?", but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:
- Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
- (or) Is there already an existing library to do non-word noise recognition?
- (or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate of how difficult it would be to create a very small noise/sound recognizer from scratch (say, fewer than 20 noises to be recognized). If 1) and 2) fail, any estimate of how long it would take to roll my own?
Thanks
Yes, you can use speech recognition software like CMU Sphinx to recognize non-speech sounds. For this, you need to create your own acoustic and language models and define a lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with the sounds of interest annotated.
In short, the sequence of steps is the following:
First, prepare the resources for training: the lexicon, the dictionary, etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. In your case, however, you need to redefine the phoneme set and the lexicon. Namely, you should model fillers as real words (so, no ++ markers around them), and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest is to have a single model for all speech phonemes. Your lexicon would then look something like this:
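A minimal sketch of such a dictionary, assuming hypothetical sound labels like CLAP and BARK (each "word" maps to a single dedicated phone, plus one generic SPEECH phone covering all actual speech):

    CLAP     CLAP
    BARK     BARK
    WHISTLE  WHISTLE
    FART     FART
    SPEECH   SPEECH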
Second, prepare labeled training data: something similar to VoxForge, but the text annotations must contain only labels from your lexicon. Of course, the non-speech sounds must be labeled correctly as well. The good question here is where to get a large enough amount of such data, but I guess it should be possible.
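As an illustration, a sphinxtrain-style transcription file for such data might look like the sketch below; the utterance IDs in parentheses are hypothetical:

    <s> CLAP CLAP SPEECH </s> (recording_0001)
    <s> BARK WHISTLE </s> (recording_0002)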
Having that, you can train your model. The task is simpler than full speech recognition; for instance, you don't need to use triphones, just monophones.
Assuming equal prior probabilities for all sounds/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):
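A minimal JSGF sketch of such a loop grammar, reusing the hypothetical labels from the dictionary above:

    #JSGF V1.0;
    grammar sounds;
    public <sounds> = ( CLAP | BARK | WHISTLE | FART | SPEECH )+;

Every alternative is equally likely, and the + loop allows an arbitrary sequence of sounds within one utterance.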
This is the very basic approach to using an ASR toolkit for your task. It can be further improved by fine-tuning the HMM configuration, using statistical language models, and using fine-grained phoneme modeling (e.g., distinguishing vowels and consonants instead of having a single SPEECH model; this depends on the nature of your training data).
Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
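As a rough sketch of that second approach, here is a tiny frame-level classifier over spectrogram patches (PyTorch assumed; the input shape, layer sizes, and class count are all illustrative, not prescribed by the answer):

    # Classify short log-spectrogram patches into sound classes.
    import torch
    import torch.nn as nn

    NUM_CLASSES = 20  # e.g. clap, bark, whistle, ... (hypothetical)

    class SoundClassifier(nn.Module):
        def __init__(self, num_classes=NUM_CLASSES):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),  # 64x32 -> 32x16
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),  # 32x16 -> 16x8
            )
            self.classifier = nn.Linear(32 * 16 * 8, num_classes)

        def forward(self, x):  # x: (batch, 1, 64, 32) spectrogram patch
            x = self.features(x)
            return self.classifier(x.flatten(1))  # logits per sound class

    model = SoundClassifier()
    logits = model(torch.randn(4, 1, 64, 32))  # four random dummy patches
    print(logits.shape)                        # torch.Size([4, 20])

In practice you would compute log-mel spectrogram patches from the microphone stream, run each patch through the network, and smooth the per-frame predictions over time.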
I don't know of any existing libraries you can use; I suspect you may have to roll your own.
Would this paper be of interest? It has some technical detail; they seem to be able to recognise claps and differentiate them from whistles.