iOS / C: Algorithm to detect phonemes
I am searching for an algorithm to determine whether real-time audio input matches one of 144 given (and comfortably distinct) phoneme pairs.
Preferably the lowest-level technique that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair ('laa', 'duu', 'bee', etc.) in response to a visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers (iPhone App › Add voice recognition? is the best source of information I have found). However, I can't see how I would adapt such a package. Can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would the user need to train the recogniser? I would have thought not, as this is such an elementary task compared with full language models of thousands of words and a far larger and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel that using a fully featured continuous speech recogniser is like using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that solves the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution that runs pretty much in real time, so that even as they are singing the note, the display first blinks on to show that it picked up the phoneme pair that was sung, and then glows to show whether they are singing the correct pitch.
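For the pitch half of that PS, one low-level option is plain autocorrelation on each audio frame. The following is a minimal C sketch under stated assumptions, not a tested iOS implementation: `estimate_pitch_hz` and `pitch_matches` are hypothetical names, the frame is assumed to hold mono float samples captured elsewhere, and the half-semitone tolerance (a frequency ratio of about 1.0293) is an arbitrary illustrative choice. It does not address the phoneme recognition itself.

```c
#include <stddef.h>

/* Estimate the fundamental frequency (Hz) of one mono frame by picking the
 * lag with the largest autocorrelation. min_hz and max_hz bound the search
 * to the expected singing range. Returns 0.0 if no usable lag is found. */
double estimate_pitch_hz(const float *frame, size_t n, double sample_rate,
                         double min_hz, double max_hz)
{
    if (frame == NULL || n < 2 || sample_rate <= 0.0 ||
        min_hz <= 0.0 || max_hz <= min_hz)
        return 0.0;

    size_t min_lag = (size_t)(sample_rate / max_hz); /* shortest period */
    size_t max_lag = (size_t)(sample_rate / min_hz); /* longest period  */
    if (min_lag < 1) min_lag = 1;
    if (max_lag >= n) max_lag = n - 1;
    if (min_lag > max_lag) return 0.0;

    size_t best_lag = 0;
    double best = 0.0;
    for (size_t lag = min_lag; lag <= max_lag; ++lag) {
        double sum = 0.0;
        for (size_t i = 0; i + lag < n; ++i)
            sum += (double)frame[i] * (double)frame[i + lag];
        if (sum > best) { /* keep the first, strongest peak */
            best = sum;
            best_lag = lag;
        }
    }
    return best_lag ? sample_rate / (double)best_lag : 0.0;
}

/* Returns 1 if measured_hz is within a frequency ratio of ratio_tol of
 * target_hz (e.g. 1.0293 is roughly half a semitone), 0 otherwise. */
int pitch_matches(double measured_hz, double target_hz, double ratio_tol)
{
    if (measured_hz <= 0.0 || target_hz <= 0.0 || ratio_tol < 1.0)
        return 0;
    double ratio = measured_hz / target_hz;
    return ratio <= ratio_tol && ratio * ratio_tol >= 1.0;
}
```

With a 2048-sample frame at 44.1 kHz, a search range of 80 Hz to 1 kHz fits comfortably inside the frame. The lag resolution is one sample, so accuracy degrades on higher notes unless the peak is interpolated.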
Comments (4)
If you are looking for a phone-level open-source recogniser, I would recommend HTK. The tool is very well documented in the HTK Book, which contains an entire chapter dedicated to building a phone-level real-time speech recogniser. From your problem statement, it seems you might be able to rework that example into your own solution. Possible pitfalls:
Since you want a phone-level recogniser, the amount of data needed to train the phone models would be very large. Your training database should also be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker, and lots of it.
Since this is open source, you should also check the licensing terms for details about shipping the code. A good alternative would be to record on the phone and send the recorded waveform over a data channel to a server for recognition, much like what Google does.
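The balance pitfall above can be checked mechanically for a list of training utterances. Here is a minimal C sketch, assuming each utterance is identified by a consonant index and a vowel index in the 12 x 12 system from the question; `struct pair` and `training_set_is_balanced` are hypothetical names, and the check only covers how often each phone occurs, not how much data or how many speakers are needed.

```c
#include <stddef.h>

#define NUM_CONSONANTS 12
#define NUM_VOWELS 12

/* One training utterance, identified by consonant and vowel index (0..11). */
struct pair {
    int consonant;
    int vowel;
};

/* Returns 1 if every consonant and every vowel occurs equally often (and at
 * least once) in the training list, 0 otherwise. Because there are as many
 * consonants as vowels, a balanced list gives all 24 phones the same count. */
int training_set_is_balanced(const struct pair *set, size_t count)
{
    size_t c_hits[NUM_CONSONANTS] = {0};
    size_t v_hits[NUM_VOWELS] = {0};

    for (size_t i = 0; i < count; ++i) {
        if (set[i].consonant < 0 || set[i].consonant >= NUM_CONSONANTS ||
            set[i].vowel < 0 || set[i].vowel >= NUM_VOWELS)
            return 0; /* index outside the 12 x 12 system */
        c_hits[set[i].consonant]++;
        v_hits[set[i].vowel]++;
    }

    if (c_hits[0] == 0)
        return 0; /* empty list, or first consonant never trained */
    for (int k = 0; k < NUM_CONSONANTS; ++k)
        if (c_hits[k] != c_hits[0])
            return 0;
    for (int k = 0; k < NUM_VOWELS; ++k)
        if (v_hits[k] != c_hits[0])
            return 0;
    return 1;
}
```

The 12-pair set proposed in the question (consonant1+vowel1 through consonant12+vowel12) passes this check, since each phone appears exactly once.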
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far into the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
I wouldn't label your problem a nut; I'd say it's more like a beast. It may be a different beast than natural-language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
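A restricted vocabulary for a Sphinx-based recogniser such as OpenEars would presumably start from the list of 144 pair labels. The C sketch below only builds that list; how it is handed to OpenEars is not covered here. `build_pair_labels` is a hypothetical name, and only "l", "d", "b" and "aa", "uu", "ee" come from the question's examples ('laa duu bee'); every other symbol is a placeholder.

```c
#include <stdio.h>
#include <stddef.h>

#define NUM_CONSONANTS 12
#define NUM_VOWELS 12
#define MAX_LABEL 8

/* HYPOTHETICAL symbol sets: only l/d/b and aa/uu/ee appear in the question;
 * the remaining entries are placeholders for illustration. */
static const char *const CONSONANTS[NUM_CONSONANTS] = {
    "l", "d", "b", "m", "n", "p", "t", "k", "g", "s", "f", "r"
};
static const char *const VOWELS[NUM_VOWELS] = {
    "aa", "uu", "ee", "oo", "ii", "ah", "eh", "ih", "oh", "uh", "ay", "ow"
};

/* Writes every consonant+vowel label into out, so that the pair (c, v) lands
 * at index c * NUM_VOWELS + v. Returns the number of labels written. */
size_t build_pair_labels(const char *const consonants[NUM_CONSONANTS],
                         const char *const vowels[NUM_VOWELS],
                         char out[][MAX_LABEL])
{
    size_t n = 0;
    for (int c = 0; c < NUM_CONSONANTS; ++c)
        for (int v = 0; v < NUM_VOWELS; ++v)
            snprintf(out[n++], MAX_LABEL, "%s%s", consonants[c], vowels[v]);
    return n;
}
```

The caller supplies a `char labels[NUM_CONSONANTS * NUM_VOWELS][MAX_LABEL]` buffer. Because these labels are invented syllables rather than dictionary words, a recogniser would likely also need pronunciations for them, which this sketch does not provide.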
http://www.hfink.eu/matchbox
This page links to both a YouTube video demo and the GitHub source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does a lot of the work already.