Compensating for channel effects
I am trying to work on a system where the quality of a recorded sentence is rated by a computer. There are three modes under which this system operates:
- When the person records a sentence using a mic and mixer arrangement.
- When the user records over a landline.
- When the user records over a mobile phone.
I notice that the scores I get from recordings using the above 3 sources are in the following order: Mic_score > Landline_score > Mobile_score
It is likely that the above order is because of the effects of the codecs and channel characteristics. My questions are:
- What can be done to compensate for channel/codec-introduced artifacts so that scores are consistent across channels? If the answer is some sort of inverse filtering, please provide some links where I could get started (see the sketch after this list).
- How do I detect which channel the input speech has been recorded on? Using HMMs?
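One common signal-level compensation for a roughly stationary channel is cepstral mean normalization (CMN): a fixed channel transfer function becomes an additive offset in the cepstral domain, so subtracting the per-utterance mean of the cepstral features cancels much of it. This is one practical form of the inverse filtering asked about above. A minimal sketch, assuming the recordings are mono WAV files and that numpy and librosa are available:

```python
import numpy as np
import librosa

def cmn_features(wav_path, n_mfcc=13):
    """Return cepstral-mean-normalized MFCCs for one utterance.

    A stationary channel/codec shows up as a near-constant offset in the
    cepstral domain, so subtracting the per-utterance mean removes much of
    that offset (classic cepstral mean normalization).
    """
    y, sr = librosa.load(wav_path, sr=None)                  # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc - mfcc.mean(axis=1, keepdims=True)           # subtract the channel mean

# Hypothetical usage: score from normalized features instead of raw audio.
# feats = cmn_features("sentence_mobile.wav")
```

Whether this helps depends on whether the scorer works from spectral/cepstral features; if it scores the raw waveform directly, CMN would have to be applied inside the scoring pipeline rather than as a preprocessing step on the audio.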
Edit 1: I am not at liberty to go into the details of the criteria. The current scores that I get from the mic, landline and mobile (for the same sentence, spoken in the same manner over the three media) are something like 80, 66 and 41. This difference may be because of the channel effects. If the content and the manner of speaking the sentence are the same, then I am looking for an algorithm that normalizes the scores (they need not be identical, but they should be close); one possible score-level calibration is sketched below.
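If the scorer itself cannot be modified, a score-level alternative is to calibrate each channel against a reference: record a small set of identical sentences over all three channels, then fit a mapping from each channel's scores to the mic scores. A minimal sketch using a per-channel linear fit; the paired scores below are hypothetical placeholders:

```python
import numpy as np

def fit_channel_calibration(channel_scores, reference_scores):
    """Least-squares fit of slope and offset that maps one channel's scores
    onto the reference (mic) scores for the same sentences."""
    slope, offset = np.polyfit(channel_scores, reference_scores, deg=1)
    return slope, offset

def calibrate(score, slope, offset):
    return slope * score + offset

# Hypothetical paired data: the same sentences scored over mic and mobile.
mic_scores    = np.array([80.0, 75.0, 83.0, 78.0])
mobile_scores = np.array([41.0, 38.0, 45.0, 40.0])

slope, offset = fit_channel_calibration(mobile_scores, mic_scores)
print(calibrate(41.0, slope, offset))   # a mobile score mapped onto the mic scale
```

A linear fit assumes the channel mainly shifts and scales the scores; if the relationship turns out to be clearly nonlinear, a monotonic mapping (for example, fitting against score percentiles) would be the next thing to try.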
Comments (2)
It may very well be that the sound quality is different.
Have you tried listening to some examples?
You can also use any spectrum analyzer to look at that data in detail. I suggest http://www.baudline.com/. Things you should look out for: the distance between the noise floor and the speech.
Also look at the high-frequency noise bursts when the letters t, f and s are spoken. In low-quality lines the difference between these letters disappears.
Why do you want to skew the quality measures? Giving an objective measure of the quality seems to make more sense.
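A hedged sketch of how the two observations above could be quantified: estimate the gap between the speech level and the noise floor, and the fraction of spectral energy above 4 kHz. The latter also works as a crude channel detector, since the telephone codecs discussed in the next answer remove that band. Assumes a mono WAV file and that numpy and scipy are available:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def channel_stats(wav_path):
    """Return (speech-to-noise-floor gap in dB, fraction of energy >= 4 kHz)."""
    sr, x = wavfile.read(wav_path)
    x = x.astype(np.float64)
    x /= (np.max(np.abs(x)) + 1e-12)                 # normalize amplitude

    # Frame RMS: loud frames approximate speech, quiet frames the noise floor.
    frame = int(0.02 * sr)
    n = (len(x) // frame) * frame
    rms = np.sqrt(np.mean(x[:n].reshape(-1, frame) ** 2, axis=1) + 1e-12)
    speech_db = 20 * np.log10(np.percentile(rms, 95))
    noise_db = 20 * np.log10(np.percentile(rms, 5))

    # Fraction of spectral energy above 4 kHz; near zero for telephone-band audio.
    f, pxx = welch(x, fs=sr, nperseg=1024)
    hf_ratio = pxx[f >= 4000].sum() / (pxx.sum() + 1e-12)

    return speech_db - noise_db, hf_ratio

# Hypothetical usage: a small hf_ratio suggests a landline/mobile recording,
# a larger one suggests the wideband mic recording.
# gap_db, hf = channel_stats("sentence_mic.wav")
```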
The landline codec will remove all frequencies around and above 4 kHz. The cell phone codec will throw away more information as part of a lossy compression process. Unless you have another side channel of information regarding the original audio content, there is no reliable way to recover the audio that was thrown away.
Your best bet to normalize is to low-pass filter the audio to match the 8 kHz telco codec, and then run the result through some cellular-standard compression algorithm (there may be one published for your particular mobile cellular protocol). This should reduce the quality of all 3 signals to about the same level.
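A minimal sketch of the band-limiting half of that suggestion: resample everything to 8 kHz so the mic recordings lose the same bandwidth as the telephone channels. Simulating the actual mobile codec (AMR, GSM, etc.) would additionally require an external encoder such as ffmpeg and is not shown here. Assumes mono WAV input and that numpy and scipy are available:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def telco_bandlimit(in_path, out_path, target_sr=8000):
    """Resample a recording to 8 kHz, discarding content above 4 kHz, so that
    wideband mic recordings are band-limited like the landline channel."""
    sr, x = wavfile.read(in_path)
    x = x.astype(np.float64)
    g = np.gcd(int(sr), int(target_sr))
    # Polyphase resampling applies the anti-aliasing low-pass filter for us.
    y = resample_poly(x, target_sr // g, sr // g)
    y = np.clip(y / (np.max(np.abs(y)) + 1e-12), -1.0, 1.0)
    wavfile.write(out_path, target_sr, (y * 32767).astype(np.int16))

# Hypothetical usage:
# telco_bandlimit("sentence_mic.wav", "sentence_mic_8k.wav")
```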