Identifying an audio sample in a file
I want to be able to identify an audio sample (provided by the user) in an audio file I've got (mp3).
The mp3 file is a radio stream that I've kept for testing purposes, and I have the pre-roll of the show. I want to find the pre-roll in the file and get the timestamp at which it plays.
Note: The solution can be in any of the following programming languages: Java, Python or C++. I don't know how to analyze the audio file, so any reference on this subject will help.
Comments (2)
This problem falls under the category of audio fingerprinting. Once you have matched a sample to a song, you also know the timestamp at which the sample occurs within it, since the matching step aligns the two in time. There is a great paper by the people behind Shazam that describes their technique: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf They basically pick out the local maxima in the spectrogram and build hashes from the relative positions of those peaks.
Here is a good review of audio fingerprinting algorithms: http://mtg.upf.edu/files/publications/MMSP-2002-pcano.pdf
In any case, you'll likely be working a lot with FFT and spectrograms. This post talks about how to do that in Python.
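To make the peak-and-hash idea concrete, here is a minimal sketch in Python using only NumPy. It is not Shazam's actual implementation: the function names (`spectrogram`, `peaks`, `hashes`, `find_offset`) and every parameter value are assumptions chosen for illustration, the "peaks" are simply the strongest few bins in each frame rather than true 2-D local maxima, and the audio is assumed to be already decoded from mp3 into mono float arrays at the same sample rate.

```python
from collections import Counter

import numpy as np


def spectrogram(samples, n_fft=1024, hop=512):
    """Magnitude spectrogram of shape (frames, bins) from a Hann-windowed FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def peaks(spec, per_frame=3):
    """(frame, bin) pairs for the strongest bins of each frame.

    A crude stand-in for the spectrogram local maxima used in the paper.
    """
    top = np.argsort(spec, axis=1)[:, -per_frame:]
    return [(t, int(b)) for t in range(spec.shape[0]) for b in sorted(top[t])]


def hashes(peak_list, fan_out=3):
    """Yield ((f1, f2, dt), t1): each peak paired with peaks in later frames."""
    by_frame = {}
    for t, f in peak_list:
        by_frame.setdefault(t, []).append(f)
    for t1, f1 in peak_list:
        for dt in range(1, fan_out + 1):
            for f2 in by_frame.get(t1 + dt, []):
                yield (f1, f2, dt), t1


def find_offset(haystack, needle, sample_rate, n_fft=1024, hop=512):
    """Return the needle's estimated start time in seconds, or None."""
    # Index every hash of the long recording by the frame where it occurs.
    index = {}
    for h, t in hashes(peaks(spectrogram(haystack, n_fft, hop))):
        index.setdefault(h, []).append(t)
    # Each matching hash votes for one frame offset; a true match piles up
    # votes on a single offset, while chance collisions scatter.
    votes = Counter()
    for h, t_needle in hashes(peaks(spectrogram(needle, n_fft, hop))):
        for t_hay in index.get(h, []):
            votes[t_hay - t_needle] += 1
    if not votes:
        return None
    best_frame, _ = votes.most_common(1)[0]
    return best_frame * hop / sample_rate
```

Calling `find_offset(stream, pre_roll, sample_rate)` returns the estimated start of the pre-roll in seconds, with a resolution of one hop (`hop / sample_rate`). The voting step is what makes the timestamp fall out of the match for free. A real recording that was re-encoded or is not aligned to the frame grid would likely need proper local-maximum picking and a minimum vote threshold to be reliable.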
I'd start by computing the FFT spectrogram of both the haystack and needle files (so to speak). Then you could try to (fuzzily) match the spectrograms. If you format them as images, you could even use off-the-shelf image-matching algorithms for that.
I'm not sure whether that's the canonical or optimal way, but I feel like it should work.
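One way this spectrogram-matching idea could look in practice is sketched below, again in Python with NumPy only. It treats both log-spectrograms as 2-D arrays, slides the needle along the haystack's time axis, and scores each position with a normalized correlation, which is roughly what image template matching does. The names `log_spectrogram` and `best_match` and all parameter values are assumptions for illustration, and the inputs are assumed to be mono float arrays already decoded at a common sample rate.

```python
import numpy as np


def log_spectrogram(samples, n_fft=1024, hop=256):
    """Log-magnitude spectrogram of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def best_match(haystack, needle, sample_rate, n_fft=1024, hop=256):
    """Slide the needle over the haystack in time.

    Returns (start_time_in_seconds, score), where score is a normalized
    correlation in [-1, 1]; values near 1 indicate a close match.
    """
    hay = log_spectrogram(haystack, n_fft, hop)
    ndl = log_spectrogram(needle, n_fft, hop)
    # Zero mean and unit variance so overall loudness does not affect the score.
    ndl = (ndl - ndl.mean()) / (ndl.std() + 1e-12)
    width = ndl.shape[0]
    best_score, best_frame = -np.inf, 0
    for start in range(hay.shape[0] - width + 1):
        patch = hay[start:start + width]
        patch = (patch - patch.mean()) / (patch.std() + 1e-12)
        score = float((patch * ndl).mean())
        if score > best_score:
            best_score, best_frame = score, start
    return best_frame * hop / sample_rate, best_score
```

The returned score gives a rough confidence: a low best score suggests the sample does not occur in the file at all. The cost grows with the product of both lengths, so for a long radio recording this brute-force scan is likely much slower than a hash-based fingerprint, but it is easy to reason about and tolerates some noise.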