Music recognition and signal processing
I want to build something similar to Tunatic or Midomi (try them out if you're not sure what they do), and I'm wondering what algorithms I'd have to use. My idea of how such applications work is something like this:
1. have a big database with several songs
2. for each song in 1, reduce the quality / bit-rate (to 64 kbps, for instance) and calculate the sound "hash"
3. have the sound / excerpt of the music you want to identify
4. for the excerpt in 3, reduce the quality / bit-rate (again to 64 kbps) and calculate its sound "hash"
5. if the sound hash from 4 is in any of the sound hashes from 2, return the matched music
I thought of reducing the quality / bit-rate to cope with environmental noise and encoding differences.
Am I on the right track here? Can anyone point me to any specific documentation or examples? Midomi even seems to recognize hums, which is pretty impressive! How do they do that?
Do sound hashes exist or is it something I just made up? If they do, how can I calculate them? And more importantly, how can I check if child-hash is in father-hash?
How would I go about building a similar system with Python (maybe a built-in module) or PHP?
Some examples (preferably in Python or PHP) will be greatly appreciated. Thanks in advance!
Comments (9)
I do research in music information retrieval (MIR). The seminal paper on music fingerprinting is the one by Haitsma and Kalker from around 2002-03. Google should find it for you.
I read an early (really early; before 2000) white paper about Shazam's method. At that point, they just basically detected spectrotemporal peaks, and then hashed the peaks. I'm sure that procedure has evolved.
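To make "detecting spectrotemporal peaks" a bit more concrete, here is a minimal sketch of peak picking on a toy spectrogram. It is only an illustration under stated assumptions: the function name `find_peaks`, the 3x3 neighbourhood, the threshold, and the magnitudes are all made up, and this is not claimed to be Shazam's actual procedure.

```python
def find_peaks(spec, threshold=0.0):
    """Return (frame, bin) positions that exceed `threshold` and are
    strictly larger than every neighbour in a 3x3 window.

    `spec` is a list of frames; each frame is a list of magnitudes,
    one per frequency bin.
    """
    peaks = []
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    for t in range(n_frames):
        for f in range(n_bins):
            value = spec[t][f]
            if value <= threshold:
                continue
            neighbours = [
                spec[tt][ff]
                for tt in range(max(0, t - 1), min(n_frames, t + 2))
                for ff in range(max(0, f - 1), min(n_bins, f + 2))
                if (tt, ff) != (t, f)
            ]
            if all(value > n for n in neighbours):
                peaks.append((t, f))
    return peaks


# Toy 4-frame x 5-bin "spectrogram" with made-up magnitudes.
toy_spec = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.1, 0.0],
    [0.1, 0.3, 0.2, 0.1, 0.7],
    [0.0, 0.1, 0.1, 0.2, 0.3],
]
peaks = find_peaks(toy_spec, threshold=0.5)
print(peaks)
```

Only the time and frequency positions of the peaks are kept; the magnitudes are thrown away, which is part of why this kind of representation tolerates volume changes and noise.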
Both of these methods address music similarity at the signal level, i.e., they are robust to environmental distortions. I don't think they work well for query-by-humming (QBH). However, that is a different (yet related) problem with different (yet related) solutions, which you can find in the literature. (Too many to name here.)
The ISMIR proceedings are freely available online. You can find valuable stuff there: http://www.ismir.net/
I agree with using an existing library like Marsyas; it depends on what you want. Numpy/Scipy is indispensable here, I think. Simple stuff can be written in Python on your own. Heck, if you need stuff like STFT or MFCC, I can email you code.
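As an example of the "simple stuff" that can be written with NumPy, here is a rough sketch of a short-time Fourier transform (STFT) magnitude. The function name `stft_magnitude`, the frame size, the hop, and the test tone are assumptions chosen for illustration; this is not the code the poster offered to email.

```python
import numpy as np


def stft_magnitude(signal, frame_size=256, hop=128):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # one second of sample times
tone = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone

spec = stft_magnitude(tone)
# Each bin spans sample_rate / frame_size = 31.25 Hz, so 1 kHz falls in bin 32.
dominant_bins = spec.argmax(axis=1)
print(spec.shape, dominant_bins[:5])
```

MFCCs build on a spectrogram like this one (mel filterbank, logarithm, then a DCT), so the STFT is usually the first building block to get right.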
I worked on the periphery of a cool framework that implements several Music Information Retrieval techniques. I'm hardly an expert (edit: actually I'm nowhere close to an expert, just to clarify), but I can tell you that the Fast Fourier Transform is used all over the place with this stuff. Fourier analysis is wacky, but its application is pretty straightforward. Basically, you can get a lot of information about audio when you analyze it in the frequency domain rather than the time domain. This is what Fourier analysis gives you.
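A small sketch of what the frequency domain buys you, assuming NumPy is available: two sine tones mixed together are hard to tell apart from the raw samples, but they show up as two clear peaks after an FFT. The sample rate and the tone frequencies are arbitrary choices for the example.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # one second of sample times

# Mix of a 440 Hz tone and a quieter 1200 Hz tone.
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Frequencies (rounded to whole Hz) of the two strongest bins.
strongest = np.argsort(spectrum)[-2:]
top_two = sorted(int(round(float(freqs[i]))) for i in strongest)
print(top_two)
```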
That may be a bit off topic from what you want to do. In any case, there are some cool tools in the project to play with, and you can view the source code of the core library itself: http://marsyas.sness.net
I recently ported my audio landmark-based fingerprinting system to Python:
https://github.com/dpwe/audfprint
It can recognize short (5-10 sec) excerpts from a reference database of tens of thousands of tracks, and is quite robust to noise and channel distortions. It uses combinations of local spectral peaks, similar to the Shazam system.
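The idea of combining local spectral peaks can be sketched roughly as follows. This is a simplified illustration, not audfprint's actual code: the function names, the `fan_out` and `max_dt` parameters, and the peak lists are all hypothetical. Each peak is paired with a few later peaks, the pair `(f1, f2, dt)` serves as the hash, and matching hashes vote for a time offset between query and reference.

```python
from collections import Counter


def landmark_hashes(peaks, fan_out=3, max_dt=10):
    """Pair each (time, freq) peak with up to `fan_out` later peaks.

    Returns a list of ((f1, f2, dt), t1) entries: the hash key and the
    time of the anchor peak.
    """
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue        # simultaneous peak, no time difference
            if dt > max_dt:
                break           # peaks are sorted, later ones are farther away
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return hashes


def best_offset(reference_hashes, query_hashes):
    """Vote on (reference time - query time); a true match piles up on one offset."""
    index = {}
    for key, t_ref in reference_hashes:
        index.setdefault(key, []).append(t_ref)
    votes = Counter(
        t_ref - t_query
        for key, t_query in query_hashes
        for t_ref in index.get(key, [])
    )
    return votes.most_common(1)[0] if votes else None


# Made-up (frame, bin) peaks for a reference track.
reference_peaks = [(0, 10), (2, 14), (3, 7), (5, 20),
                   (8, 12), (9, 30), (12, 5), (15, 18)]
# The query is the same material starting at reference frame 3.
query_peaks = [(t - 3, f) for t, f in reference_peaks if t >= 3]

offset, votes = best_offset(landmark_hashes(reference_peaks),
                            landmark_hashes(query_peaks))
print(offset, votes)
```

Because the hash depends only on frequencies and a time difference, it is the same wherever the excerpt starts; the absolute times are used only afterwards, to check that the matches agree on a single offset.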
This can only match the exact same track, since it relies on fine details of frequencies and time differences - it wouldn't even match different takes, certainly not cover versions or hums. As far as I understand, Midomi/SoundHound works by matching hums to each other (e.g. via dynamic time warping), then has a set of human-curated links between sets of hums and the intended music track.
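For readers unfamiliar with dynamic time warping, here is a minimal textbook sketch. The pitch contours are invented for the example, and nothing here is claimed to be how Midomi/SoundHound actually implements its matching.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Made-up pitch contours, in semitones relative to the first note.
melody = [0, 2, 4, 5, 4, 2, 0]
slow_hum = [0, 0, 2, 2, 4, 4, 5, 5, 4, 4, 2, 2, 0, 0]   # same tune, half speed
other_tune = [0, -1, -3, -1, 0, 3, 7]

print(dtw_distance(melody, slow_hum))     # tempo difference is absorbed
print(dtw_distance(melody, other_tune))   # different tune costs more
```

The point of the warping is that two hums of the same tune at different tempos still align cheaply, which a frame-by-frame comparison would not allow.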
Matching a hum directly to a music track ("Query by humming") is an ongoing research problem in music information retrieval, but is still pretty difficult. You can see abstracts for a set of systems evaluated last year at the MIREX 2013 QBSH Results.
MFCCs extracted from the music are very useful for finding the timbral similarity between songs; this is most often used to find similar songs. As darren pointed out, Marsyas is a tool that can extract MFCCs and find similar songs by converting the MFCCs into a single vector representation.
Other than MFCCs, rhythm is also used to find song similarity. There are a few papers presented at MIREX 2009 that will give you a good overview of the different algorithms and features that are most helpful in detecting music similarity.
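One common way to turn per-frame features into a single vector is to take the mean and standard deviation of each coefficient and compare songs by cosine similarity. The sketch below assumes that approach; it is not necessarily what Marsyas does, and the three-dimensional frames are placeholders rather than real MFCCs.

```python
import math


def summarize(frames):
    """Collapse per-frame feature vectors into [means..., standard deviations...]."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dims)
    ]
    return means + stds


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Placeholder "MFCC-like" frames; real MFCCs would come from a feature extractor.
song_a = [[1.0, 0.2, -0.5], [1.1, 0.1, -0.4], [0.9, 0.3, -0.6]]
song_b = [[1.0, 0.25, -0.45], [1.2, 0.15, -0.5], [0.8, 0.2, -0.55]]
song_c = [[-0.7, 1.5, 0.9], [-0.6, 1.4, 1.1], [-0.8, 1.6, 1.0]]

sim_ab = cosine_similarity(summarize(song_a), summarize(song_b))
sim_ac = cosine_similarity(summarize(song_a), summarize(song_c))
print(sim_ab > sim_ac)
```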
The MusicBrainz project maintains such a database. You can make queries to it based on a fingerprint.
The project has existed for a while and has used different fingerprints in the past. See here for a list.
The latest fingerprint they are using is AcoustId. There is the Chromaprint library (also with Python bindings) where you can create such fingerprints. You must feed it raw PCM data.
I have recently written a library in Python that does the decoding (using FFmpeg) and provides functions to generate the AcoustId fingerprint (using Chromaprint), among other things (such as playing the stream via PortAudio). See here.
It's been a while since I last did signal processing, but rather than downsampling you should look at frequency-domain representations (e.g. FFT or DCT). Then you could make a hash of sorts and search for the database song that contains that sequence.
The tricky part is making this search fast (some papers on gene search might be of interest). I suspect that iTunes also does some detection of instruments to narrow down the search.
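One way to make the search fast, borrowed loosely from the k-mer indexing used in sequence search, is to index every short run of frame symbols and look up the query's runs directly. The sketch below assumes each song has already been reduced to a sequence of per-frame dominant frequency bins; the function names, `k=4`, and the data are hypothetical.

```python
from collections import defaultdict


def kgrams(symbols, k=4):
    """All length-k windows of a symbol sequence, with their start positions."""
    return [(tuple(symbols[i:i + k]), i) for i in range(len(symbols) - k + 1)]


def build_index(database, k=4):
    """Map each k-gram to the (song, position) pairs where it occurs."""
    index = defaultdict(list)
    for song, symbols in database.items():
        for gram, pos in kgrams(symbols, k):
            index[gram].append((song, pos))
    return index


def search(index, query, k=4):
    """Count k-gram hits per song and return songs by descending hit count."""
    hits = defaultdict(int)
    for gram, _ in kgrams(query, k):
        for song, _pos in index.get(gram, []):
            hits[song] += 1
    return sorted(hits.items(), key=lambda item: -item[1])


# Hypothetical data: each song as its per-frame dominant frequency bin.
database = {
    "song_a": [12, 12, 15, 19, 15, 12, 10, 12, 15, 19, 22, 19],
    "song_b": [30, 28, 25, 28, 30, 33, 30, 28, 25, 21, 25, 28],
}
index = build_index(database)
query = [15, 12, 10, 12, 15, 19]   # an excerpt of song_a
results = search(index, query)
print(results)
```

Each lookup is a dictionary access, so query time depends on the length of the excerpt rather than on the size of the database. Exact k-gram matching is brittle under noise, though, which is why peak-based hashes are usually preferred in practice.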
I did read a paper about the method that a certain music information retrieval service (no names mentioned) uses: it calculates the short-time Fourier transform over the audio sample. The algorithm then picks out 'peaks' in the frequency domain, i.e. time positions and frequencies with particularly high amplitude, and uses the time and frequency of these peaks to generate a hash. It turns out the hash has surprisingly few collisions between different samples, and also stands up to approximately 50% loss of the peak information.
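To see why losing part of the peak information need not break matching, consider scoring a query by the fraction of its hashes found in a reference. This is a toy illustration with invented hash triples and a hypothetical `match_score` function, not the scoring used by any particular service.

```python
def match_score(reference_hashes, query_hashes):
    """Fraction of distinct query hashes that also occur in the reference."""
    query = set(query_hashes)
    if not query:
        return 0.0
    return len(query & set(reference_hashes)) / len(query)


# Hypothetical hashes: (frequency bin 1, frequency bin 2, time gap) triples.
track = [(10, 14, 2), (14, 7, 1), (7, 20, 2), (20, 12, 3),
         (12, 30, 1), (30, 5, 3), (5, 18, 3), (18, 9, 2)]
other = [(11, 3, 2), (3, 25, 4), (25, 16, 1),
         (16, 8, 2), (8, 27, 3), (27, 13, 1)]

# Simulated noisy recording: every second hash of the track is lost,
# and two spurious hashes caused by noise are added.
noisy_query = track[::2] + [(40, 41, 1), (42, 43, 2)]

score_track = match_score(track, noisy_query)
score_other = match_score(other, noisy_query)
print(score_track, score_other)
```

Even with half of the original hashes gone, the correct track still scores far above an unrelated one, because unrelated tracks share almost no hashes in the first place.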
Currently I'm developing a music search engine using ActionScript 3. The idea is to analyze the chords first and mark the frames (it's limited to mp3 files at the moment) where the frequency changes drastically (melody changes, ignoring noise). After that I do the same thing to the input sound, and match the results against the inverted files. The matching one determines the matching song.
For Axel's method, I think you shouldn't worry about whether the query is singing or just humming, since you aren't implementing a speech recognition program. But I'm curious about your method that uses hash functions. Could you explain it to me?
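The frame-marking step could look roughly like the following. It is written in Python rather than ActionScript, and the function name, the ratio threshold, and the frequency values are assumptions made for the sketch, not the poster's implementation.

```python
def change_frames(dominant_hz, min_ratio=1.05):
    """Indices of frames whose dominant frequency differs from the previous
    frame by more than a factor of `min_ratio` (in either direction)."""
    marked = []
    for i in range(1, len(dominant_hz)):
        prev, cur = dominant_hz[i - 1], dominant_hz[i]
        if prev <= 0 or cur <= 0:
            continue   # skip silent or undefined frames
        ratio = max(prev, cur) / min(prev, cur)
        if ratio > min_ratio:
            marked.append(i)
    return marked


# Made-up dominant frequency per frame, in Hz: small jitter around a note,
# then jumps to new notes.
track = [440, 441, 439, 494, 495, 494, 523, 522, 440, 440]
marked = change_frames(track)
print(marked)
```

A ratio threshold is used instead of an absolute difference in Hz because musical intervals are ratios: a semitone is a factor of about 1.059 at any pitch.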
Query by humming is more complicated than audio fingerprinting. The difficulty comes from:
Here is an open-source query-by-humming demo project that could serve as a reference: https://github.com/EmilioMolina/QueryBySingingHumming