Perceptual similarity between two audio sequences
I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.
It seems like a difficult problem. What would be the best way to approach it? I was thinking of extracting a couple of features from the audio signals and then computing a Euclidean distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful for determining the perceptual difference between sounds?
(I saw somewhere that Shazam uses hashing, but that's a different problem because there the two pieces of audio being compared are fundamentally the same, but one has more noise. Here, the two pieces of audio are not the same, they are just perceptually similar.)
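One minimal way to prototype the "extract features, then compare" idea is sketched below. The librosa library and the file names are assumptions on my part, and mean MFCCs are just one common, roughly perceptual feature; treat this as a starting point rather than a recommended measure.

```python
# Rough sketch: summarize each clip as a mean-MFCC vector, then compare with
# cosine similarity. Assumes the librosa library is installed.
import numpy as np
import librosa

def mfcc_vector(path, n_mfcc=20):
    """Load an audio file and summarize it as a single mean-MFCC vector."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # collapse time -> (n_mfcc,)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage with hypothetical file names:
# score = cosine_similarity(mfcc_vector("dog_bark.wav"), mfcc_vector("imitation.wav"))
# print(f"similarity: {score:.3f}")  # closer to 1.0 = more similar by this crude measure
```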
Comments (3)
The process for comparing a set of sounds for similarities is called Content Based Audio Indexing, Retrieval, and Fingerprinting in computer science research.
One method of doing this is to:
Run several bits of signal processing on each audio file to extract features, such as pitch over time, frequency spectrum, autocorrelation, dynamic range, transients, etc.
Put all the features for each audio file into a multi-dimensional array and dump each multi-dimensional array into a database
Use optimization techniques (such as gradient descent) to find the best match for a given audio file in your database of multi-dimensional data.
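A bare-bones sketch of steps 2 and 3 follows. A plain nearest-neighbour search stands in for the "optimization techniques (such as gradient descent)" mentioned above, and `extract_features()` is only a placeholder for whatever signal processing (pitch, spectrum, autocorrelation, transients, ...) you settle on.

```python
# Sketch: store one feature vector per audio file, then query for the closest match.
import numpy as np

def extract_features(path):
    # Placeholder: should return a fixed-length feature vector for the file at `path`.
    raise NotImplementedError("plug in your own pitch/spectrum/transient features")

def build_index(paths):
    """Step 2: one feature vector per audio file, keyed by path."""
    return {path: extract_features(path) for path in paths}

def best_match(query_path, index):
    """Step 3: return the indexed file whose feature vector is closest (Euclidean)."""
    q = extract_features(query_path)
    return min(index.items(), key=lambda kv: np.linalg.norm(kv[1] - q))[0]
```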
The trick to making this work well is picking the right features. Doing this automatically and getting good results can be tricky. The guys at Pandora do this really well, and in my opinion they have the best similarity matching around. They encode their vectors by hand though, by having people listen to music and rate it in many different ways. See their Music Genome Project and List of Music Genome Project attributes for more info.
For automatic distance measurements, there are several projects that do stuff like this, including Marsyas, MusicBrainz, and EchoNest.
EchoNest has one of the simplest APIs I've seen in this space. Very easy to get started.
I'd suggest looking into spectrum analysis. Whilst this isn't as straightforward as you're probably hoping for, I'd expect that decomposing the audio into its underlying frequencies would provide some very useful data to analyse. Check out this link
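As a rough illustration of this suggestion, the sketch below compares the overall magnitude spectra of two clips. It assumes 1-D numpy arrays of samples at the same sample rate; the fixed binning and the correlation measure are arbitrary choices of mine, not anything this answer prescribes.

```python
# Sketch: compare the spectral shape of two clips via binned FFT magnitudes.
import numpy as np

def magnitude_spectrum(samples, n_bins=128):
    spectrum = np.abs(np.fft.rfft(samples))
    # Average into a fixed number of bins so clips of different lengths are comparable.
    chunks = np.array_split(spectrum, n_bins)
    binned = np.array([c.mean() for c in chunks])
    return binned / (binned.sum() + 1e-12)    # normalize to a distribution

def spectral_similarity(samples_a, samples_b):
    a, b = magnitude_spectrum(samples_a), magnitude_spectrum(samples_b)
    return float(np.corrcoef(a, b)[0, 1])     # 1.0 = identical spectral shape
```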
Your first step will definitely be taking a Fourier transform (FT) of the sound waves. If you perform an FT on the data with respect to frequency over time[1], you'll be able to compare how often certain key frequencies are hit over the course of the noise.
Perhaps you could also subtract one wave from the other, to get a sort of stepwise difference function. Assuming the mock noise follows the same frequency and pitch trends[2] as the original noise, you could calculate the line of best fit to the points of the difference function. Comparing that best-fit line against a line of best fit taken from the original sound wave, you could average out a trend line to use as the basis of comparison. Granted, this would be a very loose comparison method.
- [1] Hz/ms, perhaps? I'm not familiar with the unit magnitudes being worked with here; I generally work in the femto- to nano- range.
- [2] So long as, for all ΔT, ΔPitch/ΔT and ΔFrequency/ΔT are within some tolerance x.
- Edited for formatting, and because I actually forgot to finish writing the full answer.
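A loose sketch of this answer's idea: take a short-time Fourier transform, follow the dominant frequency over time, fit a trend line to each clip, and compare the two trends. The use of scipy/numpy, the STFT parameters, and the slope comparison are my assumptions; this is one possible reading of the "trend line" suggestion, not a definitive implementation.

```python
# Sketch: compare how the dominant frequency of each clip trends over time.
import numpy as np
from scipy.signal import stft

def dominant_frequency_trend(samples, sample_rate):
    """Slope (Hz per second) of a least-squares line fitted to the dominant
    frequency in each STFT frame."""
    freqs, times, spec = stft(samples, fs=sample_rate, nperseg=1024)
    dominant = freqs[np.argmax(np.abs(spec), axis=0)]   # peak frequency per frame
    slope, _intercept = np.polyfit(times, dominant, 1)  # line of best fit
    return slope

def trend_difference(samples_a, samples_b, sample_rate):
    """Smaller value = the two clips' frequency trends move more alike."""
    return abs(dominant_frequency_trend(samples_a, sample_rate)
               - dominant_frequency_trend(samples_b, sample_rate))
```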