What is the fastest way to determine whether two MP3 files are duplicates?
I want to write a program that deletes duplicate iTunes music files. One approach to identifying dupes is to compare MD5 digests of the MP3 and m4a files. Is there a more efficient strategy?
BTW the "Display Duplicates" menu command on iTunes shows false positives. Apparently it just compares on the Artist and Track title strings.
Comments (1)
If you compare two sets of data by hash, the inputs have to be byte-for-byte identical to produce the same output (barring the extremely unlikely case of a collision, where two different inputs happen to give the same hash). If you compare two MP3 files by hashing each whole file, the song data might be exactly the same, but because the ID3 tags are stored inside the file, any discrepancy there makes the files look completely different. A hash also won't tell you that perhaps 99% of the two files match, because even a tiny difference in the input changes the output entirely.
If you really want to use a hash, hash only the sound data and exclude any tags attached to the file. Even that isn't something I'd recommend: if music is ripped from CDs, for example, and the same CD is ripped on two separate occasions, the results might be encoded or compressed differently depending on the ripping parameters, so the hashes still won't match.
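As a rough sketch of the "hash only the sound data" idea, the Python below skips a leading ID3v2 tag and a trailing ID3v1 tag before hashing. The function names (`strip_id3`, `audio_digest`) are made up for illustration. It assumes plain MP3 files: it does not handle APE tags or other metadata, and m4a files use a different container that would need its own parsing.

```python
import hashlib


def _synchsafe_to_int(b: bytes) -> int:
    """Decode the 4-byte synchsafe size in an ID3v2 header (7 bits per byte)."""
    return (b[0] & 0x7F) << 21 | (b[1] & 0x7F) << 14 | (b[2] & 0x7F) << 7 | (b[3] & 0x7F)


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag.

    Assumption: nothing else (APE tags, padding junk) surrounds the audio.
    """
    start, end = 0, len(data)
    # ID3v2: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = _synchsafe_to_int(data[6:10])
        footer = 10 if data[5] & 0x10 else 0  # optional 10-byte footer
        start = min(10 + size + footer, end)
    # ID3v1: a fixed 128-byte block at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_digest(data: bytes) -> str:
    """MD5 of the file contents with the ID3 tags removed."""
    return hashlib.md5(strip_id3(data)).hexdigest()
```

Two copies of the same encoded track that differ only in their tags should then give the same `audio_digest`, while two separate rips of the same CD generally will not.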
A better (but much more complicated) alternative is to compare the uncompressed audio data values. A little trial and error with known inputs can lead to a decent algorithm. Doing this perfectly will be very hard (if possible at all), but if you get something that's more than 50% accurate, it'll be better than going through by hand.
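One very simple way to compare uncompressed audio values might look like the sketch below. It is not a tested duplicate detector, just an illustration: it assumes both tracks have already been decoded to mono PCM samples (floats) by some decoder not shown here, that they start at the same point in time, and that correlating coarse loudness envelopes is a good enough measure. The names `envelope` and `similarity` and the window size are arbitrary choices.

```python
import math


def envelope(samples, window=1024):
    """RMS energy per non-overlapping window: a coarse loudness profile."""
    env = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        env.append(math.sqrt(sum(x * x for x in chunk) / window))
    return env


def similarity(a, b, window=1024):
    """Pearson correlation of the two loudness envelopes, in [-1, 1].

    Assumption: both inputs are time-aligned mono PCM sample sequences.
    """
    ea, eb = envelope(a, window), envelope(b, window)
    n = min(len(ea), len(eb))
    if n < 2:
        return 0.0
    ea, eb = ea[:n], eb[:n]
    mean_a, mean_b = sum(ea) / n, sum(eb) / n
    da = [x - mean_a for x in ea]
    db = [y - mean_b for y in eb]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0
```

A score near 1 suggests the same recording at a different volume or encoding quality. The threshold for calling two tracks duplicates is exactly the kind of thing that would need the trial and error with known inputs mentioned above.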
Note that even if an algorithm can detect that two songs are close (say, the same song ripped under different parameters), telling whether a live version is anything like a studio version would take more complexity than it's worth. If you can do that, there's money to be made here!
To come back to the original question of how fast you can tell whether they're duplicates: a hash will be a lot faster, but a lot less accurate, than any algorithm built for this purpose. It's speed versus accuracy and complexity.
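If you do stay on the fast, hash-only side of that trade-off, one common way to make it cheaper is to hash only files that share a size, since files of different sizes cannot be byte-identical. The sketch below is an assumption about how that could look, not something from the question; `exact_duplicates` is a made-up name, and it only finds byte-for-byte copies (a tag difference is enough to hide a duplicate).

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files are not loaded whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def exact_duplicates(paths):
    """Group byte-identical files, hashing only files whose sizes collide."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_digest = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an exact duplicate
        for p in same_size:
            by_digest[file_md5(p)].append(p)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Each returned group is a list of paths with identical contents; deciding which copy to delete is left to the caller.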