Algorithm for determining a file's identity (optimization)
Further to this question: Algorithm for determining a file’s identity
Recap: I'm looking for a cheap algorithm for determining a file's identity that works the vast majority of the time.
I went ahead and implemented an algorithm that gives me a "pretty unique" hash per file.
The way my algorithm works is:
For files smaller than a certain threshold I use the full file's content as the identity hash.
For files larger than the threshold I take N random samples of size X.
I include the file size in the hashed data (meaning files with different sizes always produce different hashes). A rough sketch of the scheme is shown below.
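Roughly, the scheme looks like this in Python (just a sketch, not my actual code - SHA-1, the 128K threshold and seeding the sample positions from the file size so they stay reproducible are illustrative choices):

```python
import hashlib
import os
import random

THRESHOLD = 128 * 1024   # hash files smaller than this in full (placeholder value)
NUM_SAMPLES = 4          # N: number of samples
SAMPLE_SIZE = 8 * 1024   # X: bytes per sample

def identity_hash(path):
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(size.to_bytes(8, 'little'))       # the file size is part of the hashed data
    with open(path, 'rb') as f:
        if size <= THRESHOLD:
            h.update(f.read())                 # small file: hash the full content
        else:
            # Seed the RNG with the file size so the same file always yields
            # the same sample positions (the hash has to be reproducible).
            rng = random.Random(size)
            for _ in range(NUM_SAMPLES):
                f.seek(rng.randrange(size - SAMPLE_SIZE))
                h.update(f.read(SAMPLE_SIZE))  # large file: N random samples of X bytes
    return h.hexdigest()
```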
Questions:
What values should I choose for N and X (how many random samples should I take, and of what size)? I went with 4 samples of 8K each and am not able to stump the algorithm. I found that increasing the number of samples quickly decreases the speed of the algorithm (because seeks are pretty expensive).
The maths one: how similar do my files need to be for this algorithm to blow up (i.e. two different files with the same length end up with the same hash)?
The optimization one: Are there any ways I can optimize my concrete implementation to improve throughput (I seem to be able to do about 100 files a second on my system).
Does this implementation look sane? Can you think of any real-world examples where this will fail? (My focus is on media files.)
Relevant information:
Thanks for your help!
Comments (3)
This is because those blocks are the most likely to differ from file to file. If you consider BMP, it may have a fairly standard header (like an 800x600 image, 24-bit, rest null), so you may want to overshoot the header a bit to get to the differentiating data. The problem is that headers vary wildly in size.
The last block is for file formats that append data to the original.
Even then, unless you're lucky, you will misidentify some files as the same (for example a SQL Server database file and its 1:1 backup copy after only a few insertions; except that SQL Server does write a timestamp..)
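For illustration, picking sample positions that overshoot the header and always include the last block could look something like this (a Python sketch; the 64 KB header skip, the 8 KB block size and seeding by file size are arbitrary choices for the example, not anything from the question):

```python
import os
import random

HEADER_SKIP = 64 * 1024   # overshoot a "standard" header region (arbitrary guess)
BLOCK_SIZE = 8 * 1024

def sample_offsets(path, num_samples=4):
    size = os.path.getsize(path)
    last = max(size - BLOCK_SIZE, 0)        # start of the last block
    start = min(HEADER_SKIP, last)          # don't skip past the end of small files
    rng = random.Random(size)               # reproducible positions per file size
    offsets = [rng.randrange(start, last + 1) for _ in range(num_samples)]
    offsets.append(last)                    # always include the last block
    return offsets
```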
I would avoid a solution like this. In practice it might be close to impossible for two media files to have the same size and the same data at corresponding positions in compressed formats. But if you have to deal with uncompressed images or wave files, the chances that small local changes go undetected grow.
So I think you should really hash the whole file. While this seems expensive, it might not be if you have access to all the files - for example if you build a file server or something like that. You could build the hash incrementally.
If you see a new file with a unique file length, just store the file length. If another file with the same length is added, calculate the hashes of both files block by block until they differ. Store the file length, the hash, and how many blocks of the file are included in the hash. Whenever you detect matching file lengths and hashes and you have not yet hashed the whole file, extend the hash by adding more blocks.
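A rough sketch of that bookkeeping (Python; the block size and SHA-1 are arbitrary, and re-hashing the prefix from the start whenever it is extended is a simplification - a real implementation would keep the incremental hash state around instead):

```python
import hashlib
import os
from collections import defaultdict

BLOCK_SIZE = 64 * 1024        # arbitrary block size for this sketch

def prefix_digest(path, num_blocks):
    """Digest of the first num_blocks blocks of the file."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for _ in range(num_blocks):
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

class LazyIndex:
    """Index files by length; hash only as many leading blocks as are
    needed to tell apart the files of equal length seen so far."""

    def __init__(self):
        # file length -> list of {path, blocks, digest}
        self.by_length = defaultdict(list)

    def add(self, path):
        size = os.path.getsize(path)
        total = max(1, (size + BLOCK_SIZE - 1) // BLOCK_SIZE)   # blocks in the file
        new = {'path': path, 'blocks': 0, 'digest': None}       # unique length: no hashing yet
        for entry in self.by_length[size]:
            depth = max(entry['blocks'], new['blocks'], 1)
            while True:
                # extend whichever prefix hash is shorter than the current depth
                for e in (entry, new):
                    if e['blocks'] < depth:
                        e['blocks'] = depth
                        e['digest'] = prefix_digest(e['path'], depth)
                if entry['digest'] != new['digest'] or depth >= total:
                    break     # files distinguished, or fully hashed (true duplicates)
                depth += 1    # still colliding: hash one more block
        self.by_length[size].append(new)
```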
Some thoughts about performance. For small files, the chance of equal file lengths is quite high - there are not that many different small file lengths. But it is not expensive to hash small files.
For larger files the chance of file-length collisions goes down, as there are more and more possible file lengths. For different media files the chances are very good that they differ right after the header, so you will need to hash only a short part of the beginning of the file.
Finally, you will be sure to detect different files (except for hash collisions), because you will hash the whole file if required.
UPDATE
For movies I would consider the file length practically unique, but files recoded to fit on a given medium probably render this idea void - (S)VCD movies will all be in a small range of file lengths around CD-ROM capacity.
But for movie files in general, I would just hash one block (maybe 512 bytes) from the middle of the file. Two different movies with the same image and sound at the same position? Practically impossible, unless you manipulate files to fail this test. But you could easily generate files to fail any deterministic sampling strategy - so this should not really matter.
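For instance (a small Python sketch of exactly that - the file length plus one 512-byte block from the middle):

```python
import hashlib
import os

def movie_identity(path, block_size=512):
    """File length plus the digest of one block taken from the middle of the file."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(max((size - block_size) // 2, 0))   # jump to (roughly) the middle
        return size, hashlib.sha1(f.read(block_size)).hexdigest()
```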
(Select X random numbers then sort them).
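Presumably the point is to generate the sample offsets up front and sort them, so the reads sweep the file forward instead of seeking back and forth. A sketch (seeding by file size for reproducible positions is my assumption, not part of the comment):

```python
import random

def sorted_sample_offsets(size, num_samples=4, sample_size=8 * 1024):
    # assumes size > sample_size (large files only)
    rng = random.Random(size)   # seed by file size so the positions are reproducible (assumption)
    offsets = [rng.randrange(size - sample_size) for _ in range(num_samples)]
    return sorted(offsets)      # reading in ascending order keeps the seeks forward-only
```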