Algorithm for determining a file's identity

Posted 2024-07-11 08:13:06


For an open source project I have I am writing an abstraction layer on top of the filesystem.

This layer allows me to attach metadata and relationships to each file.

I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.

To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.

So, I have been thinking of an algorithm that though not 100% correct will be right the vast majority of the time, and is cheap.

One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.

Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.

I need this algorithm to work for very large files (1GB+) as well as tiny files (5K).
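
To make this concrete, here is a rough sketch of such a sampling hash in Python. The 64 KB block size, the four evenly spaced samples, the inclusion of the tail, and the use of SHA-1 over the sampled bytes are all assumptions for illustration, not a settled design:

```python
import hashlib
import os

def quick_identity(path, block_size=64 * 1024, samples=4):
    """Cheap, probabilistic identity: hash the file size plus a few sampled blocks."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(size.to_bytes(8, "little"))      # the size itself is part of the identity
    with open(path, "rb") as f:
        if size <= block_size * (samples + 2):
            h.update(f.read())                # small files: just hash everything
        else:
            step = size // samples
            for i in range(samples):
                f.seek(i * step)
                h.update(f.read(block_size))  # evenly spaced samples
            f.seek(size - block_size)
            h.update(f.read(block_size))      # always include the tail
    return h.hexdigest()
```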

EDIT

I need this algorithm to work on NTFS and all SMB shares (Linux or Windows based). I would like it to support situations where a file is copied from one spot to another (the 2 physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per filetype).

EDIT 2

Related question: Algorithm for determining a file’s identity (Optimisation)


Comments (8)

策马西风 2024-07-18 08:13:06


Bucketing, multiple layers of comparison should be fastest and scalable across the range of files you're discussing.

First level of indexing is just the length of the file.

Second level is hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:

  1. To avoid hitting regularly spaced headers which may be highly similar or identical, you need to step by an irregular interval, e.g. multiples of a prime or successive primes (see the sketch after this list).
  2. Avoid steps which might end up encountering regular record headers, so if you are getting the same value from your sample bytes despite different location, try adjusting the step by another prime.
  3. Cope with anomalous files with large stretches of identical values, either because they are unencoded images or just filled with nulls.
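
A rough sketch of this two-level scheme in Python; the 1 MB whole-file cutoff, the 4 KB chunk size, the sample budget, and the particular prime are illustrative assumptions only:

```python
import hashlib
import os

WHOLE_FILE_CUTOFF = 1 * 1024 * 1024  # assumed: below this, hash the whole file
CHUNK = 4096                         # assumed sample size per read
PRIME = 8191                         # prime used to keep the stride off header boundaries

def identity_key(path, samples=32):
    size = os.path.getsize(path)     # level 1: the length is the first bucket key
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= WHOLE_FILE_CUTOFF:
            h.update(f.read())       # level 2a: whole-file hash for small files
        else:
            # level 2b: sample chunks with a prime-perturbed stride so the reads
            # don't keep landing on regularly spaced record headers
            step = max(size // samples, CHUNK) + PRIME
            offset = 0
            while offset < size:
                f.seek(offset)
                h.update(f.read(CHUNK))
                offset += step
    return (size, h.hexdigest())
```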
撩发小公举 2024-07-18 08:13:06


Do the first 128k, another 128k at the 1mb mark, another 128k at the 10mb mark, another 128k at the 100mb mark, another 128k at the 1000mb mark, etc. As file sizes get larger and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128k is taken care of completely.
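
Read literally, that scheme might be sketched as follows; SHA-1 and folding the file size into the digest are assumptions on top of the answer:

```python
import hashlib
import os

BLOCK = 128 * 1024  # 128 KB per sample

def sparse_hash(path):
    size = os.path.getsize(path)
    h = hashlib.sha1()                    # any digest would do; SHA-1 is an assumption
    h.update(size.to_bytes(8, "little"))  # size alone separates most large files
    with open(path, "rb") as f:
        h.update(f.read(BLOCK))           # first 128 KB; files under 128 KB are read completely
        mark = 1 << 20                    # then 128 KB at the 1 MB, 10 MB, 100 MB, ... marks
        while mark < size:
            f.seek(mark)
            h.update(f.read(BLOCK))
            mark *= 10
    return h.hexdigest()
```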

鸠魁 2024-07-18 08:13:06


Believe it or not, I use the ticks of the file's last write time. It is as cheap as it gets and I have yet to see a clash between different files.

九厘米的零° 2024-07-18 08:13:06


If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:

  • doesn't require any kind of hashing;
  • survives renames; and
  • survives moves (even between different NTFS volumes).

You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
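
Since alternate data streams are reached through ordinary file I/O, a minimal sketch needs nothing special; the ":meta" stream name is the example from above, the JSON encoding is an arbitrary choice, and this only works on NTFS under Windows:

```python
# Windows/NTFS only: an alternate data stream is addressed as "path:streamname"
import json

def write_meta(path, metadata):
    with open(path + ":meta", "w", encoding="utf-8") as stream:
        json.dump(metadata, stream)

def read_meta(path):
    with open(path + ":meta", "r", encoding="utf-8") as stream:
        return json.load(stream)

# e.g. write_meta(r"D:\Movies\Terminator", {"title": "Terminator", "rating": 5})
```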

If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
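
If you are working from Python, recent CPython versions back os.stat on Windows with GetFileInformationByHandle, surfacing the volume serial number and file index as st_dev and st_ino (the exact plumbing has varied between versions, so treat this as an assumption to verify); a sketch of a rename/move-stable key might be:

```python
import os

def stable_file_id(path):
    """Key that survives renames/moves on the same volume.

    On Windows this is derived from the volume serial number and the NTFS file
    index; on Linux it is the device and inode numbers, which serve the same
    purpose for same-volume moves.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)
```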

美人骨 2024-07-18 08:13:06


How about storing some random integers ri, and looking up bytes (ri mod n) where n is the size of file? For files with headers, you can ignore them first and then do this process on the remaining bytes.

If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability 1/e ~ 37%; increasing the number of bytes you look at makes this probability go down exponentially.

The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).

Some more advice:

  • Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
  • I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
  • At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
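
Combining the random-offset idea with the advice above (read chunks rather than single bytes, and always include the first block), a sketch might look like this; the fixed seed stands in for the "stored random integers", since the offsets must be reproducible between runs, and the chunk size and sample count are arbitrary:

```python
import hashlib
import os
import random

CHUNK = 4096      # read chunks rather than single bytes to amortise seek cost
SAMPLES = 100     # matches the 100-random-bytes example above
SEED = 0x5EED     # fixed seed: the sampled offsets must be reproducible

def random_sample_hash(path):
    size = os.path.getsize(path)
    rng = random.Random(SEED)
    h = hashlib.sha1()
    h.update(size.to_bytes(8, "little"))
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))             # always look at the first block
        if size > CHUNK:
            for _ in range(SAMPLES):
                f.seek(rng.randrange(size)) # effectively r_i mod n
                h.update(f.read(CHUNK))
    return h.hexdigest()
```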
攒眉千度 2024-07-18 08:13:06


Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.

Actually, this is the whole point of a stackable layered filesystem, that you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could actually do this in several ways. Some of this is very dependent on the platform you are looking at. This is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or what have you.

After reading your edits, a couple more things. File systems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hashed value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)

一绘本一梦想 2024-07-18 08:13:06


Which bytes should I choose for the sample?

I think that I would try to use some progression like the Fibonacci numbers. These are easy to calculate, and they have a diminishing density. Small files would have a higher sample ratio than big files, and the sample would still go over spots in the whole file.
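
A small sketch of sampling at Fibonacci offsets; hashing one byte per position is a literal reading of the suggestion, and in practice you would probably read a small chunk at each offset instead:

```python
import hashlib
import os

def fibonacci_offsets(limit):
    """Yield Fibonacci numbers below limit; their density thins out as they grow."""
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

def fib_sample_hash(path):
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(size.to_bytes(8, "little"))
    with open(path, "rb") as f:
        for offset in fibonacci_offsets(size):
            f.seek(offset)
            h.update(f.read(1))   # one byte per Fibonacci position
    return h.hexdigest()
```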

独﹏钓一江月 2024-07-18 08:13:06


This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).

To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.

There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.
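
A sketch of the escalation step; the (file size, bytes hashed, hash) records are assumed to live in whatever database you keep, and "increase the hash length" is read here as doubling the hashed prefix until it passes the first difference or covers the whole file:

```python
import hashlib

START = 64 * 1024  # assumed initial prefix length

def prefix_digest(path, length):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(length))
    return h.hexdigest()

def resolve_collision(path_a, path_b, size, length=START):
    """Called when two files share (size, bytes hashed, hash).

    Either the files are identical, or we find a prefix length at which they
    differ and store that longer length for both database entries."""
    while True:
        if prefix_digest(path_a, length) != prefix_digest(path_b, length):
            return False, length   # they differ; remember the longer prefix length
        if length >= size:
            return True, length    # whole file hashed and still equal: identical
        length *= 2
```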
