Efficient algorithm for diffing large files

Posted 2024-08-17 14:40:04

I have to store two files A and B which are both very large (like 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two interesting aspects to this problem:

  1. The files are too big to be analyzed by any diff library I know of, because those libraries work in memory
  2. I don't actually need a diff - a diff typically has inserts, edits and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from the old file at an arbitrary offset".

I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: Write an algorithm that can store the files A and B with as few bytes as possible given the fact that both are quite similar.

Additional info: Although big parts might be identical, they are likely to have different offsets and to be out of order. The last fact is why a conventional diff might not save much.
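
To make the two required operations concrete, here is a minimal sketch of such a delta as a sequence of copy/insert records plus a routine that replays it. The class names and the in-memory representation are purely illustrative, not any particular tool's format.

    from dataclasses import dataclass

    @dataclass
    class Copy:
        src_offset: int   # byte offset into the old file A
        length: int       # number of bytes to copy from A

    @dataclass
    class Insert:
        data: bytes       # new bytes that have no counterpart in A

    def apply_delta(a_path, ops, out_path):
        """Rebuild B from A plus a sequence of Copy/Insert operations."""
        with open(a_path, "rb") as a, open(out_path, "wb") as out:
            for op in ops:
                if isinstance(op, Copy):
                    a.seek(op.src_offset)
                    out.write(a.read(op.length))
                else:
                    out.write(op.data)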

Comments (5)

南薇 2024-08-24 14:40:04

You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:

  1. Create a signature of one file, with e.g.

    rdiff signature A sig.txt
    
  2. Using the generated signature file sig.txt and the other big file, create the delta:

    rdiff delta sig.txt B delta
    
  3. Now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run

    rdiff patch A delta B
    

In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast; I get about 40 MB per second on my PC. I have just tried it on an 8 GB file, and the memory used by rdiff was about 1 MB.

意中人 2024-08-24 14:40:04

Take a look at rsync's algorithm, as it is designed to do pretty much exactly this, so it can copy deltas efficiently. And the algorithm is pretty well documented, as I recall.
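
For reference, the core of that algorithm is a weak rolling checksum that can be slid over the new file one byte at a time, paired with a strong hash to confirm candidate block matches. A minimal sketch of the weak checksum (Adler-32 style, as in the rsync paper) might look like the following; the strong-hash confirmation and the block index are omitted.

    M = 1 << 16   # modulus used by the rsync weak checksum

    def weak_checksum(block: bytes):
        """Initial (a, b) pair for a block."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        """Slide the window one byte to the right in O(1):
        drop out_byte on the left, append in_byte on the right."""
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b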

意中人 2024-08-24 14:40:04

That is exactly the problem known as "data deduplication". The most commonly used approach is:

  • Read over the files in blocks:
    • Split the data into so-called "chunks". The most commonly used approach is called "content-defined chunking using Rabin's fingerprinting method" (Code). That chunking approach leads to better deduplication on most data sets than using statically sized chunks (e.g. shown here).
    • Fingerprint the chunks using a cryptographic fingerprinting method, e.g. SHA-256.
    • Store the fingerprints in an index and look up each chunk's fingerprint to check whether it is already known. If the fingerprint is known, there is no need to store the chunk a second time. Only when the fingerprint is not known does the data have to be stored (a simplified sketch of this pipeline follows the list).
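
A compact sketch of this pipeline, assuming the whole input fits in memory for illustration. The rolling hash below is a simplified Gear-style hash standing in for Rabin fingerprinting, and the cut mask and chunk-size bounds are arbitrary choices.

    import hashlib

    MASK = (1 << 13) - 1                       # cut on average every ~8 KB
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

    def chunks(data: bytes):
        """Yield content-defined chunks of `data`."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF      # cheap rolling-style hash
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def deduplicate(data: bytes):
        """Return (store, recipe): unique chunks keyed by SHA-256 plus the
        fingerprint sequence needed to reassemble `data`."""
        store, recipe = {}, []
        for chunk in chunks(data):
            fp = hashlib.sha256(chunk).digest()
            if fp not in store:                 # store each unique chunk only once
                store[fp] = chunk
            recipe.append(fp)
        return store, recipe

Running this over B with A's chunk fingerprints pre-loaded into the store would keep only the chunks of B that do not already occur in A.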

Such a data deduplication algorithm is not as exact as, e.g., xdelta, but it is faster and more scalable for large data sets. The chunking and fingerprinting run at around 50 MB/s per core (Java). The index size depends on the redundancy, the chunk size, and the data size. For 200 GB, it should fit in memory for chunk sizes of e.g. 16 KB.
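
A back-of-the-envelope check of that memory claim; the per-entry overhead below is an assumption for illustration, not a measured figure.

    data_size  = 200 * 2**30              # 200 GB of input
    chunk_size = 16 * 2**10               # 16 KB average chunk size
    entry_size = 32 + 16                  # 32-byte SHA-256 fingerprint + assumed overhead
    num_chunks = data_size // chunk_size  # = 13,107,200 chunks
    index_size = num_chunks * entry_size  # = 629,145,600 bytes, i.e. about 600 MiB
    print(num_chunks, index_size / 2**20)

Even with generous per-entry overhead, the index stays well within typical RAM sizes.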

Bentley and McIlroy's compression approach is very similar (used, e.g., by Google's BigTable); however, I am not aware of any out-of-the-box command line tools using that compression technique.

The "fs-c" open source project contains most of the code that is necessary. However, fs-c itself tries only to measure the redundancies and the analzye files in-memory or using a Hadoop cluster.

冷夜 2024-08-24 14:40:04

One question is what the record size in your files is, i.e. can the offsets change byte by byte, or do the files consist of, say, 1024 B blocks? Assuming the data is byte-oriented, you could do the following:

  1. Create a suffix array for file A. This array is a permutation of all index values into file A. If A has 2^37 bytes, the index array is most easily represented by 64-bit integers, so every byte (offset into the file) corresponds to 8 bytes in the index array, and the index array will be 2^40 bytes (1 TB) long; for the 100 GB files in the question, roughly 800 GB. You can also index only every 1024th location, say, to reduce the size of the index array. This then degrades the quality of packing, depending on how long the average runs of copyable fragments are.

  2. Now, to greedily pack file B, you start from its beginning at offset o = 0 and then use the index array to find the longest match in A for the data starting at 'o'. You output the (offset, length) pair in the packed file. Without any encoding this takes 16 bytes in your case, so if the run is < 16 bytes you actually lose space. This can easily be remedied by using bit-level encoding and a bit marker to indicate whether you are encoding an isolated byte (marker + 8 bits = 9 bits) or an offset/length pair (marker + 40 bits + 40 bits = 81 bits), say. After packing the longest fragment at o, increase o to the next byte after the fragment and repeat until the end of the file.

The construction and use of a suffix array are easy, and you should find references easily. In high-speed applications people use suffix trees or suffix tries instead, which are more complex to manipulate but provide faster lookup. In your case you are going to have the array on secondary storage, and if the run speed of the packing phase is not an issue, a suffix array should be enough. A minimal in-memory sketch of the greedy packing follows.
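
The sketch below assumes inputs small enough to hold in RAM; the naive suffix-array construction and the copy/literal threshold are illustrative only (the binary search relies on the `key` argument of bisect, available in Python 3.10+).

    import bisect

    def build_suffix_array(data: bytes) -> list:
        # Naive O(n^2 log n) construction: sort all suffix start offsets.
        return sorted(range(len(data)), key=lambda i: data[i:])

    def common_prefix_len(a: bytes, b: bytes) -> int:
        n, i = min(len(a), len(b)), 0
        while i < n and a[i] == b[i]:
            i += 1
        return i

    def longest_match(data: bytes, sa: list, pattern: bytes):
        """Longest prefix of `pattern` occurring anywhere in `data`;
        returns (offset_in_data, match_length)."""
        pos = bisect.bisect_left(sa, pattern, key=lambda i: data[i:])
        best_off, best_len = 0, 0
        for j in (pos - 1, pos):                  # lexicographic neighbours
            if 0 <= j < len(sa):
                l = common_prefix_len(data[sa[j]:], pattern)
                if l > best_len:
                    best_off, best_len = sa[j], l
        return best_off, best_len

    def pack(a: bytes, b: bytes, min_copy: int = 4) -> list:
        """Greedily encode b as ('copy', offset, length) ops into a,
        falling back to ('literal', byte) when no useful match exists."""
        sa = build_suffix_array(a)
        ops, o = [], 0
        while o < len(b):
            off, length = longest_match(a, sa, b[o:])
            if length >= min_copy:
                ops.append(('copy', off, length))
                o += length
            else:
                ops.append(('literal', b[o]))
                o += 1
        return ops

For example, pack(b"the quick brown fox", b"quick brown fox, lazy dog") emits one copy op covering the shared run and literal ops for the rest.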

爱她像谁 2024-08-24 14:40:04

Depending on your performance requirements, you could get away with sampling the chunks you fingerprint, and growing them when they match. That way you don't have to run a checksum on your entire large file.

If you need arbitrary byte alignments and you really care about performance, look at the simhash algorithm, and use it to find similar but unaligned blocks.
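
A hedged sketch of simhash over byte shingles, which could be used to flag blocks of B that are similar, but not byte-identical, to blocks of A; the shingle size, the 64-bit width, and the BLAKE2b shingle hash are arbitrary choices for illustration.

    import hashlib

    def simhash(data: bytes, shingle: int = 8, bits: int = 64) -> int:
        """64-bit simhash of `data` built from overlapping byte shingles."""
        counts = [0] * bits
        for i in range(max(len(data) - shingle + 1, 1)):
            h = int.from_bytes(
                hashlib.blake2b(data[i:i + shingle], digest_size=8).digest(), "big")
            for b in range(bits):
                counts[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if counts[b] > 0)

    def hamming(x: int, y: int) -> int:
        """Number of differing bits between two simhashes."""
        return bin(x ^ y).count("1")

Blocks whose simhashes differ in only a few bits are candidates for a finer-grained byte-level comparison.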
