Streaming rdiff - delta differencing?

Posted on 2024-10-17 15:18:21 · 666 words · 9 views · 0 comments

I have a product that does online backups using rdiff. What currently happens is:

  1. Copy the file to a staging area (so the file won't disappear or be modified while we work on it)

  2. Hash the original file and compute an rdiff signature (used for delta differencing), then compute an rdiff delta (this step is skipped if we have no prior version)

  3. Compress and encrypt the resulting delta

Currently, these phases run as separate passes, so we end up iterating over the file multiple times. For small files this is not a big deal (especially given disk caching), but for big files (tens or even hundreds of GB) it is a real performance killer.

I want to consolidate all of these steps into one read/write pass.

To do so, we have to be able to perform all of the above steps in a streaming fashion while still preserving all of the "outputs": the file hash, the rdiff signature, and the compressed and encrypted delta file. This will entail reading a block of data from the source file (say, 100 KB?), then working on that block in memory to update the hash, update the rdiff signature, and do the delta differencing, and then writing the output to a compress/encrypt output stream. The goal is to greatly reduce the amount of disk thrashing we do.
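As a rough illustration of the single-pass idea (this is not librsync code), the sketch below reads a stream once and hands every chunk to a list of consumers. The names `single_pass` and `hash_and_collect` are made up for this example, and the `seen` list is only a stand-in for where a signature or delta stage would plug in.

```python
import hashlib
import io


def single_pass(stream, consumers, chunk_size=100 * 1024):
    """Read `stream` once, passing each chunk to every consumer callable.

    Returns the total number of bytes read.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        for consume in consumers:
            consume(chunk)
        total += len(chunk)
    return total


def hash_and_collect(data, chunk_size=100 * 1024):
    """Hash `data` and feed a second, placeholder stage in the same pass."""
    digest = hashlib.sha256()
    seen = []  # hypothetical stand-in for a signature/delta stage
    n = single_pass(io.BytesIO(data), [digest.update, seen.append], chunk_size)
    return n, digest.hexdigest(), seen
```

Each consumer only needs an "accept the next chunk" entry point, which is why every stage in the pipeline has to support incremental input for this to work.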

Currently we use rdiff.exe (a thin layer on top of the underlying librsync library) to calculate signatures and generate binary deltas. This means they are computed in a separate process, and in one shot rather than in a streaming fashion.

How can I get this to do what I need using the librsync library?

Comments (1)

命比纸薄 2024-10-24 15:18:21

You can probably skip step 1 completely. The file can't be deleted while it's open, and choosing appropriate locking flags when opening it can prevent it from being modified as well. For example, the CreateFile function takes a dwShareMode argument.

You need to compute the entire rdiff signature before you can start creating the rdiff delta. You can avoid reading the entire file by computing signatures and then deltas for each (say) 100 MB block of the file at a time. You will lose some compression efficiency this way*. You might also consider switching from rdiff to xdelta, which can create a delta file in a single pass over the input.

Compression and encryption can be done in parallel with computing the delta. If they are done by separate programs, those programs often allow reading from standard input and writing to standard output. The easiest way to use this is with pipes in a batch file, for example:

rdiff signature oldfile oldfile.sig
rdiff delta oldfile.sig newfile | gzip -c | gpg -e -r ... > compressed_encrypted_delta

If you use libraries for compression/encryption in your program, you will need to choose libraries that support streaming operation.
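To show what "streaming operation" means in practice, here is a minimal sketch using Python's standard `zlib` module, whose `compressobj` accepts input incrementally. The helper name `compress_stream` is made up for this example. Encryption is not shown, since it would need a third-party library with a similar incremental interface.

```python
import zlib


def compress_stream(chunks, level=6):
    """Yield compressed output as input chunks arrive.

    The whole input is never held in memory at once.
    """
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # the compressor may buffer and return nothing yet
            yield out
    # flush() emits whatever is still buffered and finishes the stream
    yield comp.flush()
```

The delta bytes could be fed in as `chunks` as they are produced, and each yielded piece passed straight on to the encryption stage or the output file.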

*Or you will lose a lot of efficiency if data is moved around in the file. If someone prepends 100 MB to a 10 GB file, rdiff will produce a delta file of about 100 MB. rdiff done in blocks of 100 MB or less at a time will produce about 10 GB of delta. Blocks of 200 MB will produce about 5 GB of delta, since only half the data in each block comes from the corresponding block of the old version of the file.
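The arithmetic behind those figures can be written down as a back-of-the-envelope estimate. The function below is only a simplified model, not a measurement: it assumes that data in a new block can only be matched against the same-numbered block of the old file, ignores delta overhead, and treats 10 GB as 10,000 MB. The name `blockwise_delta_mb` is made up for this example.

```python
def blockwise_delta_mb(file_mb, prepended_mb, block_mb):
    """Estimate delta size (MB) when rdiff runs block by block.

    After a prepend, each new block holds `prepended_mb` of data that
    belonged to the previous old block, so that share cannot be matched.
    """
    unmatched_fraction = min(1.0, prepended_mb / block_mb)
    return file_mb * unmatched_fraction


# 10 GB file with 100 MB prepended, using 1 GB = 1000 MB for simplicity
delta_100 = blockwise_delta_mb(10_000, 100, 100)  # 10000.0 MB, about 10 GB
delta_200 = blockwise_delta_mb(10_000, 100, 200)  # 5000.0 MB, about 5 GB
```

Under the same model, larger blocks shrink the unmatched share further, which is the trade-off against having to process more data per block.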
