Binary patch generation in C#
Does anyone have, or know of, a binary patch generation algorithm implementation in C#?
Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.
The implementation would have to be relatively fast and work with huge files; it should exhibit O(n) or O(log n) runtime.
My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).
Any advice, or pointers for implementation would be nice.
Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.
The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:
- Grab the first four bytes from the old file, call this the key
- Add those bytes to a dictionary, where key -> position, where position is the position where I grabbed those 4 bytes, 0 to begin with
- Skip the first of these four bytes, grab another 4 (3 overlapping, 1 new), and add them to the dictionary the same way
- Repeat steps 1-3 for all 4-byte blocks in the old file
- From the start of the new file, grab 4 bytes, and attempt to look it up in the dictionary
- If found, find the longest match if there are several, by comparing bytes from the two files
- Encode a reference to that location in the old file, and skip the matched block in the new file
- If not found, encode 1 byte from the new file, and skip it
- Repeat steps 5-8 for the rest of the new file
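A minimal sketch of the steps above (in Python rather than C#, for brevity; the tuple-based patch format is just a placeholder, not a real wire encoding):

```python
def make_patch(old: bytes, new: bytes) -> list:
    # Steps 1-4: index every 4-byte block of the old file by its position.
    index = {}
    for pos in range(len(old) - 3):
        index.setdefault(old[pos:pos + 4], []).append(pos)

    # Steps 5-9: greedily encode the new file as copies and literals.
    patch, i = [], 0
    while i < len(new):
        candidates = index.get(new[i:i + 4]) if i + 4 <= len(new) else None
        if candidates:
            # Several positions may share the key; keep the longest match.
            best_pos, best_len = 0, 0
            for pos in candidates:
                n = 0
                while (pos + n < len(old) and i + n < len(new)
                       and old[pos + n] == new[i + n]):
                    n += 1
                if n > best_len:
                    best_pos, best_len = pos, n
            patch.append(("copy", best_pos, best_len))
            i += best_len
        else:
            patch.append(("lit", new[i]))   # no match: emit one literal byte
            i += 1
    return patch

def apply_patch(old: bytes, patch: list) -> bytes:
    out = bytearray()
    for op in patch:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out.append(op[1])
    return bytes(out)
```

The O(n^2) worst case is visible here too: a key that occurs at many positions forces a linear scan of candidates at every offset, which is exactly the degenerate behavior described above.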
This is somewhat like compression without windowing, so it uses a lot of memory. It is, however, fairly fast, and produces quite small patches as long as I keep the encoded output minimal.
A more memory-efficient algorithm uses windowing, but produces much bigger patch files.
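The windowed variant can be sketched the same way: keep only the positions from the last `window` bytes in the dictionary and evict older ones (the default window size below is arbitrary, chosen just for illustration):

```python
from collections import deque

def build_window_index(data: bytes, window: int = 1 << 16):
    """Index 4-byte keys -> positions, keeping only positions from the
    last `window` bytes; memory stays bounded, but matches farther back
    than the window are lost (hence the bigger patches)."""
    index = {}
    recent = deque()
    for pos in range(len(data) - 3):
        key = data[pos:pos + 4]
        index.setdefault(key, []).append(pos)
        recent.append((key, pos))
        if len(recent) > window:
            k, p = recent.popleft()       # evict the oldest indexed position
            index[k].remove(p)
            if not index[k]:
                del index[k]
    return index
```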
There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.
Edit #1: Here is a more detailed description of the above algorithm.
First, combine the two files, so that you have one big file. Remember the cut-point between the two files.
Secondly, perform the "grab 4 bytes and add their position to the dictionary" step for every position in the whole combined file.
Thirdly, from where the new file starts, do the loop with attempting to locate an existing combination of 4 bytes, and find the longest match. Make sure we only consider positions from the old file, or from earlier in the new file than we're currently at. This ensures that we can reuse material in both the old and the new file during patch application.
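A sketch of this combined-file variant (Python again; `cut` is the remembered cut-point, and the `pos < i` filter admits only old-file positions or earlier new-file positions, i.e. material that already exists when the patch is applied):

```python
def make_patch_combined(old: bytes, new: bytes) -> list:
    combined = old + new
    cut = len(old)                        # cut-point between the two files
    index = {}
    for pos in range(len(combined) - 3):  # index the whole combined file
        index.setdefault(combined[pos:pos + 4], []).append(pos)

    patch, i = [], cut                    # encoding starts where the new file begins
    while i < len(combined):
        best_pos, best_len = 0, 0
        for pos in index.get(combined[i:i + 4], []):
            if pos >= i:                  # only material available at apply time
                continue
            n = 0
            while i + n < len(combined) and combined[pos + n] == combined[i + n]:
                n += 1
            if n > best_len:
                best_pos, best_len = pos, n
        if best_len >= 4:
            patch.append(("copy", best_pos, best_len))
            i += best_len
        else:
            patch.append(("lit", combined[i]))
            i += 1
    return patch

def apply_combined(old: bytes, patch: list) -> bytes:
    buf = bytearray(old)                  # positions index into old + output so far
    for op in patch:
        if op[0] == "copy":
            for k in range(op[2]):        # byte-wise, so overlapping copies work
                buf.append(buf[op[1] + k])
        else:
            buf.append(op[1])
    return bytes(buf[len(old):])
```

Note that application must copy byte-by-byte: a copy may legitimately overlap material it is itself producing, LZ77-style, which is what makes reuse of earlier new-file material work.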
Edit #2: Source code to the above algorithm
You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.
The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.
@lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.
Or perhaps I'm just dense... :)
I took a quick peek at the algorithm from the site you gave, and unfortunately it is not usable. A comment from the binary diff file says:
Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.
My needs aren't optimal though, so I'm looking for a more practical solution.
Thanks for the answer though; I've bookmarked his utilities in case I ever need them.
Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read the book he references, and although the solution is good for finding optimal diffs, it is impractical in use due to the time requirements.
Edit #2: I'll definitely hunt down the python xdelta implementation.
Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.
Sorry I couldn't be more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.
bsdiff was designed to create very small patches for binary files. As stated on its page, it requires max(17*n, 9*n+m) + O(1) bytes of memory and runs in O((n+m) log n) time (where n is the size of the old file and m is the size of the new file). The original implementation is in C, but a C# port is described here and available here.
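To make that memory bound concrete (a hypothetical sizing, with n and m in MB):

```python
def bsdiff_memory(n: int, m: int) -> int:
    """Worst-case working memory quoted on the bsdiff page: max(17*n, 9*n+m)."""
    return max(17 * n, 9 * n + m)

# For a 100 MB old file and a 120 MB new file, the 17*n term dominates:
print(bsdiff_memory(100, 120))  # → 1700 (MB), i.e. roughly 1.7 GB of working memory
```

So for the "huge files" in the question, the 17x-the-old-file memory requirement is the main caveat to weigh against the very small patches.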
It might be worth checking out what some of the other guys are doing in this space and not necessarily in the C# arena either.
This is a library written in C#.
SVN also has a binary diff algorithm, and I know there's an implementation in Python, although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm.
This is only a rough guideline, but the following describes the rsync algorithm, which can be used to create binary patches.
http://rsync.samba.org/tech_report/tech_report.html
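The core trick in that report is a weak rolling checksum that slides one byte in O(1); a sketch following the report's a/b formulation (details hedged, this is not librsync's exact code):

```python
M = 1 << 16  # both halves of the checksum are taken mod 2^16

def weak_checksum(block: bytes):
    """a = plain sum of bytes; b = position-weighted sum (later bytes weigh less)."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, blocklen: int):
    """Slide the window one byte to the right in O(1), no rehash of the block."""
    a = (a - out_byte + in_byte) % M
    b = (b - blocklen * out_byte + a) % M
    return a, b
```

One side hashes fixed-size blocks of the old file; the other rolls this checksum across the new file at every offset and falls back to a strong hash only on weak-checksum hits, which is how rsync avoids a quadratic byte-by-byte comparison.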
If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.
http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx