How does the rsync algorithm correctly identify duplicate blocks?
I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that one copy of `12345` has been removed.
Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the values on the sender's side using the weak checksum gives the following values list.
abcde|12345|fghij
abcde -> 495
12345 -> 255
fghij -> 520
values = [495, 255, 520]
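These values can be reproduced with a short sketch. One assumption to flag: the "weak checksum" here is taken to be a plain sum of byte values, which is what matches the numbers in the example; real rsync uses an Adler-32-style checksum with two components.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    # Assumed for this example: a plain byte sum. This reproduces the
    # values above; real rsync uses an Adler-32-style rolling checksum.
    return sum(block)

B = b"abcde12345fghij"  # the sender's data
blocks = [B[i:i + BLOCK_SIZE] for i in range(0, len(B), BLOCK_SIZE)]
values = [weak_checksum(b) for b in blocks]
print(values)  # [495, 255, 520]
```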
Next we check to see if any hash values differ in A. If there's a matching block we can skip to the end of that block for the next check. If there's a non-matching block then we've found a difference. I'll step through this process.
- Hash the first block. Does this hash exist in the values list? `abcde -> 495` (yes, so skip)
- Hash the second block. Does this hash exist in the values list? `12345 -> 255` (yes, so skip)
- Hash the third block. Does this hash exist in the values list? `12345 -> 255` (yes, so skip)
- Hash the fourth block. Does this hash exist in the values list? `fghij -> 520` (yes, so skip)
- No more data, we're done.
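The walk-through above can be reproduced in a few lines (again assuming the byte-sum weak checksum): every block of A finds a match in the values list, even though A and B differ.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    return sum(block)  # assumed byte-sum checksum, as in the example

values = [495, 255, 520]     # weak checksums of B's blocks
A = b"abcde1234512345fghij"  # the receiver's data

matched = [weak_checksum(A[i:i + BLOCK_SIZE]) in values
           for i in range(0, len(A), BLOCK_SIZE)]
print(matched)  # [True, True, True, True] -- every block "matches"
```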
Since every hash was found in the values list, we conclude that A and B are the same, which, in my humble opinion, isn't true.
It seems to me this will happen whenever there is more than one block that shares the same hash. I know I have skipped the step of calculating and checking the strong hash, but that won't make a difference, since the second and third blocks are exactly the same.
What am I missing?
1 Answer
The rsync algorithm uses two checksums per block: a weak "rolling" checksum and a strong one. The rolling checksum is evaluated at every byte offset of the file being scanned, not only on block boundaries, and the result of the scan is a list of block references plus literal bytes, not a bare same/different verdict. In your example, the "doubled-up" block therefore just produces references to identical content, and the reconstructed file still comes out right.
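To illustrate why the duplicate block is harmless, here is a hypothetical sketch (not real rsync code, and still using the byte-sum checksum from the question): the matching phase emits a list of block references and literal bytes, which is then used to reconstruct the other side's file. The checksum is checked at every byte offset, sliding one byte on a miss and jumping a whole block on a hit.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    return sum(block)  # assumed byte-sum checksum, as in the question

def make_delta(sender_data: bytes, index: dict) -> list:
    """Scan the sender's data against the receiver's checksum index,
    emitting ("ref", block_idx) on a match and ("lit", byte) otherwise.
    Real rsync rolls the checksum incrementally and confirms each weak
    match with a strong hash; this sketch recomputes it for clarity."""
    tokens, i = [], 0
    while i < len(sender_data):
        window = sender_data[i:i + BLOCK_SIZE]
        if len(window) == BLOCK_SIZE and weak_checksum(window) in index:
            tokens.append(("ref", index[weak_checksum(window)]))
            i += BLOCK_SIZE          # jump past the matched block
        else:
            tokens.append(("lit", sender_data[i]))
            i += 1                   # slide the window by one byte
    return tokens

def reconstruct(tokens: list, blocks: list) -> bytes:
    return b"".join(blocks[v] if kind == "ref" else bytes([v])
                    for kind, v in tokens)

A = b"abcde1234512345fghij"  # receiver
B = b"abcde12345fghij"       # sender
blocks = [A[i:i + BLOCK_SIZE] for i in range(0, len(A), BLOCK_SIZE)]
index = {weak_checksum(b): i for i, b in enumerate(blocks)}  # dupes collapse
tokens = make_delta(B, index)
print(reconstruct(tokens, blocks) == B)  # True: A rebuilds B exactly
```

Because the output is a reconstruction script rather than an equality claim, two blocks sharing a checksum just means two references can point at the same (interchangeable) block, and the delta for this example deletes one `12345` correctly.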