How does the rsync algorithm correctly identify duplicate blocks?
I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that one copy of `12345` has been removed.
Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the values on the sender's side using the weak checksum gives the following values list.
abcde|12345|fghij
abcde -> 495
12345 -> 255
fghij -> 520
values = [495, 255, 520]
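These values can be reproduced with a short sketch. One assumption to flag: the "weak checksum" here is taken to be a plain sum of byte values, which is what matches the numbers in the example; real rsync uses an Adler-32-style checksum with two components.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    # Assumed for this example: a plain byte sum. This reproduces the
    # values above; real rsync uses an Adler-32-style rolling checksum.
    return sum(block)

B = b"abcde12345fghij"  # the sender's data
blocks = [B[i:i + BLOCK_SIZE] for i in range(0, len(B), BLOCK_SIZE)]
values = [weak_checksum(b) for b in blocks]
print(values)  # [495, 255, 520]
```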
Next we check to see if any hash values differ in A. If there's a matching block we can skip to the end of that block for the next check. If there's a non-matching block then we've found a difference. I'll step through this process.
- Hash the first block. Does this hash exist in the values list? `abcde -> 495` (yes, so skip)
- Hash the second block. Does this hash exist in the values list? `12345 -> 255` (yes, so skip)
- Hash the third block. Does this hash exist in the values list? `12345 -> 255` (yes, so skip)
- Hash the fourth block. Does this hash exist in the values list? `fghij -> 520` (yes, so skip)
- No more data, we're done.
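The walk-through above can be reproduced in a few lines (again assuming the byte-sum weak checksum): every block of A finds a match in the values list, even though A and B differ.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    return sum(block)  # assumed byte-sum checksum, as in the example

values = [495, 255, 520]     # weak checksums of B's blocks
A = b"abcde1234512345fghij"  # the receiver's data

matched = [weak_checksum(A[i:i + BLOCK_SIZE]) in values
           for i in range(0, len(A), BLOCK_SIZE)]
print(matched)  # [True, True, True, True] -- every block "matches"
```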
Since every hash was found in the values list, we conclude that A and B are the same, which, in my humble opinion, isn't true.
It seems to me this will happen whenever there is more than one block that shares the same hash. I know I have skipped the step of calculating and checking the strong hash, but that won't make a difference, since the second and third blocks are exactly the same.
What am I missing?
1 Answer
The rsync algorithm uses two checksums per block: a weak "rolling" checksum and a strong one. The rolling checksum is evaluated at every byte offset of the file being scanned, not only on block boundaries, and the result of the scan is a list of block references plus literal bytes, not a bare same/different verdict. In your example, the "doubled-up" block therefore just produces references to identical content, and the reconstructed file still comes out right.
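To illustrate why the duplicate block is harmless, here is a hypothetical sketch (not real rsync code, and still using the byte-sum checksum from the question): the matching phase emits a list of block references and literal bytes, which is then used to reconstruct the other side's file. The checksum is checked at every byte offset, sliding one byte on a miss and jumping a whole block on a hit.

```python
BLOCK_SIZE = 5

def weak_checksum(block: bytes) -> int:
    return sum(block)  # assumed byte-sum checksum, as in the question

def make_delta(sender_data: bytes, index: dict) -> list:
    """Scan the sender's data against the receiver's checksum index,
    emitting ("ref", block_idx) on a match and ("lit", byte) otherwise.
    Real rsync rolls the checksum incrementally and confirms each weak
    match with a strong hash; this sketch recomputes it for clarity."""
    tokens, i = [], 0
    while i < len(sender_data):
        window = sender_data[i:i + BLOCK_SIZE]
        if len(window) == BLOCK_SIZE and weak_checksum(window) in index:
            tokens.append(("ref", index[weak_checksum(window)]))
            i += BLOCK_SIZE          # jump past the matched block
        else:
            tokens.append(("lit", sender_data[i]))
            i += 1                   # slide the window by one byte
    return tokens

def reconstruct(tokens: list, blocks: list) -> bytes:
    return b"".join(blocks[v] if kind == "ref" else bytes([v])
                    for kind, v in tokens)

A = b"abcde1234512345fghij"  # receiver
B = b"abcde12345fghij"       # sender
blocks = [A[i:i + BLOCK_SIZE] for i in range(0, len(A), BLOCK_SIZE)]
index = {weak_checksum(b): i for i, b in enumerate(blocks)}  # dupes collapse
tokens = make_delta(B, index)
print(reconstruct(tokens, blocks) == B)  # True: A rebuilds B exactly
```

Because the output is a reconstruction script rather than an equality claim, two blocks sharing a checksum just means two references can point at the same (interchangeable) block, and the delta for this example deletes one `12345` correctly.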