为什么 Adler-32 校验和算法中要对 65521 求模?

发布于 2024-07-22 17:15:57 字数 294 浏览 5 评论 0原文

Adler-32 校验和算法对 65521 进行求和。我知道 65521 是适合 16 位的最大素数,但为什么在该算法中使用素数很重要?

(我确信一旦有人告诉我,答案就会显而易见,但我大脑的数论部分不起作用。即使没有校验和算法方面的专业知识,聪明的人也会阅读 http://en.wikipedia.org/wiki/Fletcher%27s_checksum 可能可以向我解释一下。)

The Adler-32 checksum algorithm does sums modulo 65521. I know that 65521 is the largest prime number that fits in 16 bits, but why is it important to use a prime number in this algorithm?

(I'm sure the answer will seem obvious once someone tells me, but the number-theory parts of my brain just aren't working. Even without expertise in checksum algorithms, a smart person who reads http://en.wikipedia.org/wiki/Fletcher%27s_checksum can probably explain it to me.)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

尘曦 2024-07-29 17:15:57

为什么 Adler32 使用 mod prime?

来自 Adler 自己的网站 http://zlib.net /zlib_tech.html

然而,Adler-32
已建成,以最大限度地减少
对数据进行微小更改的方法
导致相同的检查值,
通过大量使用金额
大于字节并使用
prime (65521) 表示模数。 它是
在这个领域,一些分析是
应得的,但尚未实现
完成。

Adler-32 的主要原因是
当然,软件速度
实现。

Adler-32 的替代方案是 Fletcher-32,它将 65521 的模替换为 65535。本文表明,Fletcher-32 对于低速率随机误码的通道具有优越性。

使用它是因为底漆往往具有更好的混合性能。 究竟有多好还有待讨论。

其他解释

此线程中的其他人提出了一个有点令人信服的论点,即模素数更适合检测位交换。 然而,情况很可能并非如此,因为位交换极其罕见。 两个最常见的错误是:

  1. 随处常见的随机位翻转 (1 <-> 0)。
  2. 网络中常见的位移位(1 2 3 4 5 -> 2 3 4 5 或 1 1 2 3 4 5)

大部分位交换是由随机位翻转引起的,这些随机位翻转恰好看起来像位交换。

事实上,纠错码旨在承受 n 位偏差。 来自阿德勒的网站:

正确构造的 CRC-n 具有
不错的属性,少于 n 位
错误总是可以检测到的。 这是
Adler-32 并不总是如此——它可以
检测所有一字节或两字节错误,但是
可能会遗漏一些三字节错误。

使用质数模数的有效性

我就同一个问题写了一篇很长的文章。 为什么要对素数取模?

http://www.codexon.com/posts/hash-函数模素数神话

简短答案

我们对素数的了解远少于合数。 因此,像 Knuth 这样的人开始使用它们。

虽然素数与我们散列的大部分数据的关系可能确实较小,但增加表/模大小也会降低冲突的概率(有时比向下舍入到最接近的素数所获得的好处更多)。

这是 每个存储桶与 1000 万个加密随机整数的碰撞图表,比较 mod 65521 与 65535。

Why was mod prime used for Adler32?

From Adler's own website http://zlib.net/zlib_tech.html

However, Adler-32
has been constructed to minimize the
ways to make small changes in the data
that result in the same check value,
through the use of sums significantly
larger than the bytes and by using a
prime (65521) for the modulus. It is
in this area that some analysis is
deserved, but it has not yet been
done.

The main reason for Adler-32 is, of
course, speed in software
implementations.

An alternative to Adler-32 is Fletcher-32, which replaces the modulo of 65521 with 65535. This paper shows that Fletcher-32 is superior for channels with low-rate random bit errors.

It was used because primes tend to have better mixing properties. Exactly how good it is remains to be discussed.

Other Explanations

Someone else in this thread makes a somewhat convincing argument that modulus a prime is better for detecting bit-swapping. However, this is most likely not the case because bit-swapping is extremely rare. The two most prevalent errors are:

  1. Random bit-flips (1 <-> 0) common anywhere.
  2. Bit shifting (1 2 3 4 5 -> 2 3 4 5 or 1 1 2 3 4 5) common in networking

Most of the bit-swapping out there is caused by random bit-flips that happened to look like a bit swap.

Error correction codes are in fact, designed to withstand n-bits of deviation. From Adler's website:

A properly constructed CRC-n has the
nice property that less than n bits in
error is always detectable. This is
not always true for Adler-32--it can
detect all one- or two-byte errors but
can miss some three-byte errors.

Effectiveness of using a prime modulus

I did a long writeup on essentially the same question. Why modulo a prime number?

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

The short answer

We know much less about prime numbers than composite ones. Therefore people like Knuth started using them.

While it might be true that primes have less relationship to much of the data we hash, increasing the table/modulo size also decreases the probability of a collision (sometimes more than any benefit gained from rounding down to the nearest prime).

Here is a graph of collisions per bucket with 10 million cryptographically random integers comparing mod 65521 vs 65535.

我不在是我 2024-07-29 17:15:57

Adler-32 算法用于计算

A = 1 + b1 + b2 + b3 + ...

B = (1 + b1) + (1 + b1 + b2) + (1 + b1 + b2 + b3) + ... = 1 + b1 + 2 * b2 + 3 * b3 + ...

报告它们以 m 为模。 当 m 为素数时,对 m 进行模的数字形成数学家所说的。 域有一个方便的属性,对于任何非零 c,当且仅当 c * a = c * b 时,我们有 a = b。 将模 6 的乘法表(不是质数)与模 5 的乘法表进行比较,即:

* 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 1 2 3 4 5
2 0 2 4 0 2 4
3 0 3 0 3 0 3
4 0 4 2 0 4 2
5 0 5 4 3 2 1

* 0 1 2 3 4
0 0 0 0 0 0
1 0 1 2 3 4
2 0 2 4 1 3
3 0 3 1 4 2
4 0 4 3 2 1

现在,只要我们交换两个字节,A 部分就会被愚弄——毕竟加法是可交换的。 B 部分应该检测到这种错误,但是当 m 不是质数时,更多位置容易受到攻击。 考虑 Adler 校验和 mod 6

1 3 2 0 0 4

我们有 A = 4 和 B = 1。现在考虑交换 b2 和 b4:

1 0 2 3 0 4

A 和 B 不变,因为 2 * 3 = 4 * 0 = 2 * 0 = 4 * 3(模 6)。 也可以交换 2 和 5 以获得相同的效果。 当时间表不平衡时,这种情况更有可能发生——模 5,这些变化会被检测到。 事实上,素数模数唯一无法检测到单个交换的情况是当两个相等的索引 mod m 交换时(如果 m 很大,它们一定相距很远!)。^ 此逻辑也可以应用于交换的子字符串。

使用较小模数的缺点是它在随机数据上失败的频率会稍微高一些; 然而,在现实世界中,腐败很少是随机的。

^ 证明:假设我们将索引 i 和 j 与值 a 和 b 交换。 那么 ai + bj = aj + bi,所以 ai - aj + bj - b em>i = 0 且 (a - b)*(i - j) = 0。由于域是积分域,因此可以得出 a = b(值全等)或 i = j(索引全等)。

编辑:未知链接到的网站(http://www.zlib.net/zlib_tech.html) 清楚地表明 Adler-32 的设计根本没有原则性。 由于 DEFLATE 流中的霍夫曼代码,即使很小的错误也可能会更改帧(因为它依赖于数据)并导致输出中出现较大错误。 将此答案视为一个稍微人为的示例,说明为什么人们将某些属性归因于素数。

The Adler-32 algorithm is to compute

A = 1 + b1 + b2 + b3 + ...

and

B = (1 + b1) + (1 + b1 + b2) + (1 + b1 + b2 + b3) + ... = 1 + b1 + 2 * b2 + 3 * b3 + ...

and report them modulo m. When m is prime, the numbers modulo m form what mathematicians call a field. Fields have the handy property that for any nonzero c, we have a = b if and only if c * a = c * b. Compare the times table modulo 6, which is not a prime, with the times table modulo 5, which is:

* 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 1 2 3 4 5
2 0 2 4 0 2 4
3 0 3 0 3 0 3
4 0 4 2 0 4 2
5 0 5 4 3 2 1

* 0 1 2 3 4
0 0 0 0 0 0
1 0 1 2 3 4
2 0 2 4 1 3
3 0 3 1 4 2
4 0 4 3 2 1

Now, the A part gets fooled whenever we interchange two bytes -- addition is commutative after all. The B part is supposed to detect this kind of error, but when m is not a prime, more locations are vulnerable. Consider an Adler checksum mod 6 of

1 3 2 0 0 4

We have A = 4 and B = 1. Now consider swapping b2 and b4:

1 0 2 3 0 4

A and B are unchanged because 2 * 3 = 4 * 0 = 2 * 0 = 4 * 3 (modulo 6). One can also swap 2 and 5 to the same effect. This is more likely when the times table is unbalanced -- modulo 5, these changes are detected. In fact, the only time a prime modulus fails to detect a single swap is when two equal indexes mod m are swapped (and if m is big, they must be far apart!).^ This logic can also be applied to interchanged substrings.

The disadvantage in using a smaller modulus is that it will fail slightly more often on random data; in the real world, however, corruption is rarely random.

^ Proof: suppose that we swap indexes i and j with values a and b. Then ai + bj = aj + bi, so ai - aj + bj - bi = 0 and (a - b)*(i - j) = 0. Since a field is an integral domain, it follows that a = b (values are congruent) or i = j (indexes are congruent).

EDIT: the website that Unknown linked to (http://www.zlib.net/zlib_tech.html) makes it clear that the design of Adler-32 was not at all principled. Because of the Huffman code in a DEFLATE stream, even small errors are likely to change the framing (because it's data-dependent) and cause large errors in the output. Consider this answer a slightly contrived example for why people ascribe certain properties to primes.

朕就是辣么酷 2024-07-29 17:15:57

长话短说:

素数的模具有最好的位洗牌属性,而这正是我们想要的哈希值。

Long story short:

The modulo of a prime has the best bit-shuffeling properties, and that's exactly what we want for a hash-value.

可爱咩 2024-07-29 17:15:57

对于完全随机的数据,桶越多越好。

假设数据在某种程度上是非随机的。 现在,非随机性影响算法的唯一方法是创建一种情况,其中某些存储桶比其他存储桶有更高的使用概率。

如果模数是非素数,则影响组成模数的数字之一的任何模式都可能影响哈希。 因此,如果您使用 15,则每 3 或 5 个以及每 15 个模式可能会导致冲突,而如果您使用 13,则模式必须每 13 个才会导致冲突。

65535 = 3*5*17*257,因此涉及 3 或 5 的模式可能会导致使用此模数的冲突 - 例如,如果 3 的倍数由于某种原因更常见,那么只有 3 的倍数的桶才会发生冲突得到充分利用。

现在我不确定这实际上是否可能成为一个问题。 最好根据想要散列的类型的实际数据而不是随机数来凭经验确定冲突率。 (例如,涉及 http://en.wikipedia.org/wiki/Benford's_law">Benford 定律或某些此类不规则性的数值数据是否会导致影响该算法的模式?对于真实文本使用 ASCII 代码怎么样?)

For perfectly random data, the more buckets the better.

Let's say the data is non-random in some way. Now, the only way that the non-randomness could affect the algorithm is by creating a situation where some buckets have a higher probability of being used than others.

If the modulo number is non-prime, then any pattern affecting one of the numbers making up the modulo could affect the hash. So if you're using 15, a pattern every 3 or 5 as well as every 15 could cause collisions, while if you're using 13 the pattern would have to be every 13 to cause collisions.

65535 = 3*5*17*257, so a pattern involving 3 or 5 could cause collisions using this modulo-- if multiples of 3 were much more common for some reason, for instance, then only the buckets which were multiples of 3 would be put to good use.

Now I'm not sure whether, realistically, this is likely to be an issue. It would be good to determine the collision rate empirically with actual data of the type one wants to hash, not random numbers. (For instance, would numerical data involving http://en.wikipedia.org/wiki/Benford's_law">Benford's Law or some such irregularity cause patterns that would affect this algorithm? How about using ASCII codes for realistic text?)

东风软 2024-07-29 17:15:57

校验和通常用于检测两个事物是否不同,特别是在两个事物在同一时间和地点不可用的情况下。 它们可能在不同的地方(例如发送的信息包与接收的信息包)或不同的时间可用(例如存储的信息块与读回的信息块) 。 在某些情况下,可能需要检查独立存储在两个不同位置的两个事物是否可能匹配,而不必将实际数据从一个设备发送到另一设备(例如比较加载的代码图像或配置)。

如果所比较的事物不匹配的唯一原因是其中之一的随机损坏,那么对 Adler-32 校验和使用质数模数可能不是特别有用。 然而,如果其中一件事可能进行了一些“故意”的更改,那么使用非质数模数可能会导致某些更改被忽视。 例如,将一个字节从 00 更改为 FF,以及将早于或晚于 257 字节的倍数的另一个字节从 FF 更改为 00,在使用 Fletcher 校验和时会抵消,但在使用 Adler-32 校验和时则不会。 这种情况不太可能因随机损坏而发生,但在更改程序时可能会发生这种抵消性更改。 它们不太可能出现 257 字节的精确倍数,但可以通过使用质数模来避免这种风险(至少前提是文件中的字节数小于模数)

Checksums are generally used with the intention of detecting that two things are different, especially in cases where both things are not available at the same time and place. They might be available at different places (e.g. a packet of information as sent, versus a packet of information as received), or different times (e.g. a block of information when it was stored, versus a block of information when it was read back). In some cases, it may be desirable to check whether two things that are stored independently in two different places are likely to match, without having to send the actual data from one device to the other (e.g. comparing loaded code images or configurations).

If the only reasons that the things being compared wouldn't match would be random corruption of one of them, then the use of a prime modulus for an Adler-32 checksum is probably not particularly helpful. If, however, it's possible that one of the things might have had some 'deliberate' changes made to it, use of a non-prime modulus may cause certain changes to go unnoticed. For example, the effects of changing a byte from 00 to FF, and changing another byte that's some multiple of 257 bytes earlier or later from FF to 00, would cancel out when using a Fletcher's checksum, but not when using Adler-32 checksum. It's not particularly likely that such a scenario would occur from random corruption, but such offsetting changes could occur when changing a program. It wouldn't be especially likely that they'd occur an exact multiple of 257 bytes apart, but it's a risk which can be avoided by using a prime modulus (provided, at least, that the number of bytes in the file is smaller than the modulus)

生生漫 2024-07-29 17:15:57

答案在于场论。
当n是素数时(即模n的加法和乘法),具有加和乘运算的集合Z/Z_n是一个域。

换句话说,下面的方程:

m * x = (in Z/Z_n) 

对于任何 m 值(即 x = 0)只有一个解

考虑这个例子:

2 * x = 0 (mod 10)

这个方程有两个解,x = 0 AND x = 5。这是因为 10 不是素数,可以可写为 2 * 5。

该属性负责哈希值的更好分布。

The answer lies in the field theory.
The set Z/Z_n with the operations plus und times is a field when n is a prime (i.e. addition und multiplication with modulo n).

In other words, the following equation:

m * x = (in Z/Z_n) 

has only one solution for any value ofm (namely x = 0)

Consider this example:

2 * x = 0 (mod 10)

This equation has two solutions, x = 0 AND x = 5. That is because 10 is not a prime and can be written as 2 * 5.

This property is responsible for better distribution of the hash values.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文