Why is XOR the default way to combine hashes?
Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).
The best explanation I've found is touched briefly here on these hash function guidelines:
XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
...
* At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.
Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?
xor is a dangerous default function to use when hashing. It is better than AND and OR, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero: so (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far more collisions at zero than you should.

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.

On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Addition's truth table is similar to xor's on the bit in question, but it also sends a carry to the next bit over when both values are 1. This means it erases less information.

So hash(a) + hash(b) is better than hash(a) xor hash(b), in that if a == b the result is hash(a) << 1 instead of 0.

This remains symmetric, so "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost: hash(a)*3 + hash(b). (Calculating hash(a) once and storing it is advised if you implement the *3 as a shift and an add.) Any odd constant instead of 3 will bijectively map a "k-bit" unsigned integer to itself, because arithmetic on unsigned integers is math modulo 2^k for some k, and any odd constant is relatively prime to 2^k.

For an even fancier version, we can examine boost::hash_combine, which is effectively:
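In rough outline (a sketch; actual Boost wraps this in templates and a hash_value call, and the exact details vary between versions):

    #include <cstdint>

    // Sketch of the classic boost::hash_combine step (32-bit version).
    // 0x9e3779b9 is the inverse golden ratio as a 32-bit fixed-point fraction.
    std::uint32_t hash_combine(std::uint32_t lhs, std::uint32_t rhs) {
        lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
        return lhs;
    }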
Here we add together some shifted versions of lhs with a constant (which is basically random 0s and 1s – in particular it is the inverse of the golden ratio as a 32-bit fixed-point fraction), with some addition and an xor. This breaks symmetry, and it introduces some "noise" if the incoming hashed values are poor (i.e., imagine every component hashes to 0 – the above handles it well, generating a smear of 1s and 0s after each combine; my naive 3*hash(a)+hash(b) simply outputs 0 in that case).

Extending this to 64 bits (using the expansion of pi as our constant for 64 bits, as it is odd at 64 bits):
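A minimal sketch, assuming a constant of roughly 2^64/pi rounded to an odd value (any large odd constant with a good mix of bits serves equally well):

    #include <cstdint>

    // 64-bit hash_combine sketch; 0x517cc1b727220a95 is approximately 2^64/pi,
    // chosen here because it is odd (an assumed constant -- any similar odd value works).
    std::uint64_t hash_combine64(std::uint64_t lhs, std::uint64_t rhs) {
        lhs ^= rhs + 0x517cc1b727220a95ULL + (lhs << 6) + (lhs >> 2);
        return lhs;
    }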
(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64-bit system, it is usually a 64-bit unsigned integer; on a 32-bit system, a 32-bit unsigned integer.)
Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:
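With both input bits on the left:

    a  b | a AND b | a OR b | a XOR b
    0  0 |    0    |   0    |    0
    0  1 |    0    |   1    |    1
    1  0 |    0    |   1    |    1
    1  1 |    1    |   1    |    1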
Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?
In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples.
A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.
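A minimal C++ sketch of that combiner (the particular multiplier below is an arbitrary large odd constant, not one specified in the answer):

    #include <cstdint>

    // Order-sensitive combine: m * H(A) + H(B), with m a large odd constant.
    // The multiplier is an illustrative choice; any large odd value will do.
    std::uint64_t combine(std::uint64_t ha, std::uint64_t hb) {
        const std::uint64_t m = 0x9E3779B97F4A7C15ULL; // large and odd
        return m * ha + hb;                            // wraps modulo 2^64
    }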
Xor may be the "default" way to combine hashes but Greg Hewgill's answer also shows why it has its pitfalls:
The xor of two identical hash values is zero.
In real life, identical hashes are more common than one might expect. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.
In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!
There's something I want to explicitly point out for others who find this page. AND and OR restrict output like BlueRaja - Danny Pflughoe is trying to point out, but can be better defined:
First I want to define two simple functions I'll use to explain this: Min() and Max().
Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.
Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.
If you are given:
C = A AND B
Then you can find that
C <= Min(A, B)
We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit and every one bit has a chance to become a zero bit (and thus a smaller value).

With:
C = A OR B
The opposite is true:
C >= Max(A, B)
With this, we see the corollary to the AND function. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus a larger number.

This implies that the state of the input places restrictions on the output. If you AND anything with 90, you know the output will be equal to or less than 90 regardless of what the other value is.
For XOR, there is no implied restriction based on the inputs. There are special cases – for example, if you XOR a byte with 255 you get its inverse – but any possible byte can be output. Every bit has a chance to change state depending on the same bit in the other operand.
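A tiny C++ check of those two bounds (the sample values are arbitrary; 90 matches the example above):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int main() {
        // AND can only clear bits, OR can only set them, so:
        //   (a & b) <= min(a, b)   and   (a | b) >= max(a, b)
        std::uint8_t a = 90, b = 0xA7;          // arbitrary sample values
        assert((a & b) <= std::min(a, b));
        assert((a | b) >= std::max(a, b));
        return 0;
    }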
If you XOR a random input with a biased input, the output is random. The same is not true for AND or OR. Example:
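Taking an arbitrary random-looking byte and a heavily biased one (illustrative values):

    random = 1011 0010
    biased = 1111 0000

    random XOR biased = 0100 0010   (each output bit still follows the random bit, so it stays uniform)
    random AND biased = 1011 0000   (the low four bits are forced to 0)
    random OR  biased = 1111 0010   (the high four bits are forced to 1)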
As @Greg Hewgill mentions, even if both inputs are random, using AND or OR will result in biased output.

The reason we use XOR over something more complex is that, well, there's no need: XOR works perfectly, and it's blazingly stupid-fast.
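Take the AND truth table, with the two input bits on the left and the output on the right:

    a  b | a AND b
    0  0 |    0
    0  1 |    0
    1  0 |    0
    1  1 |    1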
Cover the left 2 columns and try to work out what the inputs are using just the output.
When you saw a 1-bit you should have worked out that both inputs were 1.
Now do the same for XOR:
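Inputs on the left, output on the right, as before:

    a  b | a XOR b
    0  0 |    0
    0  1 |    1
    1  0 |    1
    1  1 |    0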
XOR gives away nothing about its inputs.
XOR does not sometimes ignore one of its inputs, the way OR and AND can.
If you take AND(X, Y), for example, and feed input X with false, then input Y does not matter... and one probably wants both inputs to matter when combining hashes.

If you take XOR(X, Y), then BOTH inputs ALWAYS matter. There is no value of X for which Y does not matter. If either X or Y is changed, the output will reflect that.
背后的魔力以及为什么使用 Java 代码的更多信息如此频繁地使用它。它并不完美,但具有非常好的一般性能特征。The source code for various versions of
hashCode()
in java.util.Arrays is a great reference for solid, general use hashing algorithms. They are easily understood and translated into other programming languages.Roughly speaking, most multi-attribute
hashCode()
implementations follow this pattern:You can search other StackOverflow Q&As for more information about the magic behind
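A representative sketch (the class and field names are made up for illustration; 17 and 31 are the conventional seed and multiplier):

    // Hypothetical class illustrating the usual multi-attribute hashCode() pattern.
    class Point {
        private final int x;
        private final int y;
        private final String label;

        Point(int x, int y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }

        @Override
        public int hashCode() {
            int result = 17;                                    // arbitrary non-zero seed
            result = 31 * result + x;                           // fold in each attribute,
            result = 31 * result + y;                           // multiplying by 31 each time
            result = 31 * result + (label == null ? 0 : label.hashCode());
            return result;
        }
    }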
You can search other StackOverflow Q&As for more information about the magic behind 31, and why Java code uses it so frequently. It is imperfect, but has very good general performance characteristics.