Hashing a 32-bit int to a 16-bit int?

Posted 2024-09-06 14:43:48

What are some simple ways to hash a 32-bit integer (e.g. IP address, e.g. Unix time_t, etc.) down to a 16-bit integer?

E.g. hash_32b_to_16b(0x12345678) might return 0xABCD.

Let's start with this as a horrible but functional example solution:

function hash_32b_to_16b(val32b) {
    return val32b % 0xffff;
}

Question is specifically about JavaScript, but feel free to add any language-neutral solutions, preferably without using library functions.

The context for this question is generating unique IDs (e.g. a 64-bit ID might be composed of several 16-bit hashes of various 32-bit values). Avoiding collisions is important.

Simple = good. Wacky+obfuscated = amusing.

溇涏 2024-09-13 14:43:48

The key to maximizing the preservation of entropy of some original 32-bit 'signal' is to ensure that each of the 32 input bits has an independent and equal ability to alter the value of the 16-bit output word.

Since the OP is requesting a bit-size which is exactly half of the original, the simplest way to satisfy this criterion is to xor the upper and lower halves, as others have mentioned. Using xor is optimal because, as is obvious from the definition of xor, independently flipping any one of the 32 input bits is guaranteed to change the value of the 16-bit output.

The problem becomes more interesting when you need further reduction beyond just half-the-size, say from a 32-bit input to, let's say, a 2-bit output. Remember, the goal is to preserve as much entropy from the source as possible, so solutions which involve naively masking off the two lowest bits with (i & 3) are generally heading in the wrong direction; doing that guarantees that there's no way for any bits except the unmasked bits to affect the result, and that generally means there's an arbitrary, possibly valuable part of the runtime signal which is being summarily discarded without principle.

Following from the earlier paragraph, you could of course iterate with xor three additional times to produce a 2-bit output with the desired property of being equally-influenced by each/any of the input bits. That solution is still optimally correct of course, but involves looping or multiple unrolled operations which, as it turns out, aren't necessary!
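The repeated xor folding described above can be sketched roughly as follows. This is only an illustration: the function name xorFoldTo2Bits is made up for this example and does not come from the answer.

```javascript
// Fold a 32-bit value down to 2 bits by repeatedly XORing the upper half
// of the remaining bits into the lower half: 32 -> 16 -> 8 -> 4 -> 2.
// (xorFoldTo2Bits is a hypothetical name used for illustration.)
function xorFoldTo2Bits(val32b) {
    let v = val32b >>> 0;            // coerce to an unsigned 32-bit value
    v = (v ^ (v >>> 16)) & 0xffff;   // the fold the OP asked for (32 -> 16)
    v = (v ^ (v >>> 8)) & 0xff;      // first additional fold
    v = (v ^ (v >>> 4)) & 0xf;       // second additional fold
    v = (v ^ (v >>> 2)) & 0x3;       // third additional fold
    return v;
}
```

Each output bit ends up being the parity of every second input bit, which is why flipping any single input bit flips exactly one output bit.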

Fortunately, there is a nice technique of only two operations which gives the same optimal result for this situation. As with xor, it not only ensures that, for any given 32-bit value, flipping any single input bit will change the 2-bit output, but also that, given a uniform distribution of input values, the distribution of 2-bit output values will be perfectly uniform. In the current example, the method maps exactly 1,073,741,824 of the 4,294,967,296 possible input values to each of the four possible 2-bit hash results { 0, 1, 2, 3 }.

The method I mention here uses specific magic values that I discovered via exhaustive search, and which don't seem to be discussed very much elsewhere on the internet, at least for the particular use under discussion here (i.e., ensuring a uniform hash distribution that's maximally entropy-preserving). Curiously, according to this same exhaustive search, the magic values are in fact unique, meaning that for each of target bit-widths { 16, 8, 4, 2 }, the magic value I show below is the only value that, when used as I show here, satisfies the perfect hashing criteria outlined above.

Without further ado, the unique and mathematically optimal procedure for hashing 32 bits down to n bits, for n in { 16, 8, 4, 2 }, is to multiply by the magic value corresponding to n (unsigned, discarding overflow) and then take the n highest bits of the result. To isolate those bits as a hash value in the range [0 ... (2ⁿ - 1)], right-shift (unsigned!) the multiplication result by 32 - n bits.

The "magic" values, and C-like expression syntax are as follows:

Method

Maximum-entropy-preserving hash for reducing 32 bits to ...

Target Bits    Multiplier    Right Shift     Expression [1, 2]
-----------   ------------   -----------   -----------------------
    16         0x80008001        16        (i * 0x80008001) >> 16
     8         0x80808081        24        (i * 0x80808081) >> 24
     4         0x88888889        28        (i * 0x88888889) >> 28
     2         0xAAAAAAAB        30        (i * 0xAAAAAAAB) >> 30

Maximum-entropy-preserving hash for reducing 64 bits to ...

Target Bits   Multiplier           Right Shift         Expression [1, 2]
-----------   ------------------   -----------   -------------------------------
    32        0x8000000080000001       32        (i * 0x8000000080000001) >> 32
    16        0x8000800080008001       48        (i * 0x8000800080008001) >> 48
     8        0x8080808080808081       56        (i * 0x8080808080808081) >> 56
     4        0x8888888888888889       60        (i * 0x8888888888888889) >> 60
     2        0xAAAAAAAAAAAAAAAB       62        (i * 0xAAAAAAAAAAAAAAAB) >> 62

Notes:

[1] Use unsigned multiply and discard any overflow (64-bit multiply is not needed).
[2] If isolating the result using right-shift (as shown), be sure to use an unsigned shift operation.
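Since the question is about JavaScript, here is a minimal sketch of how the 32-bit expressions might be written there. The C-like form (i * 0x80008001) >> 16 does not carry over literally, because JavaScript numbers are doubles and >> is a signed shift. The sketch assumes Math.imul (which keeps only the low 32 bits of the product) and the unsigned shift >>>. The names MAGIC_32 and multiplyShiftHash are invented for this example.

```javascript
// Multipliers copied from the 32-bit table above, keyed by output width n.
// (MAGIC_32 and multiplyShiftHash are hypothetical names for illustration.)
const MAGIC_32 = { 16: 0x80008001, 8: 0x80808081, 4: 0x88888889, 2: 0xAAAAAAAB };

function multiplyShiftHash(val32b, n) {
    // Math.imul performs a 32-bit multiply and discards the overflow.
    const product = Math.imul(val32b, MAGIC_32[n]);
    // >>> is the unsigned right shift, leaving the top n bits in [0, 2^n - 1].
    return product >>> (32 - n);
}

// The function the question asks for, under the same assumptions.
function hash_32b_to_16b(val32b) {
    return multiplyShiftHash(val32b, 16);
}
```

A plain val32b * 0x80008001 would be computed in floating point and could lose low-order bits before truncation, which is why the sketch avoids it.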

Further discussion

I find all this quite cool. In practical terms, the key information-theoretical requirement is the guarantee that, for any m-bit input value and its corresponding n-bit hash value result, flipping any one of the m source bits always causes some change in the n-bit result value. Now although there are 2ⁿ possible result values in total, one of them is already "in-use" (by the result itself) since "switching" to that one from any other result would be no change at all. This leaves 2ⁿ - 1 result values that are eligible to be used by the m input values that differ from the original by a single flipped bit.

Let's consider an example; in fact, to show how this technique might seem to border on spooky or downright magical, we'll consider the more extreme case where m = 64 and n = 2. With 2 output bits there are four possible result values, { 0, 1, 2, 3 }. Assuming an arbitrary 64-bit input value 0x7521d9318fbdf523, we obtain its 2-bit hash value of 1:

 (0x7521d9318fbdf523 * 0xAAAAAAAAAAAAAAAB) >> 62   // result -->  '1'

So the result is 1 and the claim is that no value in the set of 64 values where a single-bit of 0x7521d9318fbdf523 is toggled may have that same result value. That is, none of those 64 other results can use value 1 and all must instead use either 0, 2, or 3. So in this example it seems like every one of the 2⁶⁴ input values—to the exclusion of 64 other input values—will selfishly hog one-quarter of the output space for itself. When you consider the sheer magnitude of these interacting constraints, can a simultaneously satisfying solution overall even exist?

Well sure enough, to show that (exactly?) one does, here are the hash result values, listed in order, for the inputs obtained by flipping a single bit of 0x7521d9318fbdf523 (one at a time), from MSB (position 63) down to LSB (0).

3 2 0 3 3 3 3 3 3 0 0 0 3 0 3 3 0 3 3 3 0 0 3 3 3 0 0 3 3 0 3 3  // continued…
0 0 3 0 0 3 0 3 0 0 0 3 0 3 3 3 0 3 0 3 3 3 3 3 3 0 0 0 3 0 0 3  // notice: no '1' values

As you can see, there are no 1 values, which entails that every bit in the source "as-is" must be contributing to the result (or, if you prefer, the de facto state of each and every bit in 0x7521d9318fbdf523 is essential to keeping the overall result at 1): no matter what single-bit change you make to the 64-bit input, the 2-bit result value will no longer be 1.
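For anyone who wants to reproduce the single-bit-flip experiment, the following is a rough sketch using JavaScript BigInt, since ordinary numbers cannot hold 64-bit integers exactly. The names MASK64, MAGIC_64_TO_2 and hash64To2 are assumptions made for this example. The mask stands in for "discard any overflow".

```javascript
// (MASK64, MAGIC_64_TO_2 and hash64To2 are hypothetical names for illustration.)
const MASK64 = (1n << 64n) - 1n;            // keeps only the low 64 bits
const MAGIC_64_TO_2 = 0xAAAAAAAAAAAAAAABn;  // 2-bit multiplier from the 64-bit table

function hash64To2(value) {
    // Multiply, discard overflow beyond 64 bits, keep the top 2 bits.
    return ((value * MAGIC_64_TO_2) & MASK64) >> 62n;
}

const sample = 0x7521d9318fbdf523n;

// Hash every neighbour that differs from the sample in exactly one bit,
// from the MSB (position 63) down to the LSB (position 0).
const neighbours = [];
for (let bit = 63n; bit >= 0n; bit--) {
    neighbours.push(hash64To2(sample ^ (1n << bit)));
}

console.log(hash64To2(sample).toString());   // the sample's own 2-bit hash
console.log(neighbours.join(" "));           // the 64 neighbour hashes
```

If the answer's claim holds, the second line of output should contain no occurrence of the value printed on the first line.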

Keep in mind that the "missing-value" table shown above was dumped from the analysis of just the one randomly-chosen example value 0x7521d9318fbdf523; every other possible input value has a similar table of its own, each one eerily missing its owner's actual result value while yet somehow being globally consistent across its set-membership. This property essentially corresponds to maximally preserving the available entropy during the (inherently lossy) bit-width reduction task.

So we see that every one of the 2⁶⁴ possible source values independently imposes, on exactly 64 other source values, the constraint of excluding one of the possible result values. What defies my intuition about this is that there are untold quadrillions of these 64-member sets, each of whose members also belongs to 63 other, seemingly unrelated bit-twiddling sets. Yet somehow despite this most confounding puzzle of interwoven constraints, it is nevertheless trivial to exploit the one (I surmise) resolution which simultaneously satisfies them all exactly.

All this seems related to something you may have noticed in the tables above: namely, I don't see any obvious way to extend the technique to the case of compressing down to a 1-bit result. In this case, there are only two possible result values { 0, 1 }, so if any/every given (e.g.) 64-bit input value still summarily excludes its own result from being the result for all 64 of its single-bit-flip neighbors, then that now essentially imposes the other, only remaining value on those 64. The math breakdown we see in the table seems to be signalling that a simultaneous result under such conditions is a bridge too far.

In other words, the special 'information-preserving' characteristic of xor (that is, its luxuriously reliable guarantee that, as opposed to and, or, etc., it can and will always change a bit) not surprisingly exacts a certain cost, namely, a fiercely non-negotiable demand for a certain amount of elbow room (at least 2 bits) to work with.

執念 2024-09-13 14:43:48

I think this is the best you're going to get. You could compress the code to a single line, but the vars are there for now as documentation:

function hash_32b_to_16b(val32b) {
    var rightBits = val32b & 0xffff; // Right-most (low) 16 bits
    var leftBits = val32b & 0xffff0000; // Left-most (high) 16 bits, still in place

    leftBits = leftBits >>> 16; // Shift the left-most 16 bits down to a 16-bit value

    return rightBits ^ leftBits; // XOR the left-most and right-most halves
}

Given the parameters of the problem, the best solution would have each 16-bit hash correspond to exactly 2^16 32-bit numbers. It would also IMO hash sequential 32-bit numbers differently. Unless I'm missing something, I believe this solution does those two things.

I would argue that security cannot be a consideration in this problem, as the hashed value simply has too few bits. I believe that the solution I gave provides an even distribution of 32-bit numbers to 16-bit hashes.

眼眸 2024-09-13 14:43:48

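It depends on the nature of the integers.
If they can contain bit masks, or can differ by powers of two, then a simple XOR will have a high probability of collisions.
You could try something like (i>>16) ^ ((i&0xffff) * p), with p being a prime number.

Secure hashes like MD5 are all fine, but they are obviously overkill here. Anything more complex than CRC16 is overkill.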
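A minimal sketch of the (i>>16) ^ ((i&0xffff) * p) idea follows. Two things here are assumptions rather than part of the answer: the choice p = 65521 (a prime just below 2^16) and the final & 0xffff mask, added so that the result stays within 16 bits. The names P and hashXorPrime are invented for this example.

```javascript
// Assumed prime; the answer only says "p is a prime" and does not pick one.
const P = 65521;

// (hashXorPrime is a hypothetical name used for illustration.)
function hashXorPrime(val32b) {
    const v = val32b >>> 0;     // coerce to unsigned 32-bit
    const high = v >>> 16;      // upper 16 bits
    const low = v & 0xffff;     // lower 16 bits
    // low * P stays below 2^32, so the floating-point product is exact.
    // The trailing mask (an addition to the answer's expression) keeps 16 bits.
    return (high ^ (low * P)) & 0xffff;
}
```

With a plain xor fold, 0x00010001 and 0 both hash to 0. With the multiplication in place they no longer collide, which is the kind of structured input this answer is worried about.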

后来的我们 2024-09-13 14:43:48

I would say just apply a standard hash like sha1 or md5 and then take the last 16 bits of that.

错々过的事 2024-09-13 14:43:48

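Assuming that you expect the least significant bits to "vary" the most, I think you're probably going to get a good enough distribution by just using the lower 16 bits of the value as a hash.

If the numbers you're going to hash won't have that kind of distribution, then the additional step of xor-ing in the upper 16 bits might be helpful.

Of course, this suggestion applies if you intend to use the hash only for some sort of lookup/storage scheme and are not looking for the crypto-related properties of non-guessability and non-reversibility (which the xor suggestion doesn't really buy you either).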

揪着可爱 2024-09-13 14:43:48

Something simple like this....

function hash_32b_to_16b(val32b) {    
    var h = hmac(secretKey, sha512);
    var v = val32b;
    for(var i = 0; i < 4096; ++i)
        v = h(v);
    return v % 0xffff;
}
