I'm using the MD5 algorithm to hash the key for an on-disk hash table (I know it's questionable whether this is the best algorithm to use for this, but I'm going with it for now. The problem is generalizable to any algorithm that produces a byte array). My problem is this:
The size of the hash code determines the number of combinations (buckets) in the hash table. Since MD5 is 128 bits, there are a huge number of combinations (~ 3.4e38), which is way too big for my purpose. So what I want to do is pick off the first n bits of the byte array that MD5 produces, and convert those into a long (or ulong) value. Since MD5 produces a byte array, it would be easy to do if I wanted an integral number of bytes, but going up a whole byte at a time is too big a jump in the number of combinations. I'm finding the bit-level version to be a lot trickier.
Goal:
n = 10 // I.e. I want 2^10 combinations
long pos = someFcn(byte[] key, n)
where key is the value being hashed, and n is the number of bits of the MD5 result I want to use. pos, then, will be an integer from 0 to 1023 (in the case of n = 10). If n = 11, pos will be from 0 to 2^11-1 = 2047, etc. Has to be somewhat fast/efficient.
Doesn't seem that hard but it's eluding me. Any help would be much appreciated. Thanks.
First, convert the first four bytes into an integer with BitConverter.ToInt32. It reads four bytes no matter what, but this probably won't make it measurably slower, since you're working with 32-bit registers for the rest of the calculation anyway, and special cases like "if N < 16, do this with the first two bytes" will just make it more complicated.

Then, given that integer, take the lowest N bits. If you really want a specific number of bits [a power-of-two number of buckets] not known at compile time, ~((-1)<<N) is a nice trick to get 2^N-1.

Or you could simply use ToUInt32 instead and take the result modulo a prime number [it might be slightly better to convert to UInt64 instead; then you've got fully half the bits to start with, in this case].
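The two bucketing options described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names first_u32_le, bucket_pow2 and bucket_mod are hypothetical, the bytes are assembled little-endian (which is what BitConverter yields on little-endian hardware), and the mask is written in unsigned form to avoid shifting a negative value.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble the first four digest bytes into a 32-bit value.
   Assumption: little-endian byte order, so byte 0 supplies the low bits. */
static uint32_t first_u32_le(const unsigned char *digest)
{
    return (uint32_t)digest[0]
         | ((uint32_t)digest[1] << 8)
         | ((uint32_t)digest[2] << 16)
         | ((uint32_t)digest[3] << 24);
}

/* Option 1: keep the lowest n bits, giving an index in [0, 2^n - 1].
   Requires 1 <= n <= 31 so the shift stays below the type width. */
static uint32_t bucket_pow2(const unsigned char *digest, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t mask = ~(~UINT32_C(0) << n);  /* unsigned form of ~((-1) << n) */
    return first_u32_le(digest) & mask;
}

/* Option 2: reduce modulo a bucket count (a prime, per the suggestion above). */
static uint32_t bucket_mod(const unsigned char *digest, uint32_t buckets)
{
    assert(buckets > 0);
    return first_u32_le(digest) % buckets;
}
```

With n = 10, bucket_pow2 always returns a value between 0 and 1023. Note that with little-endian assembly the low N bits come mostly from the first digest byte; for a well-mixed hash such as MD5, which bits are used should not matter much.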
To obtain the first 10 bits, for example:
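A minimal sketch in C, assuming "first" means the most significant bit of byte 0 comes first; the function name first_10_bits is hypothetical and not taken from the original answer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 10 bits of a digest, most significant bit first:
   all 8 bits of byte 0 followed by the top 2 bits of byte 1. */
static uint32_t first_10_bits(const unsigned char *digest)
{
    assert(digest != NULL);
    return ((uint32_t)digest[0] << 2) | ((uint32_t)digest[1] >> 6);
}
```

The result is always in the range 0 to 1023, whatever the remaining digest bytes contain.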
If you have an array like this,
then you can just scrape off the first n bits into an integer like so:
I'm assuming that CHAR_BIT == 8; feel free to generalize the code if you like. Also, the size of the array times 8 must be at least n.
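One way the general case might look in C is sketched below. It is not the original answer's code: the function name first_n_bits and its signature are assumptions, bits are taken most significant first starting at byte 0, and the result is limited to 64 bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#if CHAR_BIT != 8
#error "this sketch assumes 8-bit bytes"
#endif

/* Collect the first n bits of data (most significant bit of data[0] first)
   into an integer. Requires n <= 64 and n <= 8 * len. */
static uint64_t first_n_bits(const unsigned char *data, size_t len, unsigned n)
{
    assert(data != NULL);
    assert(n <= 64 && n <= 8 * len);

    uint64_t result = 0;
    size_t i = 0;

    /* Consume whole bytes while at least 8 bits are still wanted. */
    for (; n >= 8; n -= 8)
        result = (result << 8) | data[i++];

    /* Then take the top n (now < 8) bits of the next byte, if any remain. */
    if (n > 0)
        result = (result << n) | (uint64_t)(data[i] >> (8 - n));

    return result;
}
```

For the question's case, first_n_bits(md5, 16, 10) yields a bucket index from 0 to 1023, where md5 stands for the 16-byte digest.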