Is it possible to create a minimal perfect hash function in this situation?
I want to create a Hash Map (or another structure, if you have any suggestions) to store key-value pairs. The keys will all be inserted at once, at the same time the map is created, but I don't know what the keys will be (arbitrary-length strings) until runtime, when I need to create the map.

I am parsing a query string like "x=100&name=bob&color=red&y=150" (but the string can have an unlimited number of variables, and the variables can have names of any length).

I want to parse it once and create a Hash Map, preferably minimal and with a perfect hash function to satisfy linear storage requirements. Once the map is created, the values won't be modified or deleted, and no more key-value pairs will be added, so the entire map is effectively a constant. I'm assuming that a variable doesn't occur twice in the string (i.e. "x=1&x=2" is not valid).

I am coding in C, and currently have a function I can use like get("x"), which will return the string "100", but it parses the query string each time, which takes O(n) time. I'd like to parse it once when it is first loaded, since it is a very large query string and every value will be read several times. Even though I'm using C, I don't need code in C as an answer. Pseudocode, or any suggestions at all, would be awesome!
4 Answers
Try GPL'd gperf, or Bob Jenkins' public domain perfect hashing implementation in C.

Procedure:

1. Receive the query string and identify the domain of the perfect hash function by enumerating the list of keys.
2. Provide these keys and the list size (the range will be 1..size) to a perfect hash generation function derived from the above reference implementations.
3. Use the generated perfect hash function to create the HashMap.
4. Use the same perfect hash function to process the get requests on the HashMap.

Edit: Necrolis noted in the comments below that the reference implementations output perfect hash functions as C source code, so you'll need to modify them to generate something like bytecode for a VM instead. You could also use an interpreted language such as embedded Scheme or Lua.
It would be interesting to know whether this is worth the effort over a simple (non-perfect) HashMap once the overhead of creating the perfect hash function is amortized over the lookups.

Another option is Cuckoo hashing, which also has O(1) lookups.
There are some very good hashing routines; however, proving one of them to be near-perfect requires a lot of knowledge of the inputs. It seems that your inputs are unconstrained enough to make such a proof near-impossible.

Generally speaking, a perfect (or near-perfect) routine is sensitive to each bit/byte of input. For speed, the combining operation is typically XOR. The way such routines prevent two identical bytes from cancelling each other out is to shift or rotate the bits. However, such shifting should be done by a number that is relatively prime to the maximum number that can be represented; otherwise, patterns in the input could be partially cancelled by previous input. This reduces entropy in the result, increasing the chance of collision.
The typical solution is to start from a prime seed and, for each input byte, rotate the running hash, fold the byte in, and multiply by another prime.
The problems with such a routine are known. Basically, there is a lack of variation in the input, and this makes dispersing the input non-ideal. That said, the technique gives a good dispersion of input bits across the entire domain of outputs, provided there is sufficient input to wander away from the initial prime starting number. Unfortunately, picking a random starting number is not a solution, as it then becomes impossible to recompute the hash accurately.

In any case, the prime used in the multiplication should not overflow the multiplication. Likewise, the high-order bits that are carried out must be folded back into the low-order bits if you want to avoid losing the dispersion effects of the initial input (otherwise the result becomes grouped around the latter bits/bytes only). Prime number selection affects the dispersion, and sometimes tuning is required for good effect.

By now you should easily be able to see that a near-perfect hash takes more computational time than a decent less-than-perfect one. Hash algorithms are designed to account for collisions, and most Java hash structures resize at an occupancy threshold (typically around 70%, but tunable). Since the resizing is built in, as long as you don't write a terrible hash, those data structures will keep retuning themselves to reduce your chance of collision.

Optimizations that can speed up a hash include computing on groups of bits, dropping the occasional byte, pre-computing lookup tables of commonly used multipliers (indexed by input), etc. Don't assume that an optimization is faster: depending on the architecture, machine details, and "age" of the optimization, its assumptions sometimes no longer hold, and applying it actually increases the time to compute the hash.
There's no such thing as a perfect hash for what you're describing. A perfect hash would be the original input. If you're guaranteed that your data will only ever be certain things (such as Latin-based ASCII, or only certain keys), then you can hash well, but perfectly? No. Not possible. You also have to create a linked-list or vector hash-miss mechanism. Any variant in the system (like the count of inputs in your case) invalidates the perfect hash concept.

What you want defies the laws of math.

You can achieve near O(1), but there are unanswered questions here. The questions are:

Although a perfect hash isn't possible, it becomes entirely academic if you can simply have a simple linked list with a bucket size that is at least two standard deviations out from the mean of your potential unique hashes. It's minimal memory (relatively speaking, of course, and depending on total potential size), deletion friendly, and would give nearly O(1) lookup time as long as question 3 is answered with something like "far smaller".
The following should get you started, but I'll leave the decision about which hash algorithm to use up to you...

Usage examples (as assertions) and efficiency tests, using int as the data value type... Additionally, I ran some tests using 100,000 randomly generated ASCII keys with lengths between 5 and 1000 characters, which showed the following...

As you can see, it has the potential to perform quite well. An efficiency of 80% means that approximately 80% of the lookups are O(1), about 16% are O(2), about 3.2% are O(3), and about 0.8% are O(4+). This means that on average a lookup will take O(1.248).

Likewise, an efficiency of 50% means that 50% of lookups are O(1), 25% are O(2), 12.5% are O(3), and 12.5% are O(4+).

You really just need to pick (or write) the right hashing algorithm for your known factors and tweak things for your specific needs.
Notes: you can implement things like move(), swap(), sort(), insert(), etc. by managing entry->prev and entry->next.
If you know the set of all possible variable names, then it would be possible to use a perfect hash to map the names to numbers,
but each of the hash tables would end up having the same length. For example, if x and y are the names, then the map would always be of length 2. If perfect(str) turns 'x' and 'y' into 0 and 1, then the function get would be: