帮忙设计一个哈希函数来检测重复记录?
让我解释一下到目前为止我的计划。这是一个魔方解算器。我得到了一个打乱的立方体(这是初始状态)。这成为图的根节点。我正在使用迭代加深深度优先搜索来“强力”这个混乱的立方体到可识别的状态,然后我可以使用模式识别来解决。
正如您可以想象的那样,这是一个非常大的图,因此我想提出某种哈希功能来检测该图中的重复节点(从而加快遍历速度)。
我对哈希函数很不熟悉,但这就是我的想法......每个节点本质上都是魔方的不同状态。因此,如果我遇到已经看到的立方体状态(节点),我想跳过它。因此,我需要一个哈希函数,将我从状态变量转换为校验和,其中状态变量是一个 54 个字符的字符串。唯一允许的字符是 y、r、g、o、b、w(对应于颜色)。
任何设计此哈希函数的帮助将不胜感激。
Let me explain my program thus far. It is a rubiks cube solver. I am given a scrambled cube (this is the initial state). This becomes the root node of a graph. I am using iterative deepening depth first search
to "brute force" this scrambled cube to a recognizable state which I can then use pattern recognition to solve.
As you can imagine, this is a very large graph, so I would like to come up with some sort of hashing functionality to detect duplicate nodes in this graph (thus speeding up the traversal).
I am largely unfamiliar with hashing functions, but here is what I am thinking... Each node is essentially a different state of the rubik's cube. So if I come to a cube state (node) that has already be seen, I want to skip over it. So I need a hashing function that takes me from the state variable to a checksum, where the state variable is a 54-character string. The only allowed characters are y, r, g, o, b, w
(which correspond to colors).
Any help designing this hash function would be greatly appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
为了最快地检测和删除重复项 - 首先避免生成许多重复位置。这很容易做到,并且比生成然后查找重复项更快。例如,如果您有像 F 和 B 这样的动作,如果您允许子序列 FB,则不允许 BF,这会产生相同的结果。如果您刚刚完成 3F,则不要在其后添加 F。您可以根据最后 3 个动作生成一个小型查找表,用于允许的下一步动作。
对于剩余的重复项,您需要快速哈希,因为有很多位置。为了让你的散列速度更快,正如其他人评论的那样,你希望它散列的内容,即位置的表示,要小。有 12 个边立方体和 8 个角立方体。表示每个立方体的位置和方向仅需要每个立方体 5 位,即总共 100 位(12.5 字节)。对于边缘,四位用于位置,一位用于翻转。对于角球,其 3 位用于位置,2 位用于旋转。您可以忽略最后一个边缘立方体,因为它的位置和翻转是由其他边缘立方体固定的。通过这种表示,您的位置已经减少到 12 个字节。
在魔方位置中大约有 70 个真实信息位,而 96 位足够接近 70 位,这使得进一步散列这些位实际上会产生相反的效果。即,将棋盘的这种表示视为您的散列。这可能听起来有点奇怪,但从你的问题来看,我设想你同时尝试使用更不紧凑的立方体表示,它更适合你的模式匹配。在这种情况下,可以将 12 字节值视为哈希,其优点是永远不会发生冲突的哈希。这使得重复的测试代码和新值插入更短、更简单、更快。它将比目前建议的 MD5 解决方案便宜。
您还可以使用许多其他技巧来减少搜索重复位置的工作。查看 http://cube20.org/ 获取想法。
For the fastest duplicate detection and removal - avoid generating many of the repeated positions in the first place. This is easy to do and quicker than generating and then finding the repeats. So for example if you have moves like F and B, if you allow the sub sequence FB don't also allow BF, which gives the same result. If you've just done 3F, don't follow it with F. You can generate a small look-up table for allowed next moves, given the last three moves.
For the remaining duplicates you want a fast hash because there are a lot of positions. To make your hash go fast, as others have commented, you want what it hashes from, the representation of the position, to be small. There are 12 edge cubies and there are 8 corner cubies. Representing each cubies position and orientation need take only five bits per cubie, i.e. 100 bits (12.5 bytes) total. For edges its four bits for position and one for flip. For corners its three bits for position and 2 for spin. You can ignore the last edge cubie since its position and flip is fixed by the others. With this representation you are already down to 12 bytes for the position.
You have about 70 real bits of information in a rubik cube position, and 96 bits is close enough to 70 to make it actually counter productive hashing those bits further. I.e. treat this representation of the board as your hash. That may sound a bit strange, but from your question I'm envisaging you at the same time experimenting with a less compact representation of the cube that is more amenable to your pattern matching. In that case the 12 byte value can be treated as if it were a hash, with the advantage that it's a hash that never has a collision. That makes the duplicate testing code and new value insertion shorter and simpler and faster. It's going to be cheaper than the MD5 solutions suggested so far.
There are many other tricks you could use to cut down the work in searching for repeated positions. Have a look at http://cube20.org/ for ideas.
您始终可以尝试加密哈希函数。由于您的问题不是安全问题(没有攻击者故意尝试查找哈希为相同值的不同状态),因此您可以使用损坏哈希函数。我建议尝试 MD4,速度相当快。您的 54 个字符的字符串非常适合 MD4 输入(MD4 可以将最多 55 个字节的输入作为单个块处理)。
一台基本的 2.4 GHz PC 使用单个内核,通过简单的展开 C 实现(例如,类似于示例代码中的
MD4Transform()
函数),每秒可以散列大约 1200 万个这样的字符串在 RFC 1320 中)。这可能足以满足您的需求。You can always try a cryptographic hash function. Since your problem is not a question of security (there is no attacker purposely trying to find distinct states which hash to the same value), you can use a broken hash function. I recommend trying MD4, which is quite fast. Your 54-character string is quite appropriate for MD4 input (MD4 can process inputs up to 55 bytes as a single block).
A basic 2.4 GHz PC can hash about 12 millions such strings per second, using a single core, with a simple unrolled C implementation (e.g. one which would look like the
MD4Transform()
function in the sample code included in RFC 1320). This may be enough for your needs.1)不要使用哈希
魔方上有
9*6 = 54
个独立的面。即使每个面浪费 1 个字节,这也是 432 位,因此散列不会为您节省太多空间。每个面 3 位的更好封装达到 162 位(21 字节)。在我看来,你需要一种紧凑的方式来表示魔方。OTOH,如果您想要存储一组许多以前访问过的状态,那么我发现使用布隆过滤器而不是真实的集合可以给我带来不错的结果(但通常不是最佳的),而且空间利用率要低得多。
2)如果您热衷于哈希的想法:
只需使用 MD5,它比提议的 rubik 状态稍微更紧凑,速度相当快,并且具有良好的碰撞特性 - 这不像你有一个恶意对手试图引起 rubik 立方体哈希碰撞;-)。
编辑:一旦您拥有实现算法的库或函数(例如:OpenSSL、GNU TLS 和许多独立的实现),使用加密哈希函数(例如 MD4/MD5)通常很简单。通常该函数类似于
void md5(unsigned char *buf, size_t len, unsigned char *digest)
,其中digest
指向预先分配的 16 字节缓冲区,buf
是要散列的数据(你的魔方结构)。以下是一些未经测试的 C 代码:并确保使用
-lssl
进行编译/链接。1) Don't Use A Hash
You have
9*6 = 54
separate faces on a rubik cube. Even wastefully using 1 byte per face this is 432 bits, so hashing won't save you too much space. A better packing of 3 bits per face comes to 162 bits (21 bytes). It sounds to me like you need a compact way to represent the rubik.OTOH, if you are looking to store a set of many many previously-visited states then I've found that using a bloom filter instead of a true set gets me decent results (but often non-optimal) with much lower space utilization.
2) If you are married to the idea of a hash:
Just use MD5, its slightly more compact than the proposed rubik states, rather fast, and has good collision properties - it's not like you have a malicious adversary trying to cause rubik cube hash collisions ;-).
EDIT: Using cryptographic hash functions, such as MD4/MD5, is usually simple once you have a library or function implementing the algorithm (ex: OpenSSL, GNU TLS, and many stand-alone implementations exist). Usually the function is something like
void md5(unsigned char *buf, size_t len, unsigned char *digest)
wheredigest
points to a pre-allocated 16 byte buffer andbuf
is the data to be hashed (your rubik cube structure). Here is some untested C code:And be sure to compile/link with
-lssl
.8 个角立方体:
8 中! = 40320
排列,40320
可以用 16 位表示。每个角锥可以正确定向或顺时针或逆时针旋转120°以处于三个不同的位置(分别表示为0、1和2)。
因此,如果您想解码排列,角立方体可以用 28 位表示;如果您想直接记录 8 个角中的 7 个角的位置,则可以用 33 位表示。
12 个边立方体:
12! = 479001600
边的排列可以存储在 29 位中。每条边可以正确定向或翻转:
因此,如果您想解码排列,边立方体可以用 40 位表示;如果您想记录 12 个边中的 11 个边的所有位置和翻转,则可以用 55 位表示。
6 个中心立方体
您不需要记录有关中心立方体的任何信息 - 它们相对于魔方中心的球是固定的(因此假设您不担心任何徽标的方向立方体上)是不动的。
总计:
8 corner cubes:
8! = 40320
permutations and40320
can be represented in 16 bits.Each corner cube can be orientated correctly or be rotated 120° clockwise or anti-clockwise to be in three different positions (represented as 0, 1 and 2 respectively).
So the corners cubes can be represented in 28 bits, if you want to decode the permutations, or in 33 bits, if you want to directly record the positions of 7-of-8 corners.
12 edge cubes:
12! = 479001600
permutations of the edges can be stored in 29 bits.Each edge can be either be oriented correctly or flipped:
So edge cubes can be represented in 40 bits, if you want to decode the permutations, or in 55 bits if you want to record all the positions and flips of 11-of-12 edges.
6 centre cubes
You do not need to record any information about the centre cubes - they are fixed relative to the ball at the centre of the Rubik's cube (so assuming you are not worried about the orientation of any logos on the cube) are immobile.
Total:
只是为了建立理论上的最小表示 - 有效魔方的状态空间约为
4.3*10^19
。 Log2(4.3*10^19) 将确定需要多少位来表示整个空间,上限为66。因此,理论上,如果您可以对每个有效状态进行编号,则任何给定状态都可以用 66 位唯一表示。虽然您可能想遵循其他人的建议并找到一种更紧凑的方式来表示立方体,但请考虑用边缘、角和面块来表示状态。由于合法立方体移动的交换法则,您应该能够连接 12 个 4 位边缘位置、8 个 3 位角位置和 6 个 3 位面位置的序列。这应该会产生使用 90 位的唯一表示。
这种表示可能不利于您创建树的方式,但它是唯一的,易于比较,并且应该可以在现有表示中找到给定的状态。
Just to establish the theoretical minimum representation - the state space of a valid Rubik's cube is about
4.3*10^19
. Log2(4.3*10^19) will then determine how many bits you need to represent that full space, the ceiling of which is 66. So in theory, if you could number every valid state, any given state could be uniquely represented in 66 bits.While you may want to follow others' advice and find a more compact way of representing the cube, consider representing the state in terms of edge, corner, and face pieces. Due to the swapping laws of legal cube moves, you should be able to concatenate a sequence of 12 4-bit edge locations, 8 3-bit corner locations, and 6 3-bit face locations. This should result in a unique representation using 90 bits.
This representation may not be conducive to the way you are creating your tree, but it is unique, easily comparable, and should be possible to find given a state in your existing representation.