How to find the k nearest values in n-dimensional space?
I read about kd-trees, but they are inefficient when the dimensionality of the space is high. I have a database of values and I want to find the values that are within a certain Hamming distance of the query. For instance, the database is a list of 32-bit numbers and I want to find all numbers that differ from the query value by less than 3 bits.
I heard somewhere about MultiVariate Partition trees but couldn't find a good reference. I know that min-Hash gives a good approximation, but I'd like an exact answer.
Comments (1)
Hamming distance is closely related to Levenshtein distance, and this search is similar to the algorithms used for spelling correction.
A method that works is branch-and-bound search in a trie. Its running time is exponential in the distance when the distance is small, and grows to at most linear in the dictionary size.
If the dictionary is of binary words stored in a binary trie, with strict hamming distance, here is a simple pseudo-code:
The idea is you walk the entire trie, keeping track of the distance between the current trie node and the original word. You prune the search by having a budget of how much distance you will tolerate. This works because the distance can never decrease as you go deeper into the trie.
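The walk described above can be sketched in Python roughly as follows. This is only an illustration, not the answer's original pseudo-code: the names `insert` and `walk`, the list-based node layout, and the fixed word width `bits` are all assumptions made for the example.

```python
def insert(root, value, bits=32):
    """Add a fixed-width value to a binary trie (node = [child0, child1])."""
    node = root
    for i in range(bits - 1, -1, -1):      # most significant bit first
        b = (value >> i) & 1
        if node[b] is None:
            node[b] = [None, None]
        node = node[b]

def walk(node, query, budget, bits=32, depth=0, prefix=0, hits=None):
    """Collect stored values whose Hamming distance to query is <= budget."""
    if hits is None:
        hits = []
    if node is None or budget < 0:         # prune: no subtree, or budget spent
        return hits
    if depth == bits:                      # reached a full-length stored word
        hits.append(prefix)
        return hits
    qbit = (query >> (bits - 1 - depth)) & 1
    for b in (0, 1):
        # Taking the branch that disagrees with the query costs one unit of budget.
        walk(node[b], query, budget - (b != qbit), bits,
             depth + 1, (prefix << 1) | b, hits)
    return hits

root = [None, None]
for v in (0x00000000, 0x00000001, 0x00000003, 0x00000007, 0x80000001, 0xFFFFFFFF):
    insert(root, v)

# "Differ by less than 3 bits" corresponds to a budget of 2.
near = walk(root, 0x00000000, budget=2)
```

Because the budget only ever shrinks along a path, a subtree is abandoned as soon as its prefix already disagrees with the query in more than `budget` positions.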
Then you do this repeatedly with budgets starting at zero and increasing in steps until you print out the hits you want. Since each walk covers so many fewer nodes than the subsequent walk, it doesn't hurt that you're doing multiple walks. If k is fixed, you can simply start out with that as your budget.
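A minimal sketch of the repeated walks, assuming fixed-width words in a binary trie and assuming the goal is to collect at least `k` hits. The helper names `build_trie`, `within`, and `k_nearest` are hypothetical, chosen only for this example.

```python
def build_trie(values, bits):
    """Build a binary trie (node = [child0, child1]) over fixed-width values."""
    root = [None, None]
    for v in values:
        node = root
        for i in range(bits - 1, -1, -1):  # most significant bit first
            b = (v >> i) & 1
            if node[b] is None:
                node[b] = [None, None]
            node = node[b]
    return root

def within(node, query, budget, bits, depth=0, prefix=0):
    """Yield stored values at Hamming distance <= budget from query."""
    if node is None or budget < 0:         # prune exhausted or empty branches
        return
    if depth == bits:
        yield prefix
        return
    qbit = (query >> (bits - 1 - depth)) & 1
    for b in (0, 1):
        yield from within(node[b], query, budget - (b != qbit), bits,
                          depth + 1, (prefix << 1) | b)

def k_nearest(root, query, k, bits):
    """Widen the budget from zero until at least k hits are found."""
    hits = []
    for budget in range(bits + 1):
        hits = list(within(root, query, budget, bits))
        if len(hits) >= k:
            break
    # Order by actual distance; ties at the last budget are cut arbitrarily.
    hits.sort(key=lambda v: bin(v ^ query).count("1"))
    return hits[:k]

trie = build_trie([0b0000, 0b0001, 0b0011, 0b0111, 0b1111], bits=4)
closest = k_nearest(trie, 0b0000, k=3, bits=4)
```

If fewer than `k` values are stored, the loop runs out at `budget == bits` and the sketch simply returns everything it found.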