How to find the k nearest values in n-dimensional space?

Posted on 2024-08-24 06:33:29

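I have read about kd-trees, but they are inefficient when the dimensionality of the space is high. I have a database of values and I want to find the values that are within a certain Hamming distance of a query. For example, the database is a list of 32-bit numbers, and I want to find all numbers that differ from the query value in fewer than 3 bits.

I heard somewhere about multi-variate partition trees, but couldn't find a good reference. I know that min-Hash gives a good approximation, but I would like an exact answer.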
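To make the requirement concrete, here is a minimal brute-force sketch in Python. It is only a baseline for checking faster methods against, not a proposed solution; the names `hamming` and `scan` and the sample values are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def scan(database, query, max_bits):
    """Linear scan: keep every value within max_bits differing bits of query."""
    return [v for v in database if hamming(v, query) <= max_bits]


# "Fewer than 3 bits" means a maximum distance of 2.
database = [0x00000000, 0x00000003, 0x0000000F, 0xFFFFFFFF]  # hypothetical data
matches = scan(database, 0x00000001, 2)
```

This costs one XOR and one bit count per stored value, so it is exact but always linear in the size of the database.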


为人所爱 2024-08-31 06:33:29


Hamming distance is closely related to Levenshtein distance, and this problem is similar to the ones solved by spelling-correction algorithms.

A method that works is branch-and-bound search in a trie. For small distances the running time is exponential in the distance; as the distance grows, it levels off at being linear in the dictionary size.
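One way to see where that growth comes from: the search can only reach words that lie within the allowed distance of the query, and the number of such bit strings is a sum of binomial coefficients. The short Python sketch below computes that bound; the name `ball_size` is hypothetical.

```python
from math import comb


def ball_size(n_bits: int, max_dist: int) -> int:
    """Count the n_bits-long strings within max_dist bit flips of a fixed string."""
    return sum(comb(n_bits, j) for j in range(max_dist + 1))


# For the question's case: 32-bit words, at most 2 differing bits.
candidates = ball_size(32, 2)  # 1 + 32 + 496 = 529
```

The bound grows quickly with the distance, but the search can never report more words than the dictionary actually holds.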

If the dictionary consists of fixed-length binary words stored in a binary trie, and the metric is strict Hamming distance, here is some simple pseudocode:

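walk(trie, word, i, hit, budget){
  // stop on a missing branch, or once more bits differ than the budget allows
  if (trie==NULL || budget < 0) return;
  if (i==word.length){
    // every bit consumed: hit[0..i-1] spells a stored word within budget
    // (assumes each stored word ends at a non-NULL node at depth word.length)
    print hit;
    return;
  }
  hit[i] = 0;  // follow the 0 branch; costs 1 if word[i] is 1
  walk(trie.subtrie[0], word, i+1, hit, (word[i]==0 ? budget : budget-1));
  hit[i] = 1;  // follow the 1 branch; costs 1 if word[i] is 0
  walk(trie.subtrie[1], word, i+1, hit, (word[i]==1 ? budget : budget-1));
}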

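main(){
  // raise the budget one bit at a time; no word can differ in more than word.length bits
  for (int budget = 0; budget <= word.length; budget++){
    walk(trie, word, 0, hit, budget);
    /* quit if enough hits have been printed; each pass repeats the hits of the previous one */
  }
}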

The idea is to walk the trie, keeping track of the distance between the path to the current trie node and the corresponding prefix of the original word. You prune the search with a budget of how much distance you will tolerate. This works because the distance can never decrease as you go deeper into the trie.

Then you do this repeatedly, with budgets starting at zero and increasing in steps, until you have printed the hits you want. Since each walk covers far fewer nodes than the walk after it, the repeated walks add little to the total cost. If the maximum distance k is fixed, as in the question (fewer than 3 differing bits means a budget of 2), you can simply start with that as your budget.
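The pseudocode above can be made concrete roughly as follows. This is a minimal Python sketch, not the answerer's own code: the names `Node`, `insert`, `walk`, `within`, and `nearest`, the 8-bit default `WIDTH`, and the choice to store each value as a full-length path (most significant bit first) are all assumptions made for illustration.

```python
from typing import List, Optional

WIDTH = 8  # bits per word; small for demonstration, the question uses 32


class Node:
    """Binary trie node: child[0] and child[1] are subtries or None."""
    __slots__ = ("child",)

    def __init__(self) -> None:
        self.child: List[Optional["Node"]] = [None, None]


def insert(root: Node, value: int, width: int = WIDTH) -> None:
    """Store value as a root-to-leaf path, most significant bit first."""
    node = root
    for i in range(width - 1, -1, -1):
        bit = (value >> i) & 1
        if node.child[bit] is None:
            node.child[bit] = Node()
        node = node.child[bit]


def walk(node, query, depth, prefix, budget, out, width=WIDTH):
    """Append to out every stored value below node that stays within budget."""
    if node is None or budget < 0:
        return  # empty subtrie, or too many mismatched bits already
    if depth == width:
        out.append(prefix)  # full-length path, so prefix is a stored value
        return
    qbit = (query >> (width - 1 - depth)) & 1
    for bit in (0, 1):
        cost = 0 if bit == qbit else 1  # a mismatched bit spends one unit
        walk(node.child[bit], query, depth + 1, (prefix << 1) | bit,
             budget - cost, out, width)


def within(root, query, max_dist, width=WIDTH):
    """All stored values whose Hamming distance to query is <= max_dist."""
    out = []
    walk(root, query, 0, 0, max_dist, out, width)
    return out


def nearest(root, query, k, width=WIDTH):
    """Raise the budget from zero until at least k hits are found."""
    hits = []
    for budget in range(width + 1):
        hits = within(root, query, budget, width)
        if len(hits) >= k:
            break
    return hits
```

For the question's case, values would be inserted with `width=32`, and `within(root, query, 2, width=32)` would then return every stored number that differs from `query` in fewer than 3 bits. Note that `nearest` returns every hit at the final budget, which may be more than `k` values when several stored words tie at the same distance.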
