Nearest neighbors in high-dimensional data?

I asked a question a few days ago about how to find the nearest neighbors of a given vector. My vectors are now 21-dimensional and, before I proceed further, because I come from neither a machine-learning nor a math background, I am beginning to ask myself some fundamental questions:

  • Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
  • In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
  • Previously, it was suggested that I use kd-trees, but the Wikipedia page clearly says that for high dimensions, the kd-tree is almost equivalent to a brute-force search. In that case, what is the best way to efficiently find nearest neighbors in a million-point dataset?

Can someone please clarify some (or all) of the above questions?

绮烟 2024-11-09 22:57:07

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.

You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].

Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.

LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.

Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.

Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
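To make that recipe concrete, here is a minimal sketch of a single LSH table built from Gaussian random projections (the p-stable scheme of [1]); NumPy is assumed, and the parameters num_hashes and bucket_width are illustrative. A practical index would use several such tables to boost recall.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_lsh_table(points, num_hashes=8, bucket_width=4.0):
    # Gaussian (2-stable) projection directions plus random offsets:
    # points that are close in Euclidean distance tend to share a bucket.
    dim = points.shape[1]
    A = rng.normal(size=(num_hashes, dim))
    b = rng.uniform(0.0, bucket_width, size=num_hashes)
    table = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(np.floor((A @ p + b) / bucket_width).astype(int))
        table[key].append(idx)
    return table, A, b

def query_lsh(table, A, b, q, bucket_width=4.0):
    # Return candidate indices from the query's bucket; re-rank them
    # by true distance to get the (approximate) nearest neighbors.
    key = tuple(np.floor((A @ q + b) / bucket_width).astype(int))
    return table.get(key, [])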

A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].

LSH has been applied seemingly everywhere. You may want to give it a try.


[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.

[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.

[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.

[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors," 2008.

守不住的情 2024-11-09 22:57:07

I. The Distance Metric

First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:

  • the underlying statistical
    distribution of your data;

  • the relationship among the features
    that comprise your data (are they
    independent--i.e., what does the
    covariance matrix look like); and

  • the coordinate space from which your
    data was obtained.

If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.

The Euclidean metric is used in mega-scale web recommendation engines as well as in current academic research. Distances calculated with it have intuitive meaning, and the computation scales: Euclidean distance is calculated the same way whether the two points are in two-dimensional or twenty-two-dimensional space.

It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (Cartesian) coordinate system was a poor choice. You will usually recognize this because, for instance, path lengths (distances) are no longer additive. For example, when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise, when the metric space is the Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours and Vienna to St. Petersburg is another 3 hours, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours; instead, it is a little over 3 hours).

But apart from those cases in which your data belongs in a non-Cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student comparing several distance metrics by examining their effect on a kNN classifier: chi-square gave the best results, but the differences were not large. A more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors; Mahalanobis, essentially Euclidean normalized to account for dimension covariance, was the best in that study.)

One important proviso: for distance metric calculations to be meaningful, you must re-scale your data; it is rarely possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your explanatory variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution of bodyfat % will be almost negligible. Put another way, if the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would become 66,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):

X_new = (X_old - mu) / sigma
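As a minimal sketch of that rescaling step (the data matrix is hypothetical, reusing the athletic-performance example; NumPy assumed):

import numpy as np

# Hypothetical training matrix: rows are athletes, columns are
# height (cm), weight (kg), bodyfat (%), resting pulse (bpm).
X_old = np.array([[180.4, 66.1, 11.3, 71.0],
                  [165.2, 58.7, 18.9, 64.0],
                  [192.0, 88.4,  9.8, 58.0]])

mu = X_old.mean(axis=0)        # per-feature mean
sigma = X_old.std(axis=0)      # per-feature standard deviation
X_new = (X_old - mu) / sigma   # every feature now contributes on a comparable scale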

II. The Data Structure

If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation is a conceptually simple container that will drastically improve performance and scales better than kd-trees.

This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, is well documented (see, e.g., this Microsoft Research report). The practical significance is that, provided you are using a 'mainstream' language (e.g., one in the TIOBE Index), you ought to find a library to perform VT. I know that in Python and R there are multiple options for each language (e.g., the voronoi package for R, available on CRAN).

Using a VT for kNN works like this:

From your data, randomly select w points; these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to its center. Imagine assigning a different color to each Voronoi center, so that each point assigned to a given center is painted that color. As long as you have sufficient density, doing this will nicely show the boundaries of each Voronoi cell (as the boundary that separates two colors).

How do you select the Voronoi centers? I use two orthogonal guidelines. After randomly selecting the w points, calculate the VT for your training data. Next, check the number of data points assigned to each Voronoi center; these values should be about the same (given uniform point density across your data space). In two dimensions, this would produce a VT with tiles of the same size. That's the first rule; here's the second: select w by iteration. Run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).

So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple of million distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data: first against the Voronoi centers, then, once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far fewer, on average just 125 + 2,000.
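A minimal sketch of that two-step lookup (NumPy assumed; centers, cell_members, and data are hypothetical precomputed structures). Strictly speaking, the true nearest neighbor can lie in an adjacent cell, so this simple version is approximate unless neighboring cells are also checked.

import numpy as np

def two_step_nn(centers, cell_members, data, q):
    # Step 1: find the Voronoi center nearest to the query point q.
    c = int(np.argmin(np.linalg.norm(centers - q, axis=1)))
    # Step 2: brute-force search only inside that center's cell.
    members = cell_members[c]                      # indices of the points in cell c
    d = np.linalg.norm(data[members] - q, axis=1)
    return members[int(np.argmin(d))]              # index of the (approximate) nearest neighbor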

III. Calculating the Result (the predicted response variable)

There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.

W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.

The second component is how to weight the contribution of each of the neighbors (assuming n > 1).

The simplest weighting technique is just multiplying each neighbor by a weighting coefficient of 1/(dist * K), i.e., the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.

A much better weighting function, which substantially avoids this limitation, is the Gaussian function, which in Python looks like this:

import math

def weight_gauss(dist, sig=2.0):
    return math.e**(-dist**2 / (2*sig**2))

To calculate a predicted value using your kNN code, you would identify the n nearest neighbors of the data point whose response variable you wish to predict (the 'test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between that neighbor and the test point. The function returns the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted-average calculation.
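For illustration, a minimal sketch of that weighted-average step, reusing the weight_gauss function above (the distances and response values passed in are hypothetical):

def knn_predict(distances, values, sig=2.0):
    # Weighted average of the neighbors' response values
    weights = [weight_gauss(d, sig) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# e.g., three neighbors at distances 0.5, 1.2, 2.0 with responses 10, 12, 18
print(knn_predict([0.5, 1.2, 2.0], [10, 12, 18]))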

Bonjour°[大白 2024-11-09 22:57:07

What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions, and possibly to find a linear transformation that would allow you to use fewer than 21 with approximately the same result quality.

Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland, and I think Matlab code for it is publicly available for download. PCA is a more widely used technique, and I believe you should be able to find an R or other software implementation of it. PCA is performed by solving linear equations iteratively. I did it too long ago to remember how. =)

The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
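As a hedged sketch of that dimensionality check with scikit-learn (X here is a placeholder for your real (n_samples, 21) data matrix, and keeping 10 components is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 21)   # placeholder for your 21-dimensional data

pca = PCA().fit(X)
# Cumulative fraction of variance retained by the first k components;
# if it flattens out early, you don't need all 21 dimensions.
print(np.cumsum(pca.explained_variance_ratio_))

X_reduced = PCA(n_components=10).fit_transform(X)   # e.g., keep 10 components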

我三岁 2024-11-09 22:57:07

The top answers are good but old, so I'd like to add a 2016 answer.


As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute-force approach. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which gives up some accuracy to speed up the process. You get a good approximation of the exact NN, with good probability.


Hot topics that might be worthy:

  1. Modern approaches of LSH, such as Razenshteyn's.
  2. RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
    or in a more recent approach I was part of, kd-GeRaF.
  3. LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko and Lempitsky approach.

You can also check my relevant answers:

  1. Two sets of high dimensional points: Find the nearest neighbour in the other set
  2. Comparison of the runtime of Nearest Neighbor queries on different data structures
  3. PCL kd-tree implementation extremely slow

丑疤怪 2024-11-09 22:57:07

To answer your questions one by one:

  • No, Euclidean distance is a bad metric in high-dimensional space. Basically, in high dimensions all pairwise distances tend to become large and similar, which shrinks the relative difference between a given data point's distance to its nearest and its farthest neighbour.
  • There is a lot of published research on high-dimensional data, but most of it requires a lot of mathematical sophistication.
  • A KD tree is bad for high-dimensional data ... avoid it by all means.

Here is a nice paper to get you started in the right direction: "When Is Nearest Neighbor Meaningful?" by Beyer et al.

I work with text data of dimensionality 20K and above. If you want some text-related advice, I might be able to help you out.

蓝咒 2024-11-09 22:57:07

Cosine similarity is a common way to compare high-dimensional vectors. Note that since it is a similarity, not a distance, you'd want to maximize it, not minimize it. You can also use a domain-specific way to compare the data; for example, if your data consists of DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
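For reference, a minimal sketch of cosine similarity between two vectors (NumPy assumed):

import numpy as np

def cosine_similarity(a, b):
    # Ranges from -1 to 1; larger means more similar, so maximize rather than minimize.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))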

The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.

The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.

佞臣 2024-11-09 22:57:07

kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps much: the closest edge (a one-dimensional deviation) will almost always be smaller than the full-dimensional deviation to the known nearest neighbors.

Furthermore, as far as I know kd-trees only work well with Lp norms, and there is the distance-concentration effect that makes distance-based algorithms degrade with increasing dimensionality.

For further information, you may want to read up on the curse of dimensionality and its various variants (there is more than one side to it!).

I'm not convinced there is much use in just blindly approximating Euclidean nearest neighbors, e.g., using LSH or random projections. It may be necessary to use a much more finely tuned distance function in the first place!

奢华的一滴泪 2024-11-09 22:57:07

A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.

我的黑色迷你裙 2024-11-09 22:57:07

I think cosine similarity on the tf-idf of boolean features would work well for most problems. That's because it is a time-proven heuristic used in many search engines such as Lucene. In my experience, Euclidean distance shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.
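A hedged sketch of that heuristic with scikit-learn (the toy corpus is purely illustrative; binary=True makes the term counts boolean before tf-idf weighting):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the quick brown fox", "a quick brown dog", "completely unrelated text"]

vec = TfidfVectorizer(binary=True)     # boolean term occurrences, tf-idf weighted
X = vec.fit_transform(docs)
print(cosine_similarity(X[0], X[1:]))  # similarity of doc 0 to the other docs (higher = closer)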

墨洒年华 2024-11-09 22:57:07

iDistance is probably the best for exact kNN retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.

电影里的梦 2024-11-09 22:57:07

I've experienced the same problem and can say the following.

  1. Euclidean distance is a good distance metric, but it is computationally more expensive than the Manhattan distance and sometimes yields slightly poorer results; thus, I'd choose the latter.

  2. The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value (a minimal sketch of this follows the list).

  3. Both Euclidean and Manhattan distances respect the triangle inequality, so you can use them in metric trees. Indeed, kd-trees have their performance severely degraded when the data has more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
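As a hedged sketch of that empirical search for k, assuming a binary classification problem with scikit-learn (X and y are placeholders for your features and labels):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 21)             # placeholder feature matrix
y = np.random.randint(0, 2, size=200)   # placeholder binary labels

for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k, metric='manhattan')
    auc = cross_val_score(clf, X, y, scoring='roc_auc', cv=5).mean()
    print(f"k={k}: mean ROC AUC = {auc:.3f}")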

涙—继续流 2024-11-09 22:57:07

KD trees work fine for 21 dimensions, if you quit early, after looking at say 5% of all the points. FLANN does this (and other speedups) to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric, and the fast and solid scipy.spatial.cKDTree does only Lp metrics; these may or may not be adequate for your data.) There is of course a speed-accuracy tradeoff here.

(If you could describe your Ndata, Nquery, and data distribution, that might help people to try similar data.)
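For orientation, a minimal sketch of an approximate cKDTree query in SciPy (the sizes and eps value are illustrative; eps > 0 allows the k-th returned neighbor to be up to a factor (1 + eps) farther than the true k-th nearest):

import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(1_000_000, 21)   # hypothetical: 1M uniform points in 21 dimensions
tree = cKDTree(data, leafsize=10)

query = np.random.rand(21)
dist, idx = tree.query(query, k=2, eps=0.1)   # two nearest neighbors, approximately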

Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:

kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131  max 0.253

kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131  max 0.245

清浅ˋ旧时光 2024-11-09 22:57:07

You could try a z-order curve. It's easy for 3 dimensions.
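A hedged sketch of the idea for the easy 3-dimensional case (integer coordinates assumed; sorting points by this Morton code tends to keep nearby points near each other in the sorted order, which can then be range-scanned for candidates):

def morton3(x, y, z, bits=10):
    # Interleave the bits of three non-negative integer coordinates
    # into a single z-order (Morton) code.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# e.g., sort a list of (x, y, z) points by their Morton code
points = [(5, 9, 2), (5, 8, 2), (63, 0, 63)]
points.sort(key=lambda p: morton3(*p))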

栖竹 2024-11-09 22:57:07

I had a similar question a while back. For fast approximate nearest neighbor search you can use the annoy library from Spotify: https://github.com/spotify/annoy

This is some example code for the Python API, which is optimized in C++.

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

They provide different distance measures. Which distance measure you want to apply depends highly on your individual problem. Also consider pre-scaling (i.e., weighting) certain dimensions for importance first. Those dimension or feature-importance weights might be calculated by something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine-learning model performs if you scramble that dimension's values.

Often the direction of a vector is more important than its absolute value. For example, in the semantic analysis of text documents we want document vectors to be close when their semantics are similar, not when their lengths are. Thus we can either normalize those vectors to unit length or use angular distance (i.e., cosine similarity) as the distance measure.

Hope this is helpful.

腹黑女流氓 2024-11-09 22:57:07

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?

I would suggest soft subspace clustering, a pretty common approach nowadays, in which feature weights are calculated to find the most relevant dimensions. You can use these weights when using Euclidean distance, for example. See the curse of dimensionality for common problems; this article may also shed some light:

A k-means type clustering algorithm for subspace clustering of mixed numeric and
categorical datasets
