How do I efficiently find the k nearest neighbours in high-dimensional data?
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2, if that makes it easier).

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.

My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure that solves this exactly in reasonable time?
6 Answers
The Wikipedia article for kd-trees has a link to the ANN library.

As far as algorithms/data structures are concerned: I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
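ANN itself is a C++ library, so a direct sample would be C++; as a rough Python stand-in for "try a kd-tree directly first", here is a minimal sketch using scipy.spatial.cKDTree on data of this shape (the library choice and the random placeholder data are assumptions for illustration, not the ANN API):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.standard_normal((16000, 75))  # placeholder for the real data

tree = cKDTree(points)
# Ask for k=3 because the query set is the data itself: the closest hit
# is always the point itself at distance 0.
distances, indices = tree.query(points, k=3)
neighbours = indices[:, 1:]  # drop the self-match, keep the 2 nearest
```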
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which makes its search time comparable to brute-force search.

Dimensionality reduction is a good approach that offers a fair trade-off between accuracy and speed: you lose some information when you reduce the dimensions, but gain some speed. By accuracy I mean finding the exact Nearest Neighbour (NN).

Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
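As a minimal sketch of that step (assuming scikit-learn's PCA and random placeholder data; the component count is an arbitrary example you would tune):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
points = rng.standard_normal((16000, 75))  # placeholder for the real data

# Project onto the first principal components before running the NN search.
pca = PCA(n_components=10)                   # 10 is an example choice
reduced = pca.fit_transform(points)          # shape (16000, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept
```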
With Approximate Nearest Neighbour Search (ANNS), you are satisfied with finding a point that might not be the exact nearest neighbour, but a good approximation of it (for example, the point found might be the 4th NN to your query while you are looking for the 1st NN). That approach costs you accuracy, but increases performance significantly; moreover, the probability of finding a good NN (close enough to the query) is relatively high. You can read more about ANNS in the introduction of our kd-GeRaF paper.

A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbour problem in high dimensions. The key idea is that points that lie close to each other are hashed into the same bucket, so when a query arrives it is hashed to a bucket, and that bucket (and usually its neighbouring ones) contains good NN candidates.

FALCONN is a good C++ implementation that focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
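To make the bucket idea concrete, here is a toy sketch of random-hyperplane hashing, the scheme behind cosine-similarity LSH; the parameters and data are made up for illustration, and a real library such as FALCONN adds multiple hash tables and multi-probing:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
points = rng.standard_normal((16000, 75))   # placeholder for the real data

n_bits = 16                                  # hash length, an example choice
planes = rng.standard_normal((n_bits, 75))   # one random hyperplane per bit

def lsh_hash(p):
    # Each bit records which side of a random hyperplane the point falls on;
    # points at a small angle to each other agree on most bits, so they tend
    # to land in the same bucket.
    return tuple(planes @ p > 0)

buckets = defaultdict(list)
for i, p in enumerate(points):
    buckets[lsh_hash(p)].append(i)

def query(q, k=2):
    # Only score the candidates in the query's bucket. A real implementation
    # would also probe neighbouring buckets and use several independent tables.
    cand = np.array(buckets[lsh_hash(q)])
    dists = np.linalg.norm(points[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]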
You could conceivably use Morton codes, but with 75 dimensions they are going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
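For a sense of scale, here is a minimal sketch of building such a code by bit interleaving (the quantisation width is an assumed example); it shows why the codes get huge at 75 dimensions:

```python
import numpy as np

BITS = 10  # bits per coordinate, an example choice

def morton_code(point, lo, hi):
    # Quantise every coordinate to BITS bits, then interleave the bits of
    # all coordinates. With 75 dimensions the result is a 750-bit integer,
    # which Python's big ints handle but which is indeed huge.
    scaled = ((point - lo) / (hi - lo) * (2**BITS - 1)).astype(np.uint64)
    code = 0
    for bit in range(BITS):
        for dim, v in enumerate(scaled):
            code |= ((int(v) >> bit) & 1) << (bit * len(scaled) + dim)
    return code

rng = np.random.default_rng(0)
points = rng.standard_normal((16000, 75))  # placeholder for the real data
lo, hi = points.min(axis=0), points.max(axis=0)
order = sorted(range(len(points)), key=lambda i: morton_code(points[i], lo, hi))
# Candidate neighbours of a point are then the entries near it in this order.
```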
There is no reason to believe this is NP-complete. You're not really optimising anything, and I'd have a hard time figuring out how to convert this into another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate all n × n distances up front. Then for every observation you need to pick out the top k nearest neighbours. That's O(n²) for the distance calculation and O(n log n) per sort, but you have to do the sort n times (once for every observation). Messy, but still polynomial time to get your answers.
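A minimal NumPy sketch of that pipeline (with random placeholder data; note that at n = 16,000 the distance matrix alone is roughly 2 GB of float64):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((16000, 75))  # placeholder for the real data
k = 2

# All n x n squared Euclidean distances at once, via
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq = (points ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

# One O(n log n) sort per observation, n sorts in total.
knn = np.argsort(d2, axis=1)[:, :k]
```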
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees, however I don't know whether they have been generalised to high dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the array of nearest-neighbour distances that you have computed for each data point.

As sorting the entire array can be very expensive, you can use methods like indirect sorting, for example numpy.argpartition in the NumPy library, to extract only the closest K values you are interested in. There is no need to sort the entire array.

The cost of @Grembo's answer above should be reduced significantly by this, as you only need the K nearest values and there is no need to sort all the distances from each point. If you just need K neighbours, this method works very well, reducing your computational cost and time complexity. If you need the K neighbours sorted, sort the output again.

See the documentation for argpartition.
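A minimal sketch of that idea (placeholder data kept small so the full distance matrix stays cheap; the distance computation mirrors the brute-force answer above, and only the argpartition step is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 75))  # smaller placeholder data
k = 2

sq = (points ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

# argpartition places the k smallest entries of each row (in arbitrary
# order) first: O(n) per row instead of O(n log n) for a full sort.
knn = np.argpartition(d2, k, axis=1)[:, :k]

# If the k neighbours must come out sorted, sort just those k columns.
rows = np.arange(len(points))[:, None]
knn = knn[rows, np.argsort(d2[rows, knn], axis=1)]
```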