Nearest neighbors in high-dimensional data?

I asked a question a few days ago about how to find the nearest neighbors of a given vector. My vectors are now 21-dimensional and, before I proceed further, because I come from neither a machine-learning nor a math background, I am beginning to ask myself some fundamental questions:

  • Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
  • In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
  • Previously, it was suggested that I use kd-trees, but the Wikipedia page clearly says that for high dimensions, the kd-tree is almost equivalent to a brute-force search. In that case, what is the best way to efficiently find nearest neighbors in a million-point dataset?

Can someone please clarify some (or all) of the above questions?

绮烟 2024-11-09 22:57:07

I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.

You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you said, kd-tree works poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].

Among ANN algorithms proposed recently, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin.

LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.

Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.

Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
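To make that recipe concrete, here is a minimal sketch of a single LSH table built from Gaussian random projections (the p-stable scheme of [1]); NumPy is assumed, and the parameters num_hashes and bucket_width are illustrative. A practical index would use several such tables to boost recall.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_lsh_table(points, num_hashes=8, bucket_width=4.0):
    # Gaussian (2-stable) projection directions plus random offsets:
    # points that are close in Euclidean distance tend to share a bucket.
    dim = points.shape[1]
    A = rng.normal(size=(num_hashes, dim))
    b = rng.uniform(0.0, bucket_width, size=num_hashes)
    table = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(np.floor((A @ p + b) / bucket_width).astype(int))
        table[key].append(idx)
    return table, A, b

def query_lsh(table, A, b, q, bucket_width=4.0):
    # Return candidate indices from the query's bucket; re-rank them
    # by true distance to get the (approximate) nearest neighbors.
    key = tuple(np.floor((A @ q + b) / bucket_width).astype(int))
    return table.get(key, [])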

A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].

LSH has been applied seemingly everywhere. You may want to give it a try.


[1] Datar, Indyk, Immorlica, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.

[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.

[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.

[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors," 2008.

守不住的情 2024-11-09 22:57:07

I. The Distance Metric

First, the number of features (columns) in a data set is not a factor in selecting a distance metric for use in kNN. There are quite a few published studies directed to precisely this question, and the usual bases for comparison are:

  • the underlying statistical
    distribution of your data;

  • the relationship among the features
    that comprise your data (are they
    independent--i.e., what does the
    covariance matrix look like); and

  • the coordinate space from which your
    data was obtained.

If you have no prior knowledge of the distribution(s) from which your data was sampled, at least one (well documented and thorough) study concludes that Euclidean distance is the best choice.

The Euclidean metric is used in mega-scale web recommendation engines as well as in current academic research. Distances calculated with it have intuitive meaning, and the computation scales: Euclidean distance is calculated the same way whether the two points are in two-dimensional or twenty-two-dimensional space.

It has only failed for me a few times, and in each of those cases Euclidean distance failed because the underlying (Cartesian) coordinate system was a poor choice. You will usually recognize this because, for instance, path lengths (distances) are no longer additive. For example, when the metric space is a chessboard, Manhattan distance is better than Euclidean; likewise, when the metric space is the Earth and your distances are trans-continental flights, a distance metric suitable for a polar coordinate system is a good idea (e.g., London to Vienna is 2.5 hours and Vienna to St. Petersburg is another 3 hours, more or less in the same direction, yet London to St. Petersburg isn't 5.5 hours; instead, it is a little over 3 hours).

But apart from those cases in which your data belongs in a non-Cartesian coordinate system, the choice of distance metric is usually not material. (See this blog post from a CS student comparing several distance metrics by examining their effect on a kNN classifier: chi-square gave the best results, but the differences were not large. A more comprehensive study is in the academic paper Comparative Study of Distance Functions for Nearest Neighbors; Mahalanobis, essentially Euclidean normalized to account for dimension covariance, was the best in that study.)

One important proviso: for distance metric calculations to be meaningful, you must re-scale your data; it is rarely possible to build a kNN model that generates accurate predictions without doing this. For instance, if you are building a kNN model to predict athletic performance, and your explanatory variables are height (cm), weight (kg), bodyfat (%), and resting pulse (beats per minute), then a typical data point might look something like this: [ 180.4, 66.1, 11.3, 71 ]. Clearly the distance calculation will be dominated by height, while the contribution of bodyfat % will be almost negligible. Put another way, if the data were reported differently, so that bodyweight was in grams rather than kilograms, then the original value of 66.1 would become 66,100, which would have a large effect on your results, which is exactly what you don't want. Probably the most common scaling technique is subtracting the mean and dividing by the standard deviation (mean and sd are calculated separately for each column, or feature, in the data set; X refers to an individual entry/cell within a data row):

X_new = (X_old - mu) / sigma
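As a minimal sketch of that rescaling step (the data matrix is hypothetical, reusing the athletic-performance example; NumPy assumed):

import numpy as np

# Hypothetical training matrix: rows are athletes, columns are
# height (cm), weight (kg), bodyfat (%), resting pulse (bpm).
X_old = np.array([[180.4, 66.1, 11.3, 71.0],
                  [165.2, 58.7, 18.9, 64.0],
                  [192.0, 88.4,  9.8, 58.0]])

mu = X_old.mean(axis=0)        # per-feature mean
sigma = X_old.std(axis=0)      # per-feature standard deviation
X_new = (X_old - mu) / sigma   # every feature now contributes on a comparable scale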

II. The Data Structure

If you are concerned about the performance of the kd-tree structure, a Voronoi tessellation is a conceptually simple container that will drastically improve performance and scales better than kd-trees.

This is not the most common way to persist kNN training data, though the application of VT for this purpose, as well as the consequent performance advantages, is well documented (see, e.g., this Microsoft Research report). The practical significance is that, provided you are using a 'mainstream' language (e.g., one in the TIOBE Index), you ought to find a library to perform VT. I know that in Python and R there are multiple options for each language (e.g., the voronoi package for R, available on CRAN).

Using a VT for kNN works like this:

From your data, randomly select w points; these are your Voronoi centers. A Voronoi cell encapsulates all neighboring points that are nearest to its center. Imagine assigning a different color to each Voronoi center, so that each point assigned to a given center is painted that color. As long as you have sufficient density, doing this will nicely show the boundaries of each Voronoi cell (as the boundary that separates two colors).

How do you select the Voronoi centers? I use two orthogonal guidelines. After randomly selecting the w points, calculate the VT for your training data. Next, check the number of data points assigned to each Voronoi center; these values should be about the same (given uniform point density across your data space). In two dimensions, this would produce a VT with tiles of the same size. That's the first rule; here's the second: select w by iteration. Run your kNN algorithm with w as a variable parameter, and measure performance (time required to return a prediction by querying the VT).

So imagine you have one million data points. If the points were persisted in an ordinary 2D data structure, or in a kd-tree, you would perform on average a couple of million distance calculations for each new data point whose response variable you wish to predict. Of course, those calculations are performed on a single data set. With a VT, the nearest-neighbor search is performed in two steps, one after the other, against two different populations of data: first against the Voronoi centers, then, once the nearest center is found, the points inside the cell corresponding to that center are searched to find the actual nearest neighbor (by successive distance calculations). Combined, these two look-ups are much faster than a single brute-force look-up. That's easy to see: for 1M data points, suppose you select 250 Voronoi centers to tessellate your data space. On average, each Voronoi cell will have 4,000 data points. So instead of performing on average 500,000 distance calculations (brute force), you perform far fewer, on average just 125 + 2,000.
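A minimal sketch of that two-step lookup (NumPy assumed; centers, cell_members, and data are hypothetical precomputed structures). Strictly speaking, the true nearest neighbor can lie in an adjacent cell, so this simple version is approximate unless neighboring cells are also checked.

import numpy as np

def two_step_nn(centers, cell_members, data, q):
    # Step 1: find the Voronoi center nearest to the query point q.
    c = int(np.argmin(np.linalg.norm(centers - q, axis=1)))
    # Step 2: brute-force search only inside that center's cell.
    members = cell_members[c]                      # indices of the points in cell c
    d = np.linalg.norm(data[members] - q, axis=1)
    return members[int(np.argmin(d))]              # index of the (approximate) nearest neighbor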

III. Calculating the Result (the predicted response variable)

There are two steps to calculating the predicted value from a set of kNN training data. The first is identifying n, or the number of nearest neighbors to use for this calculation. The second is how to weight their contribution to the predicted value.

W/r/t the first component, you can determine the best value of n by solving an optimization problem (very similar to least squares optimization). That's the theory; in practice, most people just use n=3. In any event, it's simple to run your kNN algorithm over a set of test instances (to calculate predicted values) for n=1, n=2, n=3, etc. and plot the error as a function of n. If you just want a plausible value for n to get started, again, just use n = 3.

The second component is how to weight the contribution of each of the neighbors (assuming n > 1).

The simplest weighting technique is just multiplying each neighbor by a weighting coefficient of 1/(dist * K), i.e., the inverse of the distance from that neighbor to the test instance, often multiplied by some empirically derived constant K. I am not a fan of this technique because it often over-weights the closest neighbors (and concomitantly under-weights the more distant ones); the significance of this is that a given prediction can be almost entirely dependent on a single neighbor, which in turn increases the algorithm's sensitivity to noise.

A much better weighting function, which substantially avoids this limitation, is the Gaussian function, which in Python looks like this:

import math

def weight_gauss(dist, sig=2.0):
    return math.e**(-dist**2 / (2*sig**2))

To calculate a predicted value using your kNN code, you would identify the n nearest neighbors of the data point whose response variable you wish to predict (the 'test instance'), then call the weight_gauss function once for each of the n neighbors, passing in the distance between that neighbor and the test point. The function returns the weight for each neighbor, which is then used as that neighbor's coefficient in the weighted-average calculation.
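For illustration, a minimal sketch of that weighted-average step, reusing the weight_gauss function above (the distances and response values passed in are hypothetical):

def knn_predict(distances, values, sig=2.0):
    # Weighted average of the neighbors' response values
    weights = [weight_gauss(d, sig) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# e.g., three neighbors at distances 0.5, 1.2, 2.0 with responses 10, 12, 18
print(knn_predict([0.5, 1.2, 2.0], [10, 12, 18]))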

Bonjour°[大白 2024-11-09 22:57:07

What you are facing is known as the curse of dimensionality. It is sometimes useful to run an algorithm like PCA or ICA to make sure that you really need all 21 dimensions, and possibly to find a linear transformation that would allow you to use fewer than 21 with approximately the same result quality.

Update:
I encountered them in a book called Biomedical Signal Processing by Rangayyan (I hope I remember it correctly). ICA is not a trivial technique, but it was developed by researchers in Finland, and I think Matlab code for it is publicly available for download. PCA is a more widely used technique, and I believe you should be able to find an R or other software implementation of it. PCA is performed by solving linear equations iteratively. I did it too long ago to remember how. =)

The idea is that you break up your signals into independent eigenvectors (discrete eigenfunctions, really) and their eigenvalues, 21 in your case. Each eigenvalue shows the amount of contribution each eigenfunction provides to each of your measurements. If an eigenvalue is tiny, you can very closely represent the signals without using its corresponding eigenfunction at all, and that's how you get rid of a dimension.
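As a hedged sketch of that dimensionality check with scikit-learn (X here is a placeholder for your real (n_samples, 21) data matrix, and keeping 10 components is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 21)   # placeholder for your 21-dimensional data

pca = PCA().fit(X)
# Cumulative fraction of variance retained by the first k components;
# if it flattens out early, you don't need all 21 dimensions.
print(np.cumsum(pca.explained_variance_ratio_))

X_reduced = PCA(n_components=10).fit_transform(X)   # e.g., keep 10 components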

我三岁 2024-11-09 22:57:07

The top answers are good but old, so I'd like to add a 2016 answer.


As said, in a high-dimensional space the curse of dimensionality lurks around the corner, making traditional approaches, such as the popular k-d tree, as slow as a brute-force approach. As a result, we turn our interest to Approximate Nearest Neighbor Search (ANNS), which gives up some accuracy to speed up the process. You get a good approximation of the exact NN, with good probability.


Hot topics that might be worthy:

  1. Modern approaches of LSH, such as Razenshteyn's.
  2. RKD forest: Forest(s) of Randomized k-d trees (RKD), as described in FLANN,
    or in a more recent approach I was part of, kd-GeRaF.
  3. LOPQ, which stands for Locally Optimized Product Quantization, as described here. It is very similar to the new Babenko and Lempitsky approach.

You can also check my relevant answers:

  1. Two sets of high dimensional points: Find the nearest neighbour in the other set
  2. Comparison of the runtime of Nearest Neighbor queries on different data structures
  3. PCL kd-tree implementation extremely slow

丑疤怪 2024-11-09 22:57:07

To answer your questions one by one:

  • No, Euclidean distance is a bad metric in high-dimensional space. Basically, in high dimensions all pairwise distances tend to become large and similar, which shrinks the relative difference between a given data point's distance to its nearest and its farthest neighbour.
  • There is a lot of published research on high-dimensional data, but most of it requires a lot of mathematical sophistication.
  • A KD tree is bad for high-dimensional data ... avoid it by all means.

Here is a nice paper to get you started in the right direction: "When Is Nearest Neighbor Meaningful?" by Beyer et al.

I work with text data of dimensionality 20K and above. If you want some text-related advice, I might be able to help you out.

蓝咒 2024-11-09 22:57:07

Cosine similarity is a common way to compare high-dimensional vectors. Note that since it is a similarity, not a distance, you'd want to maximize it, not minimize it. You can also use a domain-specific way to compare the data; for example, if your data consists of DNA sequences, you could use a sequence similarity that takes into account probabilities of mutations, etc.
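For reference, a minimal sketch of cosine similarity between two vectors (NumPy assumed):

import numpy as np

def cosine_similarity(a, b):
    # Ranges from -1 to 1; larger means more similar, so maximize rather than minimize.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))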

The number of nearest neighbors to use varies depending on the type of data, how much noise there is, etc. There are no general rules, you just have to find what works best for your specific data and problem by trying all values within a range. People have an intuitive understanding that the more data there is, the fewer neighbors you need. In a hypothetical situation where you have all possible data, you only need to look for the single nearest neighbor to classify.

The k Nearest Neighbor method is known to be computationally expensive. It's one of the main reasons people turn to other algorithms like support vector machines.

佞臣 2024-11-09 22:57:07

kd-trees indeed won't work very well on high-dimensional data, because the pruning step no longer helps much: the closest edge (a one-dimensional deviation) will almost always be smaller than the full-dimensional deviation to the known nearest neighbors.

Furthermore, as far as I know kd-trees only work well with Lp norms, and there is the distance-concentration effect that makes distance-based algorithms degrade with increasing dimensionality.

For further information, you may want to read up on the curse of dimensionality and its various variants (there is more than one side to it!).

I'm not convinced there is much use in just blindly approximating Euclidean nearest neighbors, e.g., using LSH or random projections. It may be necessary to use a much more finely tuned distance function in the first place!

奢华的一滴泪 2024-11-09 22:57:07

A lot depends on why you want to know the nearest neighbors. You might look into the mean shift algorithm http://en.wikipedia.org/wiki/Mean-shift if what you really want is to find the modes of your data set.

我的黑色迷你裙 2024-11-09 22:57:07

I think cosine similarity on the tf-idf of boolean features would work well for most problems. That's because it is a time-proven heuristic used in many search engines such as Lucene. In my experience, Euclidean distance shows bad results for any text-like data. Selecting different weights and k-examples can be done with training data and brute-force parameter selection.
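A hedged sketch of that heuristic with scikit-learn (the toy corpus is purely illustrative; binary=True makes the term counts boolean before tf-idf weighting):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the quick brown fox", "a quick brown dog", "completely unrelated text"]

vec = TfidfVectorizer(binary=True)     # boolean term occurrences, tf-idf weighted
X = vec.fit_transform(docs)
print(cosine_similarity(X[0], X[1:]))  # similarity of doc 0 to the other docs (higher = closer)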

墨洒年华 2024-11-09 22:57:07

iDistance is probably the best for exact kNN retrieval in high-dimensional data. You can view it as an approximate Voronoi tessellation.

电影里的梦 2024-11-09 22:57:07

I've experienced the same problem and can say the following.

  1. Euclidean distance is a good distance metric, but it is computationally more expensive than the Manhattan distance and sometimes yields slightly poorer results; thus, I'd choose the latter.

  2. The value of k can be found empirically. You can try different values and check the resulting ROC curves or some other precision/recall measure in order to find an acceptable value (a minimal sketch of this follows the list).

  3. Both Euclidean and Manhattan distances respect the triangle inequality, so you can use them in metric trees. Indeed, kd-trees have their performance severely degraded when the data has more than 10 dimensions (I've experienced that problem myself). I found VP-trees to be a better option.
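As a hedged sketch of that empirical search for k, assuming a binary classification problem with scikit-learn (X and y are placeholders for your features and labels):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 21)             # placeholder feature matrix
y = np.random.randint(0, 2, size=200)   # placeholder binary labels

for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k, metric='manhattan')
    auc = cross_val_score(clf, X, y, scoring='roc_auc', cv=5).mean()
    print(f"k={k}: mean ROC AUC = {auc:.3f}")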

涙—继续流 2024-11-09 22:57:07

KD trees work fine for 21 dimensions, if you quit early, after looking at say 5% of all the points. FLANN does this (and other speedups) to match 128-dim SIFT vectors. (Unfortunately FLANN does only the Euclidean metric, and the fast and solid scipy.spatial.cKDTree does only Lp metrics; these may or may not be adequate for your data.) There is of course a speed-accuracy tradeoff here.

(If you could describe your Ndata, Nquery, and data distribution, that might help people to try similar data.)
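For orientation, a minimal sketch of an approximate cKDTree query in SciPy (the sizes and eps value are illustrative; eps > 0 allows the k-th returned neighbor to be up to a factor (1 + eps) farther than the true k-th nearest):

import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(1_000_000, 21)   # hypothetical: 1M uniform points in 21 dimensions
tree = cKDTree(data, leafsize=10)

query = np.random.rand(21)
dist, idx = tree.query(query, k=2, eps=0.1)   # two nearest neighbors, approximately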

Added 26 April, run times for cKDTree with cutoff on my old mac ppc, to give a very rough idea of feasibility:

kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=1000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.1 % of the 1000000 points, 0.31 % of 188315 boxes; better 0.0042 0.014 0.1 %
3.5 sec to query 1000 points
distances to 2 nearest: av 0.131  max 0.253

kdstats.py p=2 dim=21 N=1000000 nask=1000 nnear=2 cutoff=5000 eps=0 leafsize=10 clustype=uniformp
14 sec to build KDtree of 1000000 points
kdtree: 1000 queries looked at av 0.48 % of the 1000000 points, 1.1 % of 188315 boxes; better 0.0071 0.026 0.5 %
15 sec to query 1000 points
distances to 2 nearest: av 0.131  max 0.245

清浅ˋ旧时光 2024-11-09 22:57:07

You could try a z-order curve. It's easy for 3 dimensions.
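A hedged sketch of the idea for the easy 3-dimensional case (integer coordinates assumed; sorting points by this Morton code tends to keep nearby points near each other in the sorted order, which can then be range-scanned for candidates):

def morton3(x, y, z, bits=10):
    # Interleave the bits of three non-negative integer coordinates
    # into a single z-order (Morton) code.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# e.g., sort a list of (x, y, z) points by their Morton code
points = [(5, 9, 2), (5, 8, 2), (63, 0, 63)]
points.sort(key=lambda p: morton3(*p))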

栖竹 2024-11-09 22:57:07

I had a similar question a while back. For fast approximate nearest neighbor search you can use the annoy library from Spotify: https://github.com/spotify/annoy

This is some example code for the Python API, which is optimized in C++.

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

They provide different distance measures. Which distance measure you want to apply depends highly on your individual problem. Also consider pre-scaling (i.e., weighting) certain dimensions for importance first. Those dimension or feature-importance weights might be calculated by something like entropy loss or, if you have a supervised learning problem, Gini impurity gain or mean average loss, where you check how much worse your machine-learning model performs if you scramble that dimension's values.

Often the direction of a vector is more important than its absolute value. For example, in the semantic analysis of text documents we want document vectors to be close when their semantics are similar, not when their lengths are. Thus we can either normalize those vectors to unit length or use angular distance (i.e., cosine similarity) as the distance measure.

Hope this is helpful.

腹黑女流氓 2024-11-09 22:57:07

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?

I would suggest soft subspace clustering, a pretty common approach nowadays, in which feature weights are calculated to find the most relevant dimensions. You can use these weights when using Euclidean distance, for example. See the curse of dimensionality for common problems; this article may also shed some light:

A k-means type clustering algorithm for subspace clustering of mixed numeric and
categorical datasets
