About the curse of dimensionality
My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other.
The doubt I have is whether this means that calculating distances the usual way (Euclidean, for instance) is still valid or not. If it were still valid, it would mean that when comparing high-dimensional vectors, the two most similar ones wouldn't differ much in distance from a third one, even when that third one could be completely unrelated.
Is this correct? If so, how would you be able to tell whether you have a match or not?

Comments (2)
Basically the distance measurement is still correct; however, it becomes meaningless when you have noisy "real world" data.
The effect we are talking about here is that a large distance between two points in one dimension quickly gets overshadowed by the small distances in all the other dimensions. That's why, in the end, all points end up at roughly the same distance from each other. There is a good illustration of this:
Say we want to classify data based on their value in each dimension. We simply split each dimension (which has a range of 0..1) once: values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, a region such as "positive in every dimension" covers 12.5% of the space in 3 dimensions (0.5^3). In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.
So in each dimension we still allow half of the overall value range, which is quite a lot. But all of it ends up in 0.1% of the total space: the differences between these data points are huge in each dimension, but negligible over the whole space.
You can go further and cut only 10% of the range in each dimension, so you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions, and in 50 dimensions it is about 0.5%. So you see, wide ranges of values in each dimension get crammed into a very small portion of your search space.
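To make the arithmetic above concrete, here is a minimal sketch (my own addition, not from the original answer) that just evaluates f**d, the fraction of the unit hypercube you keep when you retain a fraction f of the range in each of d dimensions:

```python
# Fraction of the unit hypercube covered when a fraction f of the range
# is kept in each of d dimensions: simply f**d.
for f in (0.5, 0.9):          # keep 50% or 90% of the range per dimension
    for d in (3, 5, 10, 50):  # number of dimensions
        print(f"keep {f:.0%} per dimension, d={d:2d}: covers {f**d:.2%} of the space")
```

Running it reproduces the percentages quoted above.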
That's why you need dimensionality reduction, where you basically disregard differences on less informative axes.
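As a sketch of what that can look like in practice (my own illustration, assuming PCA as the reduction method and using made-up data; the answer itself doesn't prescribe a specific technique):

```python
# Project noisy 50-dimensional points onto the 2 axes where they actually vary,
# so that distances are computed only along the informative directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))  # signal lives on 2 hidden axes
X += 0.05 * rng.normal(size=(n, d))                     # small noise on all 50 axes

X_low = PCA(n_components=2).fit_transform(X)            # keep the 2 most informative axes
print(X_low.shape)                                      # (200, 2)
```

Nearest-neighbor search in X_low then compares points only where they genuinely differ, instead of letting the 48 noisy axes drown out the signal.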
Here is a simple explanation in layman's terms.
I have tried to illustrate this with the simple figure shown below.
Suppose you have some data features x1 and x2 (you can assume they are blood pressure and blood sugar levels) and you want to perform K-nearest-neighbor classification. If we plot the data in 2D, we can easily see that the data group together nicely; each point has some close neighbors that we can use for our calculations.
Now let's say we decide to consider a new third feature x3 (say age) for our analysis.
Case (b) shows a situation where all of our previous data come from people of the same age. You can see that they are all located at the same level along the age (x3) axis.
Now we can quickly see that if we want to consider age in our classification, there is a lot of empty space along the age (x3) axis.
The data we currently have cover only a single value of age. What happens if we want to make a prediction for someone of a different age (the red dot)?
As you can see, there are not enough data points close to that point to calculate distances and find some neighbors. So, if we want good predictions with this new third feature, we have to go and gather more data from people of different ages to fill the empty space along the age axis.
Case (c) essentially shows the same concept. Here, assume our initial data were gathered from people of different ages (i.e., we did not care about age in our previous two-feature classification task, and might have assumed that this feature has no effect on our classification).
In this case, what happens to our relatively closely located 2D points if we plot them in 3D, with age as the third axis? We can see that they are now more distant from each other (more sparse) in the new, higher-dimensional space. As a result, finding neighbors becomes harder, since we don't have enough data for the different values along the new third feature.
You can imagine that as we add more dimensions, the data points move further and further apart. (In other words, we need more and more data if we want to avoid sparsity in our data.)
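Here is a small sketch of both effects (my own addition, using synthetic uniformly distributed points rather than the blood pressure / blood sugar / age example from the figure): with a fixed amount of data, the nearest neighbor of a query point drifts further away as dimensions are added, and its distance becomes almost indistinguishable from the distance to the farthest point, which is exactly why telling a "match" apart gets hard.

```python
# With a fixed number of points, watch the nearest neighbor move away and the
# nearest/farthest distances converge as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # fixed amount of data
for d in (2, 3, 10, 100, 1000):             # growing number of features
    X = rng.uniform(size=(n, d))            # data points in the unit hypercube
    q = rng.uniform(size=d)                 # a query point (like the red dot above)
    dist = np.linalg.norm(X - q, axis=1)    # Euclidean distances to every point
    print(f"d={d:4d}  nearest={dist.min():7.3f}  "
          f"farthest={dist.max():7.3f}  farthest/nearest={dist.max()/dist.min():6.2f}")
```

The distances themselves are still computed correctly; there is just less and less contrast between the closest and the most unrelated point.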