Finding the average distance between a set of points using NumPy
I have an array of points in a space of unknown dimension, for example:
data = numpy.array(
    [[115, 241, 314],
     [153, 413, 144],
     [535, 2986, 41445]])
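I would like to find the average Euclidean distance between all of the points.
Please note that I have more than 20,000 points, so I would like to do this as efficiently as possible.
Thanks.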
Comments (7)
If you have access to SciPy, you could try the following:
scipy.spatial.distance.cdist(data, data)
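Note that cdist(data, data) builds the full n x n matrix, so every pair appears twice alongside a zero diagonal. A minimal sketch, assuming SciPy is installed and that the average is meant over unordered pairs of distinct points, could use pdist instead, which returns each pair once:

```python
import numpy as np
from scipy.spatial.distance import pdist

data = np.array([[115, 241, 314],
                 [153, 413, 144],
                 [535, 2986, 41445]], dtype=float)

# pdist returns the condensed distance vector: one entry per unordered pair (i < j).
pair_distances = pdist(data, metric="euclidean")

# Average Euclidean distance over all n * (n - 1) / 2 pairs.
mean_distance = pair_distances.mean()
print(mean_distance)
```

For 20,000 points the condensed vector holds roughly 200 million float64 values (about 1.6 GB), so this is only practical if that much memory is available.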
Well, I don't think that there is a super fast way to do this, but this should do it:
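One straightforward NumPy version, offered as a sketch rather than as this answer's own code, loops over the points and vectorizes the distances from each point to all later ones. The function name mean_pairwise_distance is made up for illustration.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of distinct points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    total = 0.0
    for i in range(n - 1):
        # Distances from point i to every later point, so each pair is visited once.
        diffs = points[i + 1:] - points[i]
        total += np.sqrt((diffs ** 2).sum(axis=1)).sum()
    return total / (n * (n - 1) / 2)

data = np.array([[115, 241, 314],
                 [153, 413, 144],
                 [535, 2986, 41445]])
print(mean_pairwise_distance(data))
```

This keeps the extra memory at O(nd) per iteration instead of materializing an n x n matrix, at the cost of a Python-level loop over n rows.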
Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both can be computed in O(nd) time. With those, you should be able to find outliers (e.g. by excluding points that lie further from the mean than some multiple of the standard deviation), and that filtering step can also be performed in O(nd) time, for a total of O(nd).
You might be interested in a refresher on Chebyshev's inequality.
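The mean-and-variance approach might be sketched as follows. The function name find_outliers, the threshold k, and the random test data are all assumptions for illustration; here the "standard deviation" is taken to be the root-mean-square distance of the points from their mean.

```python
import numpy as np

def find_outliers(points, k=3.0):
    """Boolean mask of points lying more than k standard deviations from the mean."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)                      # sample mean, O(nd)
    sq_dist = ((points - center) ** 2).sum(axis=1)    # squared distance to the mean, O(nd)
    sigma = np.sqrt(sq_dist.mean())                   # root-mean-square distance to the mean
    return np.sqrt(sq_dist) > k * sigma

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # hypothetical data
points[0] = [50.0, 50.0, 50.0]        # one planted outlier
mask = find_outliers(points, k=3.0)
print(mask.sum(), "points flagged")
```

With this definition of sigma, Chebyshev's inequality guarantees that at most a fraction 1/k^2 of the points can be flagged, whatever the distribution of the data.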
Is it ever worthwhile to optimize without a working solution? Also, computing a distance matrix over the entire data set rarely needs to be fast, because you only do it once: when you need to know the distance between two points, you just look it up; it's already calculated.
So if you don't have a place to start, here's one. If you want to do this in NumPy without writing any inline Fortran or C, that should be no problem, though you may want to include the small vector-based virtual machine called "numexpr" (available on PyPI, trivial to install), which in this case gave a 5x performance boost over NumPy alone.
Below I've calculated a distance matrix for 10,000 points in 2D space (a 10k x 10k matrix giving the distance between all 10k points). This took 59 seconds on my MBP.
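A distance-matrix computation along these lines could look like the sketch below. It is not the answer's original code: the function name distance_matrix_2d and the random data are assumptions, the point count is reduced to 1,000 to keep the matrix small, and numexpr is treated as optional with a plain NumPy fallback.

```python
import numpy as np

try:
    import numexpr as ne   # optional; assumed to be installed from PyPI
except ImportError:
    ne = None

def distance_matrix_2d(points):
    """Full n x n Euclidean distance matrix for points of shape (n, 2)."""
    x = points[:, 0]
    y = points[:, 1]
    # Broadcasting a column against a row gives all pairwise coordinate differences.
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    if ne is not None:
        # numexpr evaluates the expression in chunks, avoiding full-size
        # temporaries for the intermediate squares and their sum.
        return ne.evaluate("sqrt(dx**2 + dy**2)")
    return np.sqrt(dx ** 2 + dy ** 2)

rng = np.random.default_rng(0)
points = rng.random((1000, 2))   # hypothetical data
D = distance_matrix_2d(points)

# Mean over distinct pairs: the strict upper triangle skips the diagonal and duplicates.
n = len(points)
mean_distance = D[np.triu_indices(n, k=1)].mean()
print(mean_distance)
```

At 10,000 points the matrix alone takes about 800 MB as float64, and at 20,000 points about 3.2 GB, which is worth checking before scaling this up.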
There's no getting around the number of evaluations:
Sum[n-i, {i, 0, n}] = http://www.equationsheet.com/latexrender/pictures/27744c0bd81116aa31c138ab38a2aa87.gif
But you can save yourself the expense of all those square roots if you can get by with an approximate result. It depends on your needs.
If you're going to calculate an average, I would advise you not to try putting all the values into an array before calculating. Just keep a running sum (and a running sum of squares if you need the standard deviation as well) and throw away each value as you calculate it.
Since … and …, I don't know if this means you have to multiply by two somewhere.
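The running-sum idea can be sketched as below, assuming the population standard deviation is wanted; the function name distance_mean_and_std and the random data are made up for illustration. Only one row of distances is held at a time, and the sum of squared distances needs no square roots at all.

```python
import math
import numpy as np

def distance_mean_and_std(points):
    """Mean and population standard deviation of all pairwise distances."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    count = 0
    total = 0.0
    total_sq = 0.0
    for i in range(n - 1):
        # Squared distances from point i to every later point.
        sq = ((points[i + 1:] - points[i]) ** 2).sum(axis=1)
        total += np.sqrt(sq).sum()
        total_sq += sq.sum()   # squared distances are summed without taking roots
        count += len(sq)
    mean = total / count
    variance = total_sq / count - mean ** 2
    return mean, math.sqrt(max(variance, 0.0))

rng = np.random.default_rng(1)
sample = rng.random((50, 3))   # hypothetical data
print(distance_mean_and_std(sample))
```

Because the distance from i to j equals the distance from j to i, visiting each unordered pair once gives the same mean as visiting both orderings, provided the divisor is the number of pairs actually summed.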
If you want a fast and inexact solution, you could probably adapt the Fast Multipole Method algorithm.
Points that are separated by a small distance contribute less to the final average distance, so it would make sense to group points into clusters and compare the distances between clusters.
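The clustering idea can be illustrated with a much simpler scheme than the Fast Multipole Method itself: bin the points into a uniform grid, replace each occupied cell by its centroid, and weight the centroid-to-centroid distances by the number of point pairs they stand for. This is only a rough sketch; the function name approx_mean_distance, the cell_size parameter, and the random data are assumptions, and pairs that fall in the same cell are treated as having distance zero.

```python
import numpy as np

def approx_mean_distance(points, cell_size):
    """Approximate mean pairwise distance using grid-cell centroids."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Assign every point to a grid cell and label the occupied cells 0..k-1.
    cells = np.floor(points / cell_size).astype(np.int64)
    _, labels, counts = np.unique(cells, axis=0,
                                  return_inverse=True, return_counts=True)
    labels = labels.ravel()
    # Centroid of each occupied cell.
    centroids = np.zeros((len(counts), points.shape[1]))
    np.add.at(centroids, labels, points)
    centroids /= counts[:, None]
    # Centroid distances, weighted by how many point pairs each cell pair represents.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    weights = np.outer(counts, counts)
    # Each cell pair appears twice in the matrix; the diagonal contributes zero,
    # which is where same-cell pairs are approximated as distance 0.
    cross_sum = (weights * dist).sum() / 2.0
    return cross_sum / (n * (n - 1) / 2.0)

rng = np.random.default_rng(0)
cloud = rng.random((2000, 2))   # hypothetical data
print(approx_mean_distance(cloud, cell_size=0.1))
```

A smaller cell_size gives a more accurate result at a higher cost; when every point sits in its own cell, the result equals the exact mean.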
In a set of points on a single horizontal axis (1D), the "Euclidean distance" is simply the absolute difference between points, and you can use np.diff, which takes the differences between consecutive elements, to calculate their mean very easily:
which prints:
37.2
You can drop abs if you want negative differences to count toward the resulting mean. Good luck.
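A small sketch of the 1D case follows, using made-up coordinates rather than the data behind the 37.2 figure. Since np.diff only covers consecutive elements, the sketch also shows one way to get the mean over all pairs in 1D by sorting first; that second part is an addition, not something this answer proposed.

```python
import numpy as np

x = np.array([3.0, 10.0, 4.0, 25.0])   # hypothetical 1D coordinates

# Mean absolute difference between consecutive elements, as np.diff provides.
consecutive_mean = np.abs(np.diff(x)).mean()

# Mean distance over all pairs in O(n log n): after sorting, the k-th value is
# added once for each smaller value and subtracted once for each larger value.
xs = np.sort(x)
n = len(xs)
k = np.arange(n)
all_pairs_sum = ((2 * k - n + 1) * xs).sum()
all_pairs_mean = all_pairs_sum / (n * (n - 1) / 2)

print(consecutive_mean, all_pairs_mean)
```

The two numbers generally differ, so which one is appropriate depends on whether "all points" means neighbouring points or every pair.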