How can the Euclidean distance be calculated with NumPy?
I have two points in 3D space:
a = (ax, ay, az)
b = (bx, by, bz)
I want to calculate the distance between them:
dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2)
How do I do this with NumPy? I have:
import numpy
a = numpy.array((ax, ay, az))
b = numpy.array((bx, by, bz))
Following your example, you can just subtract the vectors and take the inner product of the difference with itself; the square root of that value is the distance.
I like np.dot (dot product): the distance is the square root of the dot product of a - b with itself.
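A minimal sketch of the dot-product approach, which also covers the inner-product suggestion above. The coordinate values are made up for illustration; only np.dot and np.sqrt are used.

```python
import numpy as np

# Example coordinates (assumed values, chosen so the distance is easy to check).
a = np.array((1.0, 2.0, 3.0))
b = np.array((4.0, 6.0, 3.0))

diff = a - b                         # component-wise difference
dist = np.sqrt(np.dot(diff, diff))   # square root of the sum of squared components
print(dist)
```

For 1-D arrays np.inner(diff, diff) gives the same value as np.dot(diff, diff), so either call can be used here.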
Since Python 3.8
Since Python 3.8 the math module includes the function math.dist().
See https://docs.python.org/3.8/library/math.html#math.dist.
With Python 3.8, it's very easy.
https://docs.python.org/3/library/math.html#math.dist
Having a and b as you defined them, you can also use:
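One way to do this with a and b as NumPy arrays, sketched under the assumption that the direct sum-of-squares formula from the question is what is meant. The coordinate values are illustrative only.

```python
import numpy as np

# Illustrative coordinates; in the question these come from (ax, ay, az) and (bx, by, bz).
a = np.array((1.0, 2.0, 3.0))
b = np.array((4.0, 6.0, 3.0))

# Square each component of the difference, add them up, take the square root.
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)
```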
Here's some concise code for the Euclidean distance between two points represented as Python lists.
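A short sketch of what such list-based code might look like, using only built-ins. The function name euclidean_distance and the length check are assumptions added for illustration; the same function works for any number of dimensions.

```python
def euclidean_distance(p, q):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5


print(euclidean_distance([1, 2, 3], [4, 6, 3]))
print(euclidean_distance([0, 0, 0, 0, 0], [1, 1, 1, 1, 0]))  # five dimensions
```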
Calculate the Euclidean distance for multidimensional space:
The other answers work for floating-point numbers, but do not correctly compute the distance for integer dtypes, which are subject to overflow and underflow. Note that even scipy.spatial.distance.euclidean has this issue.
This is common, since many image libraries represent an image as an ndarray with dtype="uint8". This means that if you have a greyscale image which consists of very dark grey pixels (say all the pixels have color #000001) and you're diffing it against a black image (#000000), you can end up with x-y consisting of 255 in all cells, which registers as the two images being very far apart from each other.
For unsigned integer types (e.g. uint8), the distance can still be computed safely in NumPy as long as the subtraction never goes below zero. For signed integer types, you can cast to a float first.
For image data specifically, you can use OpenCV's norm method.
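The wraparound described above can be reproduced in a few lines. This is a sketch with made-up four-pixel "images"; subtracting the elementwise minimum from the elementwise maximum is one way (an assumption here, not necessarily the answer's exact code) to keep an unsigned subtraction from going below zero.

```python
import numpy as np

# Two tiny "images": all black versus very dark grey (assumed example data).
x = np.zeros(4, dtype=np.uint8)
y = np.ones(4, dtype=np.uint8)

# Naive: 0 - 1 wraps around to 255 in uint8, so the images look far apart.
naive = np.linalg.norm(x - y)

# Unsigned-safe: always subtract the smaller value from the larger one.
safe_unsigned = np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))

# Alternative: cast to float before subtracting.
safe_cast = np.linalg.norm(x.astype(float) - y.astype(float))

print(naive, safe_unsigned, safe_cast)
```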
You can easily use the formula dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2) from the question directly. It does nothing more than apply Pythagoras' theorem: add the squares of Δx, Δy and Δz and take the square root of the result.
First convert the lists to NumPy arrays and then do this:
print(np.linalg.norm(np.array(a) - np.array(b)))
A second method works directly on the Python lists:
print(np.linalg.norm(np.subtract(a, b)))
If you want something more explicit, you can easily write the formula out yourself: square the elementwise difference, sum the squares, and take the square root.
Even with arrays of 10_000_000 elements this still runs in about 0.1 s on my machine.
First find the difference of the two matrices. Then apply elementwise multiplication with NumPy's multiply function. After that, sum the elements of the resulting matrix. Finally, take the square root of the sum.
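The four steps above can be sketched as follows. The two 2x2 matrices are made-up example data; np.subtract, np.multiply, np.sum and np.sqrt are the only calls used.

```python
import numpy as np

# Example matrices (assumed values).
m1 = np.array([[1.0, 2.0], [3.0, 4.0]])
m2 = np.array([[1.0, 0.0], [0.0, 0.0]])

diff = np.subtract(m1, m2)          # step 1: elementwise difference
squared = np.multiply(diff, diff)   # step 2: elementwise multiplication
total = np.sum(squared)             # step 3: sum of all squared entries
dist = np.sqrt(total)               # step 4: square root of the sum
print(dist)
```

For 2-D inputs this is the same value np.linalg.norm(m1 - m2) returns by default.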
Well, the best way would be the safest and also the fastest.
I would suggest using hypot for reliable results, since the chances of underflow and overflow are very small compared to writing your own square-root calculation.
Let's compare math.hypot and np.hypot with the vanilla
np.sqrt(np.sum((np.array([i, j, k])) ** 2, axis=1))
Speed-wise, math.hypot looks better.
The vanilla formula underflows for very small inputs and overflows for very large ones; hypot shows neither underflow nor overflow for the same inputs.
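A small sketch of the underflow and overflow behaviour described above. The magnitudes 1e-200 and 1e200 are assumed test values, and the two-argument forms of math.hypot and np.hypot are used so the example does not depend on newer Python versions.

```python
import math
import numpy as np

tiny = np.array([1e-200, 1e-200])
huge = np.array([1e200, 1e200])

# Vanilla formula: the intermediate squares leave the float64 range.
with np.errstate(over="ignore", under="ignore"):
    naive_tiny = np.sqrt(np.sum(tiny ** 2))   # squares underflow to 0.0
    naive_huge = np.sqrt(np.sum(huge ** 2))   # squares overflow to inf

# hypot avoids forming the out-of-range squares.
hypot_tiny = math.hypot(1e-200, 1e-200)
hypot_huge = np.hypot(huge[0], huge[1])

print(naive_tiny, hypot_tiny)
print(naive_huge, hypot_huge)
```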
1. SciPy's vectorized cdist() for Euclidean distance matrix
@Nico Schlömer's benchmarks show SciPy's euclidean() function to be much slower than its NumPy counterparts. The reason is that it's meant to work on a pair of points, not an array of points; thus it is not vectorized. Also, his benchmark uses code to find the Euclidean distances between arrays of equal length.
If you need to compute the Euclidean distance matrix between each pair of points from two collections of inputs, then there is another SciPy function, cdist(), that is much faster than NumPy.
Consider the following example where a contains 3 points and b contains 2 points. SciPy's cdist() computes the Euclidean distances between every point in a and every point in b, so in this example it would return a 3x2 matrix.
It is especially useful if we have a collection of points and we want to find the closest distance to each point other than itself; a common use case is in natural language processing. For example, to compute the Euclidean distances between every pair of points in a collection, distance.cdist(a, a) does the job. Since the distance from a point to itself is 0, the diagonal of this matrix will be all zeros.
The same task can be performed with NumPy-only methods using broadcasting. We simply need to add another dimension to one of the arrays.
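The broadcasting approach can be sketched like this, with a holding 3 points and b holding 2 points as in the description. The coordinates are made up; scipy.spatial.distance.cdist(a, b) should produce the same 3x2 matrix but is not imported here so that the sketch depends on NumPy only.

```python
import numpy as np

a = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])   # 3 points in 3D
b = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])   # 2 points in 3D

# Add an axis to each array so every point in a is paired with every point in b.
diff = a[:, None, :] - b[None, :, :]              # shape (3, 2, 3)
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))   # shape (3, 2)

# Pairwise distances within one collection: the diagonal is all zeros.
self_diff = a[:, None, :] - a[None, :, :]
self_matrix = np.sqrt((self_diff ** 2).sum(axis=-1))

print(dist_matrix)
print(self_matrix)
```

The intermediate array has one entry per pair of points per coordinate, so memory use grows with the product of the two collection sizes.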
As mentioned earlier, cdist() is much faster than the NumPy counterparts; a perfplot comparison shows as much.
2. Scikit-learn's euclidean_distances()
Scikit-learn is a pretty big library, so unless you're already using it for something else, it doesn't make much sense to import it only for Euclidean distance computation. For completeness, though, it also has euclidean_distances(), paired_distances() and pairwise_distances() methods that can be used to compute Euclidean distances. It has other convenient pairwise distance computation methods worth checking out.
One useful thing about scikit-learn's methods is that they can handle sparse matrices as is, whereas SciPy/NumPy need sparse matrices converted into arrays to perform the computation, so depending on the size of the data, scikit-learn's methods may be the only ones that run.
The fastest solution I could come up with for a large number of distances is using numexpr. On my machine it is faster than using NumPy's einsum:
Use numpy.linalg.norm:
dist = numpy.linalg.norm(a - b)
This works because the Euclidean distance is the l2 norm, and for vectors the default value of the ord parameter in numpy.linalg.norm (None) gives the 2-norm.
For more theory, see Introduction to Data Mining.
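A brief sketch of numpy.linalg.norm in use, with illustrative coordinates. It shows that the default call and an explicit ord=2 agree for vectors, and that the axis argument gives the distance from one point to each row of an array of points.

```python
import numpy as np

a = np.array((1.0, 2.0, 3.0))
b = np.array((4.0, 6.0, 3.0))

dist = np.linalg.norm(a - b)                  # default ord: 2-norm for vectors
dist_explicit = np.linalg.norm(a - b, ord=2)  # same value, stated explicitly

# Distance from a to each row of an (n, 3) array of points.
points = np.array([[4.0, 6.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 5.0]])
dists = np.linalg.norm(points - a, axis=1)

print(dist, dist_explicit, dists)
```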
Use scipy.spatial.distance.euclidean, which takes the two points as its arguments: scipy.spatial.distance.euclidean(a, b).
For anyone interested in computing multiple distances at once, I've done a little comparison using perfplot (a small project of mine).
The first advice is to organize your data such that the arrays have dimension (3, n) (and are C-contiguous, obviously). If adding happens in the contiguous first dimension, things are faster, and it doesn't matter too much if you use sqrt-sum with axis=0, linalg.norm with axis=0, or a third variant, which is, by a slight margin, the fastest. (That actually holds true for just one row as well.)
The variants where you sum up over the second axis, axis=1, are all substantially slower.
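The (3, n) layout can be sketched as follows with random example data. The sqrt-sum and linalg.norm variants come from the text; the einsum expression is an assumption about what the third variant might look like and is included only for comparison. No timings are claimed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.random((3, n))   # one point per column, C-contiguous
b = rng.random((3, n))
d = a - b

via_sum = np.sqrt(np.sum(d * d, axis=0))            # sqrt-sum over axis 0
via_norm = np.linalg.norm(d, axis=0)                # linalg.norm over axis 0
via_einsum = np.sqrt(np.einsum("ij,ij->j", d, d))   # assumed third variant

print(via_sum[:3], via_norm[:3], via_einsum[:3])
```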
I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need.
Firstly, this function is designed to work over a list and return all of the values, e.g. to compare the distance from pA to the set of points sP.
Remember that function calls and name lookups both carry overhead in Python. So a small wrapper function that computes np.linalg.norm(a - b), stores the result in a local variable and then returns it isn't as innocent as it looks.
Firstly, every time we call it, we have to do a global lookup for "np", a scoped lookup for "linalg" and a scoped lookup for "norm", and the overhead of merely calling the function can equate to dozens of Python instructions.
Lastly, we wasted two operations to store the result and reload it for return.
First pass at improvement: make the lookup faster and skip the store, which gives a far more streamlined function.
The function call overhead still amounts to some work, though. And you'll want to do benchmarks to determine whether you might be better off doing the math yourself:
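A sketch of the three stages just described. The function names are hypothetical, and binding np.linalg.norm as a default argument is one common way to avoid the repeated lookups; whether any of these is faster on a given machine has to be measured.

```python
import numpy as np


def distance_plain(a, b):
    # Looks up np, then linalg, then norm on every call, and stores the result.
    dist = np.linalg.norm(a - b)
    return dist


def distance_bound(a, b, _norm=np.linalg.norm):
    # norm is looked up once, at definition time; the result is returned directly.
    return _norm(a - b)


def distance_manual(a, b):
    # Doing the math yourself for 3D points, with ** 0.5 as the square root.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2) ** 0.5


a = np.array((1.0, 2.0, 3.0))
b = np.array((4.0, 6.0, 3.0))
print(distance_plain(a, b), distance_bound(a, b), distance_manual(a, b))
```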
On some platforms, **0.5 is faster than math.sqrt. Your mileage may vary.
Advanced performance notes
Why are you calculating distance? If the sole purpose is to display it, move along. But if you're comparing distances, doing range checks, etc., I'd like to add some useful performance observations.
Let's take two cases: sorting by distance or culling a list to items that meet a range constraint.
The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)), so we're making a lot of sqrt calls. Math 101: for non-negative values, a < b exactly when a^2 < b^2, so comparing squared distances gives the same ordering as comparing the distances themselves.
In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.
Great, with squared distances both functions (sort_things_by_distance and in_range) no longer do any expensive square roots. That'll be much faster, but before you go further, check yourself: why did sort_things_by_distance need a "naive" disclaimer both times above? Answer at the very bottom (*a1).
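The squared-distance versions of the two cases might look like this. The function names sort_things_by_distance, in_range and distance_sq come from the text; their bodies, the max_range parameter name and the use of plain coordinate tuples are assumptions for illustration.

```python
def distance_sq(left, right):
    """Squared Euclidean distance; no square root is taken."""
    return sum((l - r) ** 2 for l, r in zip(left, right))


def sort_things_by_distance(origin, things):
    # Ordering by squared distance is the same as ordering by distance.
    return sorted(things, key=lambda thing: distance_sq(origin, thing))


def in_range(origin, max_range, things):
    # Square the range once instead of taking a square root per item.
    range_sq = max_range ** 2
    return [thing for thing in things if distance_sq(origin, thing) <= range_sq]


origin = (0, 0, 0)
things = [(3, 0, 0), (1, 0, 0), (0, 2, 0)]
print(sort_things_by_distance(origin, things))
print(in_range(origin, 2, things))
```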
We can improve in_range by converting it to a generator:
This especially has benefits if you are doing something like:
But if the very next thing you are going to do requires a distance, consider yielding tuples:
This can be especially useful if you might chain range checks ('find things that are near X and within Nm of Y', since you don't have to calculate the distance again).
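A sketch of the generator form and the tuple-yielding form. The name in_range_with_dist_sq and the coordinate-tuple data are hypothetical; the point is that any() can stop at the first hit, and that the squared distance is handed to the caller instead of being recomputed.

```python
def distance_sq(left, right):
    return sum((l - r) ** 2 for l, r in zip(left, right))


def in_range(origin, max_range, things):
    """Lazily yield the things within max_range of origin."""
    range_sq = max_range ** 2
    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            yield thing


def in_range_with_dist_sq(origin, max_range, things):
    """Yield (thing, squared distance) so later checks can reuse the value."""
    range_sq = max_range ** 2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq:
            yield thing, dist_sq


origin = (0, 0, 0)
things = [(3, 0, 0), (1, 0, 0), (0, 2, 0)]

# any() stops consuming the generator at the first item in range.
anything_close = any(in_range(origin, 2, things))
hits = list(in_range_with_dist_sq(origin, 2, things))
print(anything_close, hits)
```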
But what about if we're searching a really large list of things and we anticipate a lot of them not being worth consideration?
There is actually a very simple optimization:
Whether this is useful will depend on the size of 'things'.
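One possible form of such an optimization, sketched under the assumption that the idea is to reject a point as soon as a partial sum of squares already exceeds the squared range. The function name and the tuple-based points are hypothetical.

```python
def in_range_early_exit(origin, max_range, things):
    """Yield (thing, squared distance), skipping far-away things early."""
    range_sq = max_range ** 2
    ox, oy, oz = origin
    for x, y, z in things:
        dist_sq = (ox - x) ** 2
        if dist_sq > range_sq:
            continue                  # too far along x alone
        dist_sq += (oy - y) ** 2
        if dist_sq > range_sq:
            continue                  # too far along x and y combined
        dist_sq += (oz - z) ** 2
        if dist_sq <= range_sq:
            yield (x, y, z), dist_sq


origin = (0, 0, 0)
things = [(3, 0, 0), (1, 0, 0), (0, 2, 0), (1, 2, 0), (0, 0, -2)]
print(list(in_range_early_exit(origin, 2, things)))
```

The extra comparisons only pay off when many candidates are rejected on the first or second axis.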
And again, consider yielding the dist_sq. Our hotdog example then becomes:
(*a1: sort_things_by_distance's sort key calls distance_sq for every single item, and that innocent looking key is a lambda, which is a second function that has to be invoked...)
Another instance of this problem solving method:
Starting with Python 3.8, the math module directly provides the dist function, which returns the Euclidean distance between two points (given as tuples or lists of coordinates). It works the same way if you're working with lists.
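A minimal example of math.dist with both tuples and lists, assuming Python 3.8 or later. The coordinates are illustrative.

```python
import math

# Points given as tuples.
d_tuples = math.dist((1, 2, 3), (4, 6, 3))

# Points given as lists work the same way.
d_lists = math.dist([1, 2, 3], [4, 6, 3])

print(d_tuples, d_lists)
```

Both points must have the same number of coordinates, otherwise math.dist raises a ValueError.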
It can be done like the following. I don't know how fast it is, but it's not using NumPy.
A nice one-liner:
However, if speed is a concern I would recommend experimenting on your machine. I've found that using the math library's sqrt with the ** operator for the square is much faster on my machine than the one-liner NumPy solution.
I ran my tests using this simple program:
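A small timing sketch in the same spirit, not the original test program. The names math_calc_dist and numpy_calc_dist come from the text, but their bodies, the sample points and the repetition count are assumptions, and the timings will differ by machine and Python version.

```python
import math
import timeit

import numpy as np


def math_calc_dist(p1, p2):
    # math.sqrt with the ** operator, on plain tuples.
    return math.sqrt((p2[0] - p1[0]) ** 2
                     + (p2[1] - p1[1]) ** 2
                     + (p2[2] - p1[2]) ** 2)


def numpy_calc_dist(p1, p2):
    # NumPy one-liner, including the cost of converting the tuples to arrays.
    return np.linalg.norm(np.array(p1) - np.array(p2))


p1, p2 = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
math_time = timeit.timeit(lambda: math_calc_dist(p1, p2), number=10_000)
numpy_time = timeit.timeit(lambda: numpy_calc_dist(p1, p2), number=10_000)
print(f"math: {math_time:.4f} s, numpy: {numpy_time:.4f} s")
```

For a single pair of small points, array creation and call overhead tend to dominate the NumPy version; the picture can change when many distances are computed in one vectorized call.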
On my machine, math_calc_dist runs much faster than numpy_calc_dist: 1.5 seconds versus 23.5 seconds.
To get a measurable difference between fastest_calc_dist and math_calc_dist I had to up TOTAL_LOCATIONS to 6000. Then fastest_calc_dist takes ~50 seconds while math_calc_dist takes ~60 seconds.
You can also experiment with numpy.sqrt and numpy.square, though both were slower than the math alternatives on my machine.
My tests were run with Python 2.6.6.
I found a 'dist' function in matplotlib.mlab, but I don't think it's handy enough.
I'm posting it here just for reference.