如何使用 NumPy 计算欧氏距离?

发布于 2024-08-04 11:45:44 字数 413 浏览 9 评论 0原文

我在 3D 空间中有两个点:

a = (ax, ay, az)
b = (bx, by, bz)

我想计算它们之间的距离:

dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2)

如何使用 NumPy 执行此操作?我有:

import numpy
a = numpy.array((ax, ay, az))
b = numpy.array((bx, by, bz))

I have two points in 3D space:

a = (ax, ay, az)
b = (bx, by, bz)

I want to calculate the distance between them:

dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2)

How do I do this with NumPy? I have:

import numpy
a = numpy.array((ax, ay, az))
b = numpy.array((bx, by, bz))

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(26

汹涌人海 2024-08-11 11:45:45

您只需减去向量,然后减去内积即可。

按照你的例子,

a = numpy.array((xa, ya, za))
b = numpy.array((xb, yb, zb))

tmp = a - b
sum_squared = numpy.dot(tmp.T, tmp)
result = numpy.sqrt(sum_squared)

You can just subtract the vectors and then innerproduct.

Following your example,

a = numpy.array((xa, ya, za))
b = numpy.array((xb, yb, zb))

tmp = a - b
sum_squared = numpy.dot(tmp.T, tmp)
result = numpy.sqrt(sum_squared)
几度春秋 2024-08-11 11:45:45

我喜欢 np.dot (点积):

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))

distance = (np.dot(a-b,a-b))**.5

I like np.dot (dot product):

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))

distance = (np.dot(a-b,a-b))**.5
め七分饶幸 2024-08-11 11:45:45

自 Python 3.8

自 Python 3.8 以来,math 模块包含函数 math.dist()
请参阅此处 https://docs.python.org/3.8/library/math .html#math.dist

ma​​th.dist(p1, p2)
返回两点 p1 和 p2 之间的欧几里得距离,
每个都以坐标序列(或可迭代)的形式给出。

import math
print( math.dist( (0,0),   (1,1)   )) # sqrt(2) -> 1.4142
print( math.dist( (0,0,0), (1,1,1) )) # sqrt(3) -> 1.7321

Since Python 3.8

Since Python 3.8 the math module includes the function math.dist().
See here https://docs.python.org/3.8/library/math.html#math.dist.

math.dist(p1, p2)
Return the Euclidean distance between two points p1 and p2,
each given as a sequence (or iterable) of coordinates.

import math
print( math.dist( (0,0),   (1,1)   )) # sqrt(2) -> 1.4142
print( math.dist( (0,0,0), (1,1,1) )) # sqrt(3) -> 1.7321
云胡 2024-08-11 11:45:45

使用 Python 3.8,这非常容易。

https://docs.python.org/3/library/math.html #math.dist

math.dist(p, q)

返回给定的两点 p 和 q 之间的欧几里得距离
作为坐标序列(或可迭代)。两点必须有
相同的维度。

大致相当于:

sqrt(sum((px - qx) ** 2.0 for px, qx in zip(p, q)))

With Python 3.8, it's very easy.

https://docs.python.org/3/library/math.html#math.dist

math.dist(p, q)

Return the Euclidean distance between two points p and q, each given
as a sequence (or iterable) of coordinates. The two points must have
the same dimension.

Roughly equivalent to:

sqrt(sum((px - qx) ** 2.0 for px, qx in zip(p, q)))

柒夜笙歌凉 2024-08-11 11:45:45

定义了 ab 后,您还可以使用:

distance = np.sqrt(np.sum((a-b)**2))

Having a and b as you defined them, you can use also:

distance = np.sqrt(np.sum((a-b)**2))
冰雪梦之恋 2024-08-11 11:45:45

下面是 Python 中欧几里得距离的一些简洁代码,给出了在 Python 中表示为列表的两个点。

def distance(v1,v2): 
    return sum([(x-y)**2 for (x,y) in zip(v1,v2)])**(0.5)

Here's some concise code for Euclidean distance in Python given two points represented as lists in Python.

def distance(v1,v2): 
    return sum([(x-y)**2 for (x,y) in zip(v1,v2)])**(0.5)
轻拂→两袖风尘 2024-08-11 11:45:45
import math

dist = math.hypot(math.hypot(xa-xb, ya-yb), za-zb)
import math

dist = math.hypot(math.hypot(xa-xb, ya-yb), za-zb)
吃颗糖壮壮胆 2024-08-11 11:45:45

计算多维空间的欧几里得距离:

 import math

 x = [1, 2, 6] 
 y = [-2, 3, 2]

 dist = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(x, y)]))
 5.0990195135927845

Calculate the Euclidean distance for multidimensional space:

 import math

 x = [1, 2, 6] 
 y = [-2, 3, 2]

 dist = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(x, y)]))
 5.0990195135927845
唯憾梦倾城 2024-08-11 11:45:45

其他答案适用于浮点数,但不能正确计算容易上溢和下溢的整数数据类型的距离。请注意,即使 scipy.distance.euclidean 也存在此问题:

>>> a1 = np.array([1], dtype='uint8')
>>> a2 = np.array([2], dtype='uint8')
>>> a1 - a2
array([255], dtype=uint8)
>>> np.linalg.norm(a1 - a2)
255.0
>>> from scipy.spatial import distance
>>> distance.euclidean(a1, a2)
255.0

这很常见,因为许多图像库将图像表示为 dtype="uint8" 的 ndarray。这意味着,如果您有一个由非常深的灰色像素组成的灰度图像(假设所有像素都有颜色#000001),并且您将其与黑色图像(#000000)进行比较code>),您最终可能会在所有单元格中得到由 255 组成的 xy,这表明两个图像彼此相距很远。对于无符号整数类型(例如 uint8),您可以安全地计算 numpy 中的距离:

np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))

对于有符号整数类型,您可以先转换为浮点数:

np.linalg.norm(x.astype("float") - y.astype("float"))

对于图像数据,您可以使用 opencv 的范数方法:

import cv2
cv2.norm(x, y, cv2.NORM_L2)

The other answers work for floating point numbers, but do not correctly compute the distance for integer dtypes which are subject to overflow and underflow. Note that even scipy.distance.euclidean has this issue:

>>> a1 = np.array([1], dtype='uint8')
>>> a2 = np.array([2], dtype='uint8')
>>> a1 - a2
array([255], dtype=uint8)
>>> np.linalg.norm(a1 - a2)
255.0
>>> from scipy.spatial import distance
>>> distance.euclidean(a1, a2)
255.0

This is common, since many image libraries represent an image as an ndarray with dtype="uint8". This means that if you have a greyscale image which consists of very dark grey pixels (say all the pixels have color #000001) and you're diffing it against black image (#000000), you can end up with x-y consisting of 255 in all cells, which registers as the two images being very far apart from each other. For unsigned integer types (e.g. uint8), you can safely compute the distance in numpy as:

np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))

For signed integer types, you can cast to a float first:

np.linalg.norm(x.astype("float") - y.astype("float"))

For image data specifically, you can use opencv's norm method:

import cv2
cv2.norm(x, y, cv2.NORM_L2)
睡美人的小仙女 2024-08-11 11:45:45
import numpy as np
from scipy.spatial import distance
input_arr = np.array([[0,3,0],[2,0,0],[0,1,3],[0,1,2],[-1,0,1],[1,1,1]]) 
test_case = np.array([0,0,0])
dst=[]
for i in range(0,6):
    temp = distance.euclidean(test_case,input_arr[i])
    dst.append(temp)
print(dst)
import numpy as np
from scipy.spatial import distance
input_arr = np.array([[0,3,0],[2,0,0],[0,1,3],[0,1,2],[-1,0,1],[1,1,1]]) 
test_case = np.array([0,0,0])
dst=[]
for i in range(0,6):
    temp = distance.euclidean(test_case,input_arr[i])
    dst.append(temp)
print(dst)
沉溺在你眼里的海 2024-08-11 11:45:45

您可以轻松地使用该公式

distance = np.sqrt(np.sum(np.square(a-b)))

,该公式实际上无非是使用毕达哥拉斯定理来计算距离,方法是将 Δx、Δy 和 Δz 的平方相加并对结果求根。

You can easily use the formula

distance = np.sqrt(np.sum(np.square(a-b)))

which does actually nothing more than using Pythagoras' theorem to calculate the distance, by adding the squares of Δx, Δy and Δz and rooting the result.

冷清清 2024-08-11 11:45:45
import numpy as np
# any two python array as two points
a = [0, 0]
b = [3, 4]

首先将列表更改为numpy array,然后执行以下操作:print(np.linalg.norm(np.array(a) - np.array(b)))。直接从 python 列表中获取的第二种方法为: print(np.linalg.norm(np.subtract(a,b)))

import numpy as np
# any two python array as two points
a = [0, 0]
b = [3, 4]

You first change list to numpy array and do like this: print(np.linalg.norm(np.array(a) - np.array(b))). Second method directly from python list as: print(np.linalg.norm(np.subtract(a,b)))

失去的东西太少 2024-08-11 11:45:45

如果你想要更明确的东西,你可以轻松地编写如下公式:

np.sqrt(np.sum((a-b)**2))

即使有 10_000_000 个元素的数组,它在我的机器上仍然以 0.1 秒的速度运行。

If you want something more explicit you can easily write the formula like this:

np.sqrt(np.sum((a-b)**2))

Even with arrays of 10_000_000 elements this still runs at 0.1s on my machine.

狼性发作 2024-08-11 11:45:45

首先求两个矩阵的差。然后,使用 numpy 的乘法命令应用元素乘法。然后,求元素相乘的新矩阵的总和。最后,求总和的平方根。

def findEuclideanDistance(a, b):
    euclidean_distance = a - b
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance

Find difference of two matrices first. Then, apply element wise multiplication with numpy's multiply command. After then, find summation of the element wise multiplied new matrix. Finally, find square root of the summation.

def findEuclideanDistance(a, b):
    euclidean_distance = a - b
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance
茶花眉 2024-08-11 11:45:45

使用 NumPy 或一般的 Python 执行此操作的最佳方法是什么?我有:

最好的方法是最安全也是最快的

我建议使用hypot来获得可靠的结果,与编写自己的sqroot计算器相比,下溢和溢出的机会非常小

让我们看看math.hypot,np.hypot与vanilla np.sqrt(np.sum((np.array([i, j, k])) ** 2, axis=1))

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square

速度明智的 math.hypot 看起来更好

%%timeit
math.hypot(i, j, k)
# 100 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%%timeit
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# 6.41 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

下溢

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

上溢

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

没有下溢

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

没有上溢

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

参考

What's the best way to do this with NumPy, or with Python in general? I have:

Well best way would be safest and also the fastest

I would suggest hypot usage for reliable results for chances of underflow and overflow are very little compared to writing own sqroot calculator

Lets see math.hypot, np.hypot vs vanilla np.sqrt(np.sum((np.array([i, j, k])) ** 2, axis=1))

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square

Speed wise math.hypot look better

%%timeit
math.hypot(i, j, k)
# 100 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%%timeit
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# 6.41 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

Refer

清浅ˋ旧时光 2024-08-11 11:45:45

1. SciPy 的欧几里德距离矩阵矢量化 cdist()

@Nico Schlömer 的基准测试显示 scipy 的euclidean() 函数比 numpy 函数慢得多。原因是它适用于一对点,而不是一组点;因此没有矢量化。此外,他的基准测试使用代码来查找相等长度数组之间的欧几里得距离。

如果您需要计算两个输入集合中每对点之间的欧几里德距离矩阵,那么还有另一个 SciPy 函数 cdist() ,它比 numpy 快得多。

考虑以下示例,其中 a 包含 3 个点,b 包含 2 个点。 SciPy 的 cdist() 计算 a 中的每个点到 b< 中的每个点之间的欧几里得距离/code>,所以在这个例子中,它将返回一个 3x2 矩阵。

import numpy as np
from scipy.spatial import distance

a = [(1, 2, 3), (3, 4, 5), (2, 3, 6)]
b = [(1, 2, 3), (4, 5, 6)]

dsts1 = distance.cdist(a, b)

# array([[0.        , 5.19615242],
#        [3.46410162, 1.73205081],
#        [3.31662479, 2.82842712]])

如果我们有一个点的集合,并且我们想要找到除自身之外的每个点的最近距离,那么它特别有用;一个常见的用例是自然语言处理。例如,要计算集合中每对点之间的欧几里得距离,distance.cdist(a, a) 即可完成这项工作。由于点到自身的距离为 0,因此该矩阵的对角线将全部为零。

可以使用广播通过纯 numpy 方法执行相同的任务。我们只需向其中一个数组添加另一个维度即可。

# using `linalg.norm`
dsts2 = np.linalg.norm(np.array(a)[:, None] - b, axis=-1)

# using a `sqrt` + `sum` + `square`
dsts3 = np.sqrt(np.sum((np.array(a)[:, None] - b)**2, axis=-1))

# equality check
np.allclose(dsts1, dsts2) and np.allclose(dsts1, dsts3)        # True

如前所述,cdist() 比 numpy 的对应函数快得多。下面的 perfplot 显示了同样的内容。1

result


2. Scikit-learn 的 euclidean_distances()

Scikit-learn 是一个相当大的库,所以除非你不使用对于其他东西,仅将其导入用于欧几里得距离计算没有多大意义,但为了完整性,它还有 euclidean_distances()paired_distances() 和 < code>pairwise_distances() 可用于计算欧几里得距离的方法。它还有其他方便的成对距离计算方法值得一试< /a>.

scikit-learn 方法的一个有用之处是它可以按原样处理稀疏矩阵,而 scipy/numpy 需要将稀疏矩阵转换为数组才能执行计算,因此根据数据的大小,scikit-learn 的方法可能是唯一运行的函数。

示例:

from scipy import sparse
from sklearn.metrics import pairwise

A = sparse.random(1_000_000, 3)
b = [(1, 2, 3), (4, 5, 6)]

dsts = pairwise.euclidean_distances(A, b)

1 用于生成性能图的代码:

import numpy as np
from scipy.spatial import distance
import perfplot
import matplotlib.pyplot as plt

def sqrt_sum(arr):
    return np.sqrt(np.sum((arr[:, None] - arr) ** 2, axis=-1))

def linalg_norm(arr):
    return np.linalg.norm(arr[:, None] - arr, axis=-1)

def scipy_cdist(arr):
    return distance.cdist(arr, arr)

perfplot.plot(
    setup=lambda n: np.random.rand(n, 3),
    n_range=[2 ** k for k in range(14)],
    kernels=[sqrt_sum, scipy_cdist, linalg_norm],
    title="Euclidean distance between arrays of 3-D points",
    xlabel="len(x), len(y)",
    equality_check=np.allclose
);

1. SciPy's vectorized cdist() for Euclidean distance matrix

@Nico Schlömer's benchmarks show scipy's euclidean() function to be much slower than its numpy counterparts. The reason is that it's meant to work on a pair of points, not an array of points; thus not vectorized. Also, his benchmark uses code to find the Euclidean distances between arrays of equal length.

If you need to compute the Euclidean distance matrix between each pair of points from two collections of inputs, then there is another SciPy function, cdist(), that is much faster than numpy.

Consider the following example where a contains 3 points and b contains 2 points. SciPy's cdist() computes the Euclidean distances between every point in a to every point in b, so in this example, it would return a 3x2 matrix.

import numpy as np
from scipy.spatial import distance

a = [(1, 2, 3), (3, 4, 5), (2, 3, 6)]
b = [(1, 2, 3), (4, 5, 6)]

dsts1 = distance.cdist(a, b)

# array([[0.        , 5.19615242],
#        [3.46410162, 1.73205081],
#        [3.31662479, 2.82842712]])

It is especially useful if we have a collection of points and we want to find the closest distance to each point other than itself; a common use-case is in natural language processing. For example, to compute the Euclidean distances between every pair of points in a collection, distance.cdist(a, a) does the job. Since the distance from a point to itself is 0, the diagonals of this matrix will be all zero.

The same task can be performed with numpy-only methods using broadcasting. We simply need to add another dimension to one of the arrays.

# using `linalg.norm`
dsts2 = np.linalg.norm(np.array(a)[:, None] - b, axis=-1)

# using a `sqrt` + `sum` + `square`
dsts3 = np.sqrt(np.sum((np.array(a)[:, None] - b)**2, axis=-1))

# equality check
np.allclose(dsts1, dsts2) and np.allclose(dsts1, dsts3)        # True

As mentioned earlier, cdist() is much faster than the numpy counterparts. The following perfplot shows as much.1

result


2. Scikit-learn's euclidean_distances()

Scikit-learn is a pretty big library so unless you're not using it for something else, it doesn't make much sense to import it only for Euclidean distance computation but for completeness, it also has euclidean_distances(), paired_distances() and pairwise_distances() methods that can be used to compute Euclidean distances. It has other convenient pairwise distance computation methods worth checking out.

One useful thing about scikit-learn's methods is that it can handle sparse matrices as is, whereas scipy/numpy will need to have sparse matrices converted into arrays to perform computation so depending on the size of the data, scikit-learn's methods may be the only function that runs.

An example:

from scipy import sparse
from sklearn.metrics import pairwise

A = sparse.random(1_000_000, 3)
b = [(1, 2, 3), (4, 5, 6)]

dsts = pairwise.euclidean_distances(A, b)

1 The code used to produce the perfplot:

import numpy as np
from scipy.spatial import distance
import perfplot
import matplotlib.pyplot as plt

def sqrt_sum(arr):
    return np.sqrt(np.sum((arr[:, None] - arr) ** 2, axis=-1))

def linalg_norm(arr):
    return np.linalg.norm(arr[:, None] - arr, axis=-1)

def scipy_cdist(arr):
    return distance.cdist(arr, arr)

perfplot.plot(
    setup=lambda n: np.random.rand(n, 3),
    n_range=[2 ** k for k in range(14)],
    kernels=[sqrt_sum, scipy_cdist, linalg_norm],
    title="Euclidean distance between arrays of 3-D points",
    xlabel="len(x), len(y)",
    equality_check=np.allclose
);
瑕疵 2024-08-11 11:45:45

对于大量距离,我能想到的最快解决方案是使用 numexpr。在我的机器上,它比使用 numpy einsum 更快:

import numexpr as ne
import numpy as np

a_min_b=a-b
np.sqrt(ne.evaluate("sum((a_min_b)**2,axis=1)"))

The fastest solution I could come up with for large number of distances is using numexpr. On my machine it is faster than using numpy einsum:

import numexpr as ne
import numpy as np

a_min_b=a-b
np.sqrt(ne.evaluate("sum((a_min_b)**2,axis=1)"))
暖树树初阳… 2024-08-11 11:45:44

使用 numpy.linalg.norm

dist = numpy.linalg.norm(a-b)

这是有效的,因为欧几里德距离l2范数,并且ord的默认值 > numpy.linalg.norm 中的参数为 2。
有关更多理论,请参阅数据挖掘简介< em>:

在此处输入图像描述

Use numpy.linalg.norm:

dist = numpy.linalg.norm(a-b)

This works because the Euclidean distance is the l2 norm, and the default value of the ord parameter in numpy.linalg.norm is 2.
For more theory, see Introduction to Data Mining:

enter image description here

倾`听者〃 2024-08-11 11:45:44

使用 scipy.spatial.distance.euclidean

from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)

Use scipy.spatial.distance.euclidean:

from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)
偷得浮生 2024-08-11 11:45:44

对于有兴趣同时计算多个距离的人,我使用 perfplot (一个小项目)做了一些比较我的)。

第一个建议是组织数据,使数组具有维度 (3, n) (并且显然是 C 连续的)。如果添加发生在连续的第一维中,则速度会更快,并且如果将 sqrt-sumaxis=0linalg 一起使用,也不会太重要.normaxis=0,或者

a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->j', a_min_b, a_min_b))

说,它是最快的变体,略有优势。 (这实际上也只适用于一行。)

在第二个轴 axis=1 上求和的变体都慢得多。

输入图像描述这里


重现情节的代码:

import numpy
import perfplot
from scipy.spatial import distance


def linalg_norm(data):
    a, b = data[0]
    return numpy.linalg.norm(a - b, axis=1)


def linalg_norm_T(data):
    a, b = data[1]
    return numpy.linalg.norm(a - b, axis=0)


def sqrt_sum(data):
    a, b = data[0]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=1))


def sqrt_sum_T(data):
    a, b = data[1]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=0))


def scipy_distance(data):
    a, b = data[0]
    return list(map(distance.euclidean, a, b))


def sqrt_einsum(data):
    a, b = data[0]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->i", a_min_b, a_min_b))


def sqrt_einsum_T(data):
    a, b = data[1]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->j", a_min_b, a_min_b))


def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    out0 = numpy.array([a, b])
    out1 = numpy.array([a.T, b.T])
    return out0, out1


b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(22)],
    kernels=[
        linalg_norm,
        linalg_norm_T,
        scipy_distance,
        sqrt_sum,
        sqrt_sum_T,
        sqrt_einsum,
        sqrt_einsum_T,
    ],
    xlabel="len(x), len(y)",
)
b.save("norm.png")

For anyone interested in computing multiple distances at once, I've done a little comparison using perfplot (a small project of mine).

The first advice is to organize your data such that the arrays have dimension (3, n) (and are C-contiguous obviously). If adding happens in the contiguous first dimension, things are faster, and it doesn't matter too much if you use sqrt-sum with axis=0, linalg.norm with axis=0, or

a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->j', a_min_b, a_min_b))

which is, by a slight margin, the fastest variant. (That actually holds true for just one row as well.)

The variants where you sum up over the second axis, axis=1, are all substantially slower.

enter image description here


Code to reproduce the plot:

import numpy
import perfplot
from scipy.spatial import distance


def linalg_norm(data):
    a, b = data[0]
    return numpy.linalg.norm(a - b, axis=1)


def linalg_norm_T(data):
    a, b = data[1]
    return numpy.linalg.norm(a - b, axis=0)


def sqrt_sum(data):
    a, b = data[0]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=1))


def sqrt_sum_T(data):
    a, b = data[1]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=0))


def scipy_distance(data):
    a, b = data[0]
    return list(map(distance.euclidean, a, b))


def sqrt_einsum(data):
    a, b = data[0]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->i", a_min_b, a_min_b))


def sqrt_einsum_T(data):
    a, b = data[1]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->j", a_min_b, a_min_b))


def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    out0 = numpy.array([a, b])
    out1 = numpy.array([a.T, b.T])
    return out0, out1


b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(22)],
    kernels=[
        linalg_norm,
        linalg_norm_T,
        scipy_distance,
        sqrt_sum,
        sqrt_sum_T,
        sqrt_einsum,
        sqrt_einsum_T,
    ],
    xlabel="len(x), len(y)",
)
b.save("norm.png")
陌上青苔 2024-08-11 11:45:44

我想通过各种性能说明来阐述简单的答案。 np.linalg.norm 的功能可能会超出您的需要:

dist = numpy.linalg.norm(a-b)

首先 - 该函数旨在处理列表并返回所有值,例如比较从 pA 到点集的距离sP

sP = set(points)
pA = point
distances = np.linalg.norm(sP - pA, ord=2, axis=1.)  # 'distances' is a list

记住几件事:

  • Python 函数调用的开销很大。
  • [常规] Python 不缓存名称查找。

所以

def distance(pointA, pointB):
    dist = np.linalg.norm(pointA - pointB)
    return dist

并不像看上去那么无辜。

>>> dis.dis(distance)
  2           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (linalg)
              4 LOAD_ATTR                2 (norm)
              6 LOAD_FAST                0 (pointA)
              8 LOAD_FAST                1 (pointB)
             10 BINARY_SUBTRACT
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (dist)

  3          16 LOAD_FAST                2 (dist)
             18 RETURN_VALUE

首先 - 每次我们调用它时,我们都必须对“np”进行全局查找,对“linalg”进行范围查找,对“norm”进行范围查找,以及仅仅调用的开销一个函数可以相当于几十条Python指令。

最后,我们浪费了两个操作来存储结果并重新加载它以返回...

首先进行改进:使查找更快,跳过存储

def distance(pointA, pointB, _norm=np.linalg.norm):
    return _norm(pointA - pointB)

我们得到了更加简化的结果:

>>> dis.dis(distance)
  2           0 LOAD_FAST                2 (_norm)
              2 LOAD_FAST                0 (pointA)
              4 LOAD_FAST                1 (pointB)
              6 BINARY_SUBTRACT
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

不过,函数调用开销仍然需要一些工作。您需要进行基准测试来确定自己是否可以更好地进行数学计算:

def distance(pointA, pointB):
    return (
        ((pointA.x - pointB.x) ** 2) +
        ((pointA.y - pointB.y) ** 2) +
        ((pointA.z - pointB.z) ** 2)
    ) ** 0.5  # fast sqrt

在某些平台上,**0.5math.sqrt 更快。您的里程可能会有所不同。

**** 高级性能说明。

为什么要计算距离?如果唯一的目的是展示它,

 print("The target is %.2fm away" % (distance(a, b)))

那就继续吧。但如果您要比较距离、进行范围检查等,我想添加一些有用的性能观察结果。

让我们看两种情况:按距离排序或剔除列表以找到满足范围约束的项目。

# Ultra naive implementations. Hold onto your hat.

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance(origin, thing))

def in_range(origin, range, things):
    things_in_range = []
    for thing in things:
        if distance(origin, thing) <= range:
            things_in_range.append(thing)

我们需要记住的第一件事是,我们使用 毕达哥拉斯 来计算距离 (< code>dist = sqrt(x^2 + y^2 + z^2)) 因此我们进行了大量 sqrt 调用。数学 101:

dist = root ( x^2 + y^2 + z^2 )
:.
dist^2 = x^2 + y^2 + z^2
and
sq(N) < sq(M) iff M > N
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M

简而言之:直到我们真正需要以 X 而不是 X^2 为单位的距离为止,我们可以消除计算中最困难的部分。

# Still naive, but much faster.

def distance_sq(left, right):
    """ Returns the square of the distance between left and right. """
    return (
        ((left.x - right.x) ** 2) +
        ((left.y - right.y) ** 2) +
        ((left.z - right.z) ** 2)
    )

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance_sq(origin, thing))

def in_range(origin, range, things):
    things_in_range = []

    # Remember that sqrt(N)**2 == N, so if we square
    # range, we don't need to root the distances.
    range_sq = range**2

    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            things_in_range.append(thing)

太好了,这两个函数不再执行任何昂贵的平方根。这会快得多,但在进一步之前,请检查一下自己:为什么上述两次 sort_things_by_distance 都需要“天真的”免责声明?答案在最底部(*a1)。

我们可以通过将 in_range 转换为生成器来改进它:

def in_range(origin, range, things):
    range_sq = range**2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)

如果您正在做类似的事情,这尤其有好处:

if any(in_range(origin, max_dist, things)):
    ...

但是如果您要做的下一件事需要距离,

for nearby in in_range(origin, walking_distance, hotdog_stands):
    print("%s %.2fm" % (nearby.name, distance(origin, nearby)))

请考虑生成元组:

def in_range_with_dist_sq(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq: yield (thing, dist_sq)

如果您可能链接,这可能特别有用范围检查(“查找 X 附近且 Y 的 Nm 范围内的物体”,因为您不必再​​次计算距离)。

但是,如果我们正在搜索一个非常大的事物列表,并且我们预计其中很多不值得考虑,该怎么办?

实际上有一个非常简单的优化:

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = (origin.x - thing.x) ** 2
        if dist_sq <= range_sq:
            dist_sq += (origin.y - thing.y) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing

这是否有用将取决于“事物”的大小。

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    if len(things) >= 4096:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2
                if dist_sq <= range_sq:
                    dist_sq += (origin.z - thing.z) ** 2
                    if dist_sq <= range_sq:
                        yield thing
    elif len(things) > 32:
        for things in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
    else:
        ... just calculate distance and range-check it ...

再次考虑生成 dist_sq。我们的热狗示例就变成了:

# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
info = (stand, dist_sq**0.5 for stand, dist_sq in info)
for stand, dist in info:
    print("%s %.2fm" % (stand, dist))

(*a1:sort_things_by_distance的排序键为每个项目调用distance_sq,而那个看起来无辜的键是一个lambda,它是必须调用的第二个函数......)

I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need:

dist = numpy.linalg.norm(a-b)

Firstly - this function is designed to work over a list and return all of the values, e.g. to compare the distance from pA to the set of points sP:

sP = set(points)
pA = point
distances = np.linalg.norm(sP - pA, ord=2, axis=1.)  # 'distances' is a list

Remember several things:

  • Python function calls are expensive.
  • [Regular] Python doesn't cache name lookups.

So

def distance(pointA, pointB):
    dist = np.linalg.norm(pointA - pointB)
    return dist

isn't as innocent as it looks.

>>> dis.dis(distance)
  2           0 LOAD_GLOBAL              0 (np)
              2 LOAD_ATTR                1 (linalg)
              4 LOAD_ATTR                2 (norm)
              6 LOAD_FAST                0 (pointA)
              8 LOAD_FAST                1 (pointB)
             10 BINARY_SUBTRACT
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (dist)

  3          16 LOAD_FAST                2 (dist)
             18 RETURN_VALUE

Firstly - every time we call it, we have to do a global lookup for "np", a scoped lookup for "linalg" and a scoped lookup for "norm", and the overhead of merely calling the function can equate to dozens of python instructions.

Lastly, we wasted two operations on to store the result and reload it for return...

First pass at improvement: make the lookup faster, skip the store

def distance(pointA, pointB, _norm=np.linalg.norm):
    return _norm(pointA - pointB)

We get the far more streamlined:

>>> dis.dis(distance)
  2           0 LOAD_FAST                2 (_norm)
              2 LOAD_FAST                0 (pointA)
              4 LOAD_FAST                1 (pointB)
              6 BINARY_SUBTRACT
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

The function call overhead still amounts to some work, though. And you'll want to do benchmarks to determine whether you might be better doing the math yourself:

def distance(pointA, pointB):
    return (
        ((pointA.x - pointB.x) ** 2) +
        ((pointA.y - pointB.y) ** 2) +
        ((pointA.z - pointB.z) ** 2)
    ) ** 0.5  # fast sqrt

On some platforms, **0.5 is faster than math.sqrt. Your mileage may vary.

**** Advanced performance notes.

Why are you calculating distance? If the sole purpose is to display it,

 print("The target is %.2fm away" % (distance(a, b)))

move along. But if you're comparing distances, doing range checks, etc., I'd like to add some useful performance observations.

Let’s take two cases: sorting by distance or culling a list to items that meet a range constraint.

# Ultra naive implementations. Hold onto your hat.

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance(origin, thing))

def in_range(origin, range, things):
    things_in_range = []
    for thing in things:
        if distance(origin, thing) <= range:
            things_in_range.append(thing)

The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)) so we're making a lot of sqrt calls. Math 101:

dist = root ( x^2 + y^2 + z^2 )
:.
dist^2 = x^2 + y^2 + z^2
and
sq(N) < sq(M) iff M > N
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M

In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.

# Still naive, but much faster.

def distance_sq(left, right):
    """ Returns the square of the distance between left and right. """
    return (
        ((left.x - right.x) ** 2) +
        ((left.y - right.y) ** 2) +
        ((left.z - right.z) ** 2)
    )

def sort_things_by_distance(origin, things):
    return things.sort(key=lambda thing: distance_sq(origin, thing))

def in_range(origin, range, things):
    things_in_range = []

    # Remember that sqrt(N)**2 == N, so if we square
    # range, we don't need to root the distances.
    range_sq = range**2

    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            things_in_range.append(thing)

Great, both functions no-longer do any expensive square roots. That'll be much faster, but before you go further, check yourself: why did sort_things_by_distance need a "naive" disclaimer both times above? Answer at the very bottom (*a1).

We can improve in_range by converting it to a generator:

def in_range(origin, range, things):
    range_sq = range**2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)

This especially has benefits if you are doing something like:

if any(in_range(origin, max_dist, things)):
    ...

But if the very next thing you are going to do requires a distance,

for nearby in in_range(origin, walking_distance, hotdog_stands):
    print("%s %.2fm" % (nearby.name, distance(origin, nearby)))

consider yielding tuples:

def in_range_with_dist_sq(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq: yield (thing, dist_sq)

This can be especially useful if you might chain range checks ('find things that are near X and within Nm of Y', since you don't have to calculate the distance again).

But what about if we're searching a really large list of things and we anticipate a lot of them not being worth consideration?

There is actually a very simple optimization:

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = (origin.x - thing.x) ** 2
        if dist_sq <= range_sq:
            dist_sq += (origin.y - thing.y) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing

Whether this is useful will depend on the size of 'things'.

def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    if len(things) >= 4096:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2
                if dist_sq <= range_sq:
                    dist_sq += (origin.z - thing.z) ** 2
                    if dist_sq <= range_sq:
                        yield thing
    elif len(things) > 32:
        for things in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
    else:
        ... just calculate distance and range-check it ...

And again, consider yielding the dist_sq. Our hotdog example then becomes:

# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
info = (stand, dist_sq**0.5 for stand, dist_sq in info)
for stand, dist in info:
    print("%s %.2fm" % (stand, dist))

(*a1: sort_things_by_distance's sort key calls distance_sq for every single item, and that innocent looking key is a lambda, which is a second function that has to be invoked...)

魂ガ小子 2024-08-11 11:45:44

此问题解决的另一个实例方法

def dist(x,y):   
    return numpy.sqrt(numpy.sum((x-y)**2))

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)

Another instance of this problem solving method:

def dist(x,y):   
    return numpy.sqrt(numpy.sum((x-y)**2))

a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)
梦中的蝴蝶 2024-08-11 11:45:44

Python 3.8 开始,数学< /a> 模块直接提供 dist 函数,它返回两点之间的欧几里德距离(以元组或坐标列表形式给出):

from math import dist

dist((1, 2, 6), (-2, 3, 2)) # 5.0990195135927845

如果您正在使用列表:

dist([1, 2, 6], [-2, 3, 2]) # 5.0990195135927845

Starting Python 3.8, the math module directly provides the dist function, which returns the euclidean distance between two points (given as tuples or lists of coordinates):

from math import dist

dist((1, 2, 6), (-2, 3, 2)) # 5.0990195135927845

And if you're working with lists:

dist([1, 2, 6], [-2, 3, 2]) # 5.0990195135927845
没有你我更好 2024-08-11 11:45:44

可以像下面这样完成。我不知道它有多快,但它没有使用 NumPy。

from math import sqrt
a = (1, 2, 3) # Data point 1
b = (4, 5, 6) # Data point 2
print sqrt(sum( (a - b)**2 for a, b in zip(a, b)))

It can be done like the following. I don't know how fast it is, but it's not using NumPy.

from math import sqrt
a = (1, 2, 3) # Data point 1
b = (4, 5, 6) # Data point 2
print sqrt(sum( (a - b)**2 for a, b in zip(a, b)))
笑红尘 2024-08-11 11:45:44

一句漂亮的话:

dist = numpy.linalg.norm(a-b)

但是,如果速度是一个问题,我建议在您的机器上进行试验。我发现在我的机器上使用 math 库的 sqrt** 运算符来计算平方比单行 NumPy 快得多解决方案。

我使用这个简单的程序运行测试:

#!/usr/bin/python
import math
import numpy
from random import uniform

def fastest_calc_dist(p1,p2):
    return math.sqrt((p2[0] - p1[0]) ** 2 +
                     (p2[1] - p1[1]) ** 2 +
                     (p2[2] - p1[2]) ** 2)

def math_calc_dist(p1,p2):
    return math.sqrt(math.pow((p2[0] - p1[0]), 2) +
                     math.pow((p2[1] - p1[1]), 2) +
                     math.pow((p2[2] - p1[2]), 2))

def numpy_calc_dist(p1,p2):
    return numpy.linalg.norm(numpy.array(p1)-numpy.array(p2))

TOTAL_LOCATIONS = 1000

p1 = dict()
p2 = dict()
for i in range(0, TOTAL_LOCATIONS):
    p1[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))
    p2[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))

total_dist = 0
for i in range(0, TOTAL_LOCATIONS):
    for j in range(0, TOTAL_LOCATIONS):
        dist = fastest_calc_dist(p1[i], p2[j]) #change this line for testing
        total_dist += dist

print total_dist

在我的机器上,math_calc_dist 的运行速度比 numpy_calc_dist 快得多:1.5 秒与 23.5 秒。

为了获得 fastest_calc_distmath_calc_dist 之间的可测量差异,我必须将 TOTAL_LOCATIONS 增加到 6000。然后 fastest_calc_dist 需要大约 50秒,而 math_calc_dist 大约需要 60 秒。

您还可以尝试使用 numpy.sqrtnumpy.square,尽管它们都比我机器上的 math 替代方案慢。

我的测试是使用 Python 2.6.6 运行的。

A nice one-liner:

dist = numpy.linalg.norm(a-b)

However, if speed is a concern I would recommend experimenting on your machine. I've found that using math library's sqrt with the ** operator for the square is much faster on my machine than the one-liner NumPy solution.

I ran my tests using this simple program:

#!/usr/bin/python
import math
import numpy
from random import uniform

def fastest_calc_dist(p1,p2):
    return math.sqrt((p2[0] - p1[0]) ** 2 +
                     (p2[1] - p1[1]) ** 2 +
                     (p2[2] - p1[2]) ** 2)

def math_calc_dist(p1,p2):
    return math.sqrt(math.pow((p2[0] - p1[0]), 2) +
                     math.pow((p2[1] - p1[1]), 2) +
                     math.pow((p2[2] - p1[2]), 2))

def numpy_calc_dist(p1,p2):
    return numpy.linalg.norm(numpy.array(p1)-numpy.array(p2))

TOTAL_LOCATIONS = 1000

p1 = dict()
p2 = dict()
for i in range(0, TOTAL_LOCATIONS):
    p1[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))
    p2[i] = (uniform(0,1000),uniform(0,1000),uniform(0,1000))

total_dist = 0
for i in range(0, TOTAL_LOCATIONS):
    for j in range(0, TOTAL_LOCATIONS):
        dist = fastest_calc_dist(p1[i], p2[j]) #change this line for testing
        total_dist += dist

print total_dist

On my machine, math_calc_dist runs much faster than numpy_calc_dist: 1.5 seconds versus 23.5 seconds.

To get a measurable difference between fastest_calc_dist and math_calc_dist I had to up TOTAL_LOCATIONS to 6000. Then fastest_calc_dist takes ~50 seconds while math_calc_dist takes ~60 seconds.

You can also experiment with numpy.sqrt and numpy.square though both were slower than the math alternatives on my machine.

My tests were run with Python 2.6.6.

苏辞 2024-08-11 11:45:44

我在 matplotlib.mlab 中找到了一个“dist”函数,但我认为它不够方便。

我把它贴在这里仅供参考。

import numpy as np
import matplotlib as plt

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# Distance between a and b
dis = plt.mlab.dist(a, b)

I find a 'dist' function in matplotlib.mlab, but I don't think it's handy enough.

I'm posting it here just for reference.

import numpy as np
import matplotlib as plt

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# Distance between a and b
dis = plt.mlab.dist(a, b)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文