Minimum Euclidean distance between points in two different Numpy arrays, not within

Posted on 2024-08-14 05:25:32

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array and all the points in the other array. The arrays are not necessarily the same size. For example:

xy1=numpy.array(
[[  243,  3173],
[  525,  2997]])

xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])

My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.

mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))

for i,xy in enumerate(xy1):
    dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
    mindist[i],minid[i]=dists.min(),dists.argmin()

Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.

Another way to look at the problem: say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate an n x n distance matrix from those coordinates, from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?

恋竹姑娘 2024-08-21 05:25:32

(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y with 2, 3, ... dimensions.
It also supports 22 different metrics, detailed
in the scipy.spatial.distance.cdist documentation.

# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
# (Python 3 version: print is a function, true division is the default)

import sys
import numpy as np
from scipy.spatial.distance import cdist

#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1

# change these params in sh or ipython: run this.py dim=3 ...
# (exec on command-line arguments is only acceptable in a throwaway script)
for arg in sys.argv[1:]:
    exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )

title = "%s  dim %d  nx %d  ny %d  metric %s" % (
        __file__, dim, nx, ny, metric )
print( "\n" + title )

#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric )  # -> (nx, ny) distances
#...............................................................................

print( "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
        X.shape, Y.shape, dist.shape ))
print( "dist average %.3g +- %.2g" % (dist.mean(), dist.std()) )
# cdist returns a (1, 1) array here, so pick out its single element
print( "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
        dist[0,3], cdist( [X[0]], [Y[3]] )[0,0] ))


# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim      ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2      ~ .4 +- .2/sqrt dim )
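
Applied to the two small arrays from the question, the same cdist call gives the per-point minimum directly. This is only a minimal sketch: the names mindist and minid are borrowed from the question, and the snippet is not part of the script above.

```python
import numpy as np
from scipy.spatial.distance import cdist

xy1 = np.array([[243, 3173],
                [525, 2997]])
xy2 = np.array([[682, 2644],
                [277, 2651],
                [396, 2640]])

dist = cdist(xy1, xy2)        # (2, 3) matrix: row i = point i of xy1, column j = point j of xy2
mindist = dist.min(axis=1)    # smallest distance from each xy1 point to any xy2 point
minid = dist.argmin(axis=1)   # index in xy2 of that nearest point

print(mindist)
print(minid)
```

Taking the minimum along axis=0 instead would answer the reverse question (nearest xy1 point for each xy2 point).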

暮光沉寂 2024-08-21 05:25:32

To compute the m by p matrix of distances, this should work:

>>> def distances(xy1, xy2):
...   d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
...   d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
...   return numpy.hypot(d0, d1)

The two .outer calls make two m by p matrices (of scalar differences along the x and y axes), and the .hypot call turns those into a same-shape matrix (of scalar Euclidean distances).
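
As a rough usage sketch, the function can be combined with min and argmin to get the row-wise or column-wise minima the question asks about. The variable names nearest_in_xy2 and nearest_in_xy1 are illustrative and not from the answer.

```python
import numpy as np

def distances(xy1, xy2):
    # m x p matrices of coordinate differences, one per axis
    d0 = np.subtract.outer(xy1[:, 0], xy2[:, 0])
    d1 = np.subtract.outer(xy1[:, 1], xy2[:, 1])
    return np.hypot(d0, d1)

xy1 = np.array([[243, 3173], [525, 2997]])
xy2 = np.array([[682, 2644], [277, 2651], [396, 2640]])

d = distances(xy1, xy2)              # shape (2, 3)
nearest_in_xy2 = d.argmin(axis=1)    # one entry per point of xy1
nearest_in_xy1 = d.argmin(axis=0)    # one entry per point of xy2
mindist = d.min(axis=1)              # minimum distance for each point of xy1
```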

独行侠 2024-08-21 05:25:32

The accepted answer does not fully address the question, which asks for the minimum distance from each point in one set to the other set, not the distance between every pair of points in the two sets.

Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.

All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but a solution can be written whose queries scale as m*log(p), producing huge savings for large xy2 datasets.

This execution time scaling can be achieved using scipy.spatial.KDTree as follows:

import numpy as np
from scipy import spatial

xy1 = np.array(
    [[243,  3173],
     [525,  2997]])

xy2 = np.array(
    [[682, 2644],
     [277, 2651],
     [396, 2640]])

# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)

# This solution by @denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)

where mindist is the minimum distance between each point in xy1 and the set of points in xy2, and minid (returned by tree.query) is the index in xy2 of that nearest point.
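
To gain some confidence that the tree query and the brute-force matrix agree, one can compare them on random data. The following is a sketch under assumptions: the sizes m and p, the coordinate range, and the seed are arbitrary choices, not values from the answer.

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
m, p = 500, 2000
xy1 = rng.uniform(0, 1000, size=(m, 2))
xy2 = rng.uniform(0, 1000, size=(p, 2))

# Tree-based nearest neighbour: build once on xy2, then query every xy1 point
tree = spatial.KDTree(xy2)
tree_dist, tree_id = tree.query(xy1)

# Brute force: full (m, p) distance matrix, then the row-wise minimum
full = spatial.distance.cdist(xy1, xy2)
brute_dist = full.min(axis=1)

# Both checks are expected to print True
print(np.allclose(tree_dist, brute_dist))
print(np.allclose(full[np.arange(m), tree_id], brute_dist))
```

The brute-force version also needs memory for all m*p distances, which the tree query avoids.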

笔芯 2024-08-21 05:25:32

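import numpy as np
# squared norm of every xy1 row plus squared norm of every xy2 row -> shape (m, p)
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
# dot product between every xy1 row and every xy2 row -> shape (m, p)
N = np.dot(xy1, xy2.T)
# |x - y|^2 = |x|^2 + |y|^2 - 2 x.y; clip tiny negative rounding errors before the sqrt
dists = np.sqrt(np.maximum(P - 2*N, 0))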
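
The snippet relies on expanding the squared Euclidean norm. Written out, with x_i denoting the i-th row of xy1 and y_j the j-th row of xy2 (notation introduced here for illustration, not taken from the answer):

```latex
\lVert x_i - y_j \rVert^2
  = (x_i - y_j)\cdot(x_i - y_j)
  = \lVert x_i \rVert^2 + \lVert y_j \rVert^2 - 2\, x_i \cdot y_j ,
\qquad
P_{ij} = \lVert x_i \rVert^2 + \lVert y_j \rVert^2 ,
\quad
N_{ij} = x_i \cdot y_j ,
\quad
\mathrm{dists}_{ij} = \sqrt{P_{ij} - 2 N_{ij}} .
```

In floating point, P - 2N can come out very slightly negative for (nearly) coincident points, in which case the square root yields NaN unless the value is clipped at zero first.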

秋意浓 2024-08-21 05:25:32

For what you're trying to do:

dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis] - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)

Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:

dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])

夜深人未静 2024-08-21 05:25:32

I think the following function also works.

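import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
    Y = X if Y is None else Y
    xx = (X ** 2).sum(axis = 1)[:, None]   # squared norms of the rows of X, shape (m, 1)
    yy = (Y ** 2).sum(axis = 1)[None, :]   # squared norms of the rows of Y, shape (1, n)
    sq = xx + yy - 2 * (X @ Y.T)           # squared distances, shape (m, n)
    # clip tiny negative rounding errors, then take the root to get distances
    return np.sqrt(np.maximum(sq, 0))

Explanation

Suppose each row of X and each row of Y holds the coordinates of one point of the respective set.

Let their sizes be m X p and n X p respectively, where p is the number of coordinates per point.

The result is a numpy array of size m X n whose (i, j)-th entry is the Euclidean distance between the i-th row of X and the j-th row of Y.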
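
One way to sanity-check an implementation of this identity is to compare it with scipy.spatial.distance.cdist on random input. The sketch below uses a hypothetical helper name, pairwise_dist_checked, and arbitrary sizes and seed; it takes the square root explicitly so that the result is a distance rather than a squared distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_dist_checked(X, Y):
    """Euclidean distances via |x|^2 + |y|^2 - 2 x.y (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    xx = (X ** 2).sum(axis=1)[:, None]   # shape (m, 1)
    yy = (Y ** 2).sum(axis=1)[None, :]   # shape (1, n)
    sq = xx + yy - 2 * (X @ Y.T)         # squared distances, shape (m, n)
    return np.sqrt(np.maximum(sq, 0))    # clip rounding noise before the root

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(5, 3))   # 5 points in 3 dimensions
Y = rng.uniform(0, 1, size=(7, 3))   # 7 points in 3 dimensions

D = pairwise_dist_checked(X, Y)
print(D.shape)
print(np.allclose(D, cdist(X, Y)))
```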

预谋 2024-08-21 05:25:32

I highly recommend using a built-in method instead of computing squares and roots by hand, since the built-ins are optimized and much safer against overflow.

@alex's answer below is the safest in terms of overflow and should also be very fast. Also, for single points you can use math.hypot, which supports more than 2 dimensions since Python 3.8.

>>> def distances(xy1, xy2):
...   d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
...   d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
...   return numpy.hypot(d0, d1)

Safety concerns

import math
import numpy as np

i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# 1.7320508075688773e+200
# (use np.hypot for arrays of 2d points)

np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square

overflow/underflow/speeds
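
The same effect can be reproduced with arrays of 2d points. This is a small sketch with made-up values (a single point with coordinates of magnitude 1e200 measured from the origin), chosen only to trigger the overflow:

```python
import numpy as np

big = np.array([[1e200, 1e200]])   # one 2d point with huge coordinates
origin = np.zeros((1, 2))
d = big - origin                   # coordinate differences, shape (1, 2)

# Naive formula: squaring 1e200 exceeds the float64 range, so the result is inf
with np.errstate(over="ignore"):
    naive = np.sqrt((d ** 2).sum(axis=1))

# np.hypot avoids the intermediate overflow and stays finite
safe = np.hypot(d[:, 0], d[:, 1])

print(naive, safe)
```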

万劫不复 2024-08-21 05:25:32

I think that the most straightforward and efficient solution is to do it like this:

distances = np.linalg.norm(xy1[:, None, :] - xy2[None, :, :], axis=2) # calculate the euclidean distances between every point in xy1 and every point in xy2; shape (m, p).
min_dist = np.min(distances, axis=1) # get the minimum distance for each point in xy1
min_id = np.argmin(distances, axis=1) # get the index of the point in xy2 with the minimum distance, i.e., the nearest one.

情绪 2024-08-21 05:25:32

Although many answers here are great, there is another way which has not been mentioned here: using numpy's vectorization / broadcasting properties to compute the distance between each pair of points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elegantly while remaining very efficient.

Assuming you have two arrays like so:

# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)

You can't do the operation a-b: numpy complains with "operands could not be broadcast together with shapes (6,2) (4,2)". The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along. By keeping the dimension 2 as the last axis in both reshaped arrays, numpy performs the subtraction coordinate by coordinate for every pair of points.

deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the squared distance between each pair of points
distance_matrix = (deltas ** 2).sum(axis=2)
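
The reshape calls hard-code the sizes 6 and 4. An equivalent sketch that works for any number of points inserts the extra axes with None (an alias for numpy.newaxis) and takes the square root to obtain actual Euclidean distances. The names sq_dist and dist, and the seeded generator, are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 2))
b = rng.standard_normal((4, 2))

# (6, 1, 2) - (1, 4, 2) broadcasts to (6, 4, 2): one difference vector per pair
deltas = a[:, None, :] - b[None, :, :]

sq_dist = (deltas ** 2).sum(axis=-1)   # squared distances, shape (6, 4)
dist = np.sqrt(sq_dist)                # Euclidean distances, shape (6, 4)

# The nearest point is the same whether squared or true distances are used
closest = sq_dist.argmin(axis=1)
```

Note that deltas holds m*p*dim numbers, so for very large arrays this intermediate can dominate memory use.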

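The distance_matrix has shape (6,4): for each point in a, the squared distances to all points in b are computed (apply numpy.sqrt to the matrix if the actual Euclidean distances are needed; the location of the minimum is the same either way). Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do: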