How to select numeric samples based on their distance from already selected samples (Python)

Posted on 2025-01-15 01:12:37


I have some random test data in a 2D array of shape (500,2) as such:

import numpy as np

# randint works on integers, so the lower bound is written as 0 rather than 0.1
xy = np.random.randint(low=0, high=1000, size=[500, 2])

From this array, I first select 10 random samples. To select the 11th sample, I would like to pick the one that is furthest away from the original 10 selected samples collectively, using the Euclidean distance. I need to keep doing this until a certain number of samples have been picked. Here is my attempt:

# Function to get the distance between samples
def get_dist(a, b):

    return np.sqrt(np.sum(np.square(a - b)))


# Set up variables and empty lists for the selected sample and starting samples
n_xy_to_select = 120
selected_xy = []
starting = []


# This selects 10 random samples and appends them to selected_xy
for i in range(10):
    idx = np.random.randint(len(xy))
    starting_10 = xy[idx, :]
    selected_xy.append(starting_10)
    starting.append(starting_10)
    xy = np.delete(xy, idx, axis = 0)
starting = np.asarray(starting)


# This performs the selection based on the distances
for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Get the distance between each already selected sample, and every other unselected sample
        dists_ = np.array([get_dist(selected_xy_, xy_) for xy_ in xy])
        # Apply some kind of penalty function - this is the key
        dists_[dists_ < 90] -= 25000
        # Sum dists_ onto dists
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis = 0)

The key is this line, the penalty function:

dists_[dists_ < 90] -= 25000

This penalty function exists to stop the code from just picking a ring of samples at the edge of the space. It does so by artificially shortening the distances between samples that are close together.
However, this eventually breaks down and the selection starts clustering, as shown in the image. You can clearly see that the code could make much better selections before any kind of clustering becomes necessary. I feel that some kind of decaying exponential function would work best here, but I do not know how to implement it.
So my question is: how would I change the current penalty function to get what I'm looking for?
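
For illustration, here is one way the decaying exponential mentioned above could be written. This is only a sketch, not a tested answer to the question: `exp_penalty`, `strength` and `scale` are hypothetical names, and the defaults simply reuse the 25000 and 90 from the hard-threshold line as starting values that would need tuning.

```python
import numpy as np

def exp_penalty(dists, strength=25000.0, scale=90.0):
    """Subtract a penalty that decays smoothly with distance.

    strength: penalty applied at distance 0 (hypothetical tuning knob)
    scale:    distance over which the penalty drops by a factor of e
    """
    dists = np.asarray(dists, dtype=float)
    return dists - strength * np.exp(-dists / scale)

# Close neighbours are penalised heavily, distant ones barely at all
d = np.array([0.0, 45.0, 90.0, 180.0, 500.0])
penalized = exp_penalty(d)
print(penalized)
```

In the loop above it would stand in for the threshold line, i.e. `dists_ = exp_penalty(dists_)` instead of `dists_[dists_ < 90] -= 25000`. Unlike the hard cutoff, the penalty has no jump at 90, so the ranking of candidates changes gradually with distance.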


尤怨 2025-01-22 01:12:37

From your question, I understand that what you are looking for are Periodic Boundary Conditions (PBC). This means that a point at the left edge of your space sits right next to a point at the right edge. Thus, the maximal distance you can get along one axis is half of the box (i.e. the distance between the edge and the center).

To take the PBC into account, you need to compute the distance on each axis and, whenever it exceeds half of the box, measure it the other way around the box instead:
For example, if you have a point with x1 = 100 and a second one with x2 = 900 in a box of size 1000, using the PBC they are 200 units apart: 1000 - |x1 - x2| = 1000 - 800 = 200. In the general case, given two coordinates inside a box of side box_size, you end up with:

$\Delta x = \min\left(|x_1 - x_2|,\ box_\mathrm{size} - |x_1 - x_2|\right)$

In your case (box_size = 1000) this simplifies to:

delta_x[delta_x > 500] = 1000 - delta_x[delta_x > 500]
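
As a quick check of the x1 = 100, x2 = 900 example, here is a minimal sketch of the wrap-around distance along one axis. It assumes the minimum-image convention (the shorter way around the box) and coordinates inside a box of side 1000. `periodic_delta` is a hypothetical helper name, not part of the code in this answer.

```python
import numpy as np

def periodic_delta(a, b, box_size=1000):
    """Shortest separation between a and b along one periodic axis."""
    d = np.abs(np.asarray(a) - np.asarray(b)) % box_size
    # Going the other way around the box is shorter once d exceeds half the box
    return np.minimum(d, box_size - d)

print(periodic_delta(100, 900))  # 200
print(periodic_delta(100, 400))  # 300
```

With this convention the separation along one axis never exceeds half the box, 500 here.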

To wrap it up, I rewrote your code using a new distance function (note that I removed some unnecessary for loops):

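import numpy as np

def distance(p, arr, box_size=1000):
    # Per-axis separation between point p and every row of arr
    delta_x = np.abs(p[0] - arr[:, 0])
    delta_y = np.abs(p[1] - arr[:, 1])
    # PBC: a separation larger than half the box is shorter the other way around
    half = box_size / 2
    delta_x = np.where(delta_x > half, box_size - delta_x, delta_x)
    delta_y = np.where(delta_y > half, box_size - delta_y, delta_y)
    return np.sqrt(delta_x**2 + delta_y**2)

xy = np.random.randint(low=0, high=1000, size=[500, 2])
# Draw 10 distinct starting indices (randint could return the same index twice)
idx = np.random.choice(len(xy), size=10, replace=False)
selected_xy = list(xy[idx])
_initial_selected = xy[idx]
xy = np.delete(xy, idx, axis=0)
n_xy_to_select = 120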


for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Compute the distance taking into account the PBC
        dists_ = distance(selected_xy_, xy)
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis = 0)
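
As a side note, the inner loop recomputes the distances to every selected point on each pass. A possible variant, sketched below under the same minimum-image assumption, keeps a running sum of distances per point and only adds the contribution of the newest pick. `periodic_distance`, `greedy_select` and their parameters are hypothetical names introduced for this sketch, and it returns indices into the original array instead of deleting rows.

```python
import numpy as np

def periodic_distance(p, arr, box_size=1000):
    """Euclidean distance from p to each row of arr with wrap-around axes."""
    delta = np.abs(arr - p) % box_size
    delta = np.minimum(delta, box_size - delta)
    return np.sqrt((delta ** 2).sum(axis=1))

def greedy_select(xy, n_start=10, n_extra=119, box_size=1000, seed=0):
    """Pick n_start random rows, then n_extra rows that each maximise the
    summed periodic distance to everything chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(i) for i in rng.choice(len(xy), size=n_start, replace=False)]
    # score[j] = sum of distances from point j to all chosen points
    score = np.zeros(len(xy))
    for i in chosen:
        score += periodic_distance(xy[i], xy, box_size)
    score[chosen] = -np.inf  # already chosen points can never win argmax
    for _ in range(n_extra):
        best = int(np.argmax(score))
        chosen.append(best)
        score += periodic_distance(xy[best], xy, box_size)
        score[best] = -np.inf
    return np.array(chosen)

rng = np.random.default_rng(1)
points = rng.integers(0, 1000, size=(500, 2))
picked = greedy_select(points)
print(len(picked))  # 129: the 10 starting points plus 119 picks
```

The default `n_extra=119` mirrors `range(n_xy_to_select - 1)` in the loop above; the selection criterion itself is unchanged, only the bookkeeping differs.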

Running the rewritten loop does indeed create clusters, and this is expected: the selection tends to build clusters of points that sit at the maximal distance from each other. On top of that, the boundary conditions cap the distance between 2 points along one axis at 500, so the maximal distance between two clusters is also 500. As you can see in the image, this is the case.

Moreover, picking more points starts to draw lines that connect the different clusters, starting from the central one, as you can see here:
