Number clustering/partitioning algorithm

Posted on 2024-12-15 20:44:19

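I have a sorted, one-dimensional array of numbers. Both the length of the array and the values in it are arbitrary. I want to split the array into k partitions based on the values. For example, say I want 4 partitions distributed as 30% / 30% / 20% / 20%, i.e. the lowest 30% of the values first, then the next 30%, and so on. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it must not end up in two different partitions. This means the percentages above are not strict; they are "targets" or "starting points", if you will.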

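For example, suppose my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].

I choose k = 4, and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.

Given the constraints above, the resulting partitions should be: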

A = [1] B = [5, 5] C = [6, 7] D = [8, 8, 8, 8, 8]

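The resulting (realized/adjusted) percentages are pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%.

It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.

So, is there an algorithm for this kind of clustering?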

爱你是孤单的心事 2024-12-22 20:44:19

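Clustering algorithms are meant for multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.

Sort the data. Then partition the data set linearly, working from the bottom of the array to the top, as in your example.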
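
A minimal sketch of such a linear pass, to make the idea concrete. The function name `linear_partition` and the exact stopping rule are assumptions for illustration, not something this answer specifies: each part takes whole runs of equal values while it stays within its target size (but always at least one run), and the last part takes whatever is left.

```python
from itertools import groupby


def linear_partition(values, fractions):
    """Greedy bottom-to-top split that never separates equal values.

    'fractions' holds the target share of each part, e.g. [0.25] * 4.
    """
    data = sorted(values)
    n = len(data)
    # Runs of equal values must stay together, so work on whole runs.
    runs = [list(group) for _, group in groupby(data)]
    parts = []
    pos = 0  # index of the next unassigned run
    for fraction in fractions[:-1]:
        target = fraction * n
        part = []
        # Assumed rule: take runs while the part stays within its target,
        # but always take at least one run if any are left.
        while pos < len(runs) and (not part or len(part) + len(runs[pos]) <= target):
            part.extend(runs[pos])
            pos += 1
        parts.append(part)
    # The last part takes everything that remains.
    parts.append([v for run in runs[pos:] for v in run])
    return parts


print(linear_partition([1, 5, 5, 6, 7, 8, 8, 8, 8, 8], [0.25] * 4))
# [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]
```

On the array from the question this reproduces the partitions the asker expects. Being greedy, it makes no claim of optimality: on other inputs it can leave the last part oversized or empty.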

雨夜星沙 2024-12-22 20:44:19

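Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.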

def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    Returns (cost, part_sizes), where cost is the sum of squared
    differences between the ideal and the chosen part sizes.
    """
    key = (i, sizes)
    if key in cache:
        return cache[key]
    if len(sizes) == 1:
        # The last part takes everything that is left.
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    # Try every length j for the first part (an empty part is allowed).
    for j in range(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4
print(partitions(ar, 0, sizes, {}))  # (9.0, (1, 2, 2, 5))
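
The result only contains the sizes of the parts. To turn those sizes back into the actual partitions and their realized percentages, a small helper along these lines should do; `split_by_sizes` is a hypothetical name introduced here for illustration, and it takes the size tuple reported above as its input.

```python
def split_by_sizes(a, sizes):
    """Cut list 'a' into consecutive slices with the given integer sizes."""
    parts, start = [], 0
    for size in sizes:
        parts.append(a[start:start + size])
        start += size
    return parts


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
parts = split_by_sizes(ar, (1, 2, 2, 5))  # sizes reported by the DP above
shares = [100.0 * len(p) / len(ar) for p in parts]
print(parts)   # [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]
print(shares)  # [10.0, 20.0, 20.0, 50.0]
```

These are the partitions A to D and the realized percentages given in the question.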
