Grouping objects so that all groups have similar mean attributes

Published 2024-10-07 14:49:29


I have a collection of objects, each of which has a numerical 'weight'. I would like to create groups of these objects such that each group has approximately the same arithmetic mean of object weights.

The groups won't necessarily have the same number of members, but the size of groups will be within one of each other. In terms of numbers, there will be between 50 and 100 objects and the maximum group size will be about 5.

Is this a well-known type of problem? It seems a bit like a knapsack or partition problem. Are efficient algorithms known to solve it?

As a first step, I created a python script that achieves very crude equivalence of mean weights by sorting the objects by weight, subgrouping these objects, and then distributing a member of each subgroup to one of the final groups.
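For reference, a minimal sketch of that kind of sort-then-deal first pass might look like the following; the names, data, and round-robin dealing order are illustrative assumptions, not the actual script:

```python
# Hypothetical sketch of the crude first pass described above: sort by
# weight, then deal items round-robin so each group receives one member
# from each "band" of similar weights. Names and data are assumptions.
def crude_groups(weights, n_groups):
    groups = [[] for _ in range(n_groups)]
    for i, w in enumerate(sorted(weights)):
        groups[i % n_groups].append(w)
    return groups

weights = [12, 7, 3, 19, 5, 11, 8, 2, 15, 9]
for g in crude_groups(weights, 3):
    print(sum(g) / len(g), g)
```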

I am comfortable programming in python, so if existing packages or modules exist to achieve part of this functionality, I'd appreciate hearing about them.

Thank you for your help and suggestions.


Comments (3)

审判长 2024-10-14 14:49:29


The program that follows is a low-cost heuristic. It distributes the values among "buckets", placing large values alongside small ones by choosing values from one end of a sorted list on one round and from the other end on the next. Distributing in round-robin fashion guarantees that the rule about the number of elements per bucket is met. It is a heuristic and not an algorithm because it tends to produce good solutions, but without a guarantee that better ones don't exist.

In theory, if there are enough values and they are uniformly or normally distributed, then just randomly placing the values in the buckets is likely to produce similar bucket means. Since the dataset here is small, this heuristic improves the chances of a good solution. Knowing more about the size and statistical distribution of the datasets would help devise a better heuristic, or an algorithm.

from random import randint, seed
from itertools import cycle, chain

def chunks(q, n):
    # Split the sequence q into consecutive pieces of length n.
    q = list(q)
    for i in range(0, len(q), n):
        yield q[i:i+n]

def shuffle(q, n):
    # Interleave chunks from the low end of q with chunks from the
    # (reversed) high end, so each round of dealing alternates between
    # a band of small values and a band of large values.
    q = list(q)
    m = len(q) // 2
    left = list(chunks(q[:m], n))
    right = list(chunks(reversed(q[m:]), n)) + [[]]
    return chain(*(a + b for a, b in zip(left, right)))

def listarray(n):
    return [list() for _ in range(n)]

def mean(q):
    return sum(q) // len(q)   # integer mean, matching the sample output below

def report(q):
    for x in q:
        print(mean(x), len(x), x)

SIZE = 5
COUNT = 37

#seed(SIZE)
data = [randint(1, 1000) for _ in range(COUNT)]
data = sorted(data)
NBUCKETS = (COUNT + SIZE - 1) // SIZE

order = shuffle(range(COUNT), NBUCKETS)
posts = cycle(range(NBUCKETS))
buckets = listarray(NBUCKETS)
for o in order:
    i = next(posts)            # advance the round-robin pointer
    buckets[i].append(data[o])
report(buckets)
print(mean(data))

Complexity is O(n log n) because of the sorting step. These are sample results:

439 5 [15, 988, 238, 624, 332]
447 5 [58, 961, 269, 616, 335]
467 5 [60, 894, 276, 613, 495]
442 5 [83, 857, 278, 570, 425]
422 5 [95, 821, 287, 560, 347]
442 4 [133, 802, 294, 542]
440 4 [170, 766, 301, 524]
418 4 [184, 652, 326, 512]
440

Note that the requirement on the size of the buckets dominates, which means that the means won't be close if the variance in the original data is large. You can try with this dataset:

data = sorted(data) + [100000]

The bucket containing 100000 will still get at least three more data points.

I came up with this heuristic thinking that it's what a group of kids would do if handed a pack of bills of different denominations and told to share them according to this game's rules. It's statistically reasonable, and O(n log n).

淡墨 2024-10-14 14:49:29


You might try using k-means clustering:

import scipy.cluster.vq as vq
import collections
import numpy as np

def auto_cluster(data, threshold=0.1, k=1):
    # There are more sophisticated ways of determining k
    # See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
    data = np.asarray(data)
    distortion = 1e20
    while distortion > threshold:
        # Re-run k-means with one more cluster until the distortion
        # (mean distance of points to their nearest centroid) is small enough.
        codebook, distortion = vq.kmeans(data, k)
        k += 1
    # Assign each datum to its nearest centroid and collect the groups.
    code, dist = vq.vq(data, codebook)
    groups = collections.defaultdict(list)
    for index, datum in zip(code, data):
        groups[index].append(datum)
    return groups

np.random.seed(784789)
N = 20
weights = 100 * np.random.random(N)
groups = auto_cluster(weights, threshold=1.5, k=N // 5)
for index, data in enumerate(sorted(groups.values(), key=lambda d: np.mean(d))):
    print('{i}: {d}'.format(i=index, d=data))

The code above generates a random sequence of N weights.
It uses scipy.cluster.vq.kmeans to partition the sequence into k clusters of numbers which are close together. If the distortion is above a threshold, k-means is recomputed with k increased by one. This repeats until the distortion falls below the given threshold.

It yields clusters such as this:

0: [4.9062151907551366]
1: [13.545565038022112, 12.283828883935065]
2: [17.395300245930066]
3: [28.982058040201832, 30.032607500871023, 31.484125759701588]
4: [35.449637591061979]
5: [43.239840915978043, 48.079844689518424, 40.216494950261506]
6: [52.123246083619755, 53.895726546070463]
7: [80.556052179748079, 80.925071671718413, 75.211470587171803]
8: [86.443868931310249, 82.474064251040375, 84.088655128258964]
9: [93.525705849369416]

Note that the k-means clustering algorithm uses random guesses to initially pick centers of the k groups. This means that repeated runs of the same code can produce different results, particularly if the weights do not separate themselves into clearly distinct groups.

You'll also have to twiddle the threshold parameter to produce the desired number of groups.
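If a specific number of groups is required, one way to avoid hand-tuning is to search over the threshold. A rough sketch, reusing the auto_cluster function and weights from the snippet above; target_groups and the search bounds are illustrative assumptions, not part of the original answer:

```python
# Hedged sketch: bisect the threshold until auto_cluster() returns the
# desired number of groups. Lower thresholds force more clusters, higher
# thresholds allow fewer, so a simple bisection steers the group count.
def cluster_with_target(weights, target_groups, lo=0.1, hi=50.0, steps=20):
    groups = auto_cluster(weights, threshold=hi, k=1)
    for _ in range(steps):
        mid = (lo + hi) / 2
        groups = auto_cluster(weights, threshold=mid, k=1)
        if len(groups) < target_groups:
            hi = mid    # too few clusters: lower the threshold so k keeps growing
        elif len(groups) > target_groups:
            lo = mid    # too many clusters: raise the threshold to merge groups
        else:
            break
    return groups

target = cluster_with_target(weights, target_groups=5)
print(len(target), 'groups')
```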

你丑哭了我 2024-10-14 14:49:29


You could also try a centroid-based linkage algorithm, which achieves the same.

See this for the code and this for understanding.

UPGMA (aka centroid-based) is what you probably want to do.
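For a concrete starting point, a minimal sketch of centroid linkage with scipy.cluster.hierarchy might look like this; the sample weights and the three-cluster cut are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical example: build a centroid-linkage tree over 1-D weights and
# cut it into a fixed number of clusters. Values and the cut are assumptions.
weights = np.array([15.0, 988.0, 238.0, 624.0, 332.0, 58.0, 961.0, 269.0])
Z = linkage(weights.reshape(-1, 1), method='centroid')
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters

for c in sorted(set(labels)):
    members = weights[labels == c]
    print(c, members.mean(), members.tolist())
```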
