Number clustering/partitioning algorithm
I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions, according to the number values, e.g. let's say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two different partitions. This means that the distribution percentages above are not strict, but rather the "goals" or "starting points" if you wish.
For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].
I choose k = 4, and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.
Given the constraints I gave above, the resulting partitions should be:
A = [1]
B = [5, 5]
C = [6, 7]
D = [8, 8, 8, 8, 8]
with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%
It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.
So, is there an algorithm for this kind of clustering?
Clustering algorithms are used on multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.
Sort the data. Then partition the data-set linearly working from the bottom of the array to the top, as per your example.
Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.
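The code this answer refers to is not reproduced on this page. As a rough sketch of what such a dynamic program might look like (not the answerer's original code), the following groups equal values into runs and then picks where to cut between runs so that the sum of squared size errors is smallest. The function name `best_partition`, its `fractions` parameter, and the rule that no part may be empty are assumptions made for this illustration.

```python
from collections import Counter
from functools import lru_cache


def best_partition(values, fractions):
    """Return (error, sizes) minimizing the sum of squared size errors.

    Assumptions for this sketch: equal values always stay together,
    and every part receives at least one distinct value.
    """
    n = len(values)
    targets = [f * n for f in fractions]
    k = len(targets)
    # Sizes of the runs of equal values, in ascending value order.
    runs = [count for _, count in sorted(Counter(values).items())]
    if k > len(runs):
        raise ValueError("more parts than distinct values")

    @lru_cache(maxsize=None)
    def solve(run_idx, part_idx):
        """Best (error, sizes) for runs[run_idx:] split into parts part_idx..k-1."""
        if part_idx == k - 1:
            # The last part takes every remaining run.
            size = sum(runs[run_idx:])
            return ((targets[part_idx] - size) ** 2, (size,))
        best = None
        size = 0
        # Leave at least one run for each of the remaining parts.
        last = len(runs) - (k - 1 - part_idx)
        for end in range(run_idx, last):
            size += runs[end]
            rest_err, rest_sizes = solve(end + 1, part_idx + 1)
            cand = ((targets[part_idx] - size) ** 2 + rest_err,
                    (size,) + rest_sizes)
            if best is None or cand < best:
                best = cand
        return best

    return solve(0, 0)


print(best_partition([1, 5, 5, 6, 7, 8, 8, 8, 8, 8], [0.25] * 4))
# prints (9.0, (1, 2, 2, 5))
```

The memoized `solve` is called for at most (number of runs) x k states, so the search stays polynomial even though the number of possible cuts grows quickly.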
The naive approach would go like this:
Say p1...pk are the percentages for your partitions (p1+...+pk = 1)
Say you have N elements in the array
The initial boundaries (there are k+1 of them, including the array ends, since you have k partitions) are:
0, p1*N, (p1+p2)*N, ..., N (there'll be some rounding to do).
To move the boundaries, you look at the two array elements on either side of each boundary (for the k-1 boundaries that you can move). If the two elements are equal, you need to move the boundary, either left or right, at least until the constraint is satisfied. A naive approach would be to start on the left and make minimal adjustments (just move the boundary to the side that causes the least movement, and don't move it any further).
This algorithm doesn't cover the whole space of partitions though. It just gives you one solution. To find the best solution, you'd need to do a brute-force search on the entire partition space, with some kind of pruning (e.g. dynamic programming, where you remember the best partitioning for a subarray of the initial array).
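The naive boundary adjustment described above could be sketched roughly as follows. The function name `naive_partition`, the tie-breaking rule (move left when both directions cost the same) and the use of Python's built-in `round` (which rounds halves to the nearest even integer) are assumptions of this sketch, not part of the answer.

```python
def naive_partition(values, fractions):
    """Snap rounded target boundaries so that no run of equal values is split."""
    values = sorted(values)
    n = len(values)
    bounds = [0]
    cumulative = 0.0
    # Only the k-1 inner boundaries can move; 0 and n are fixed.
    for fraction in fractions[:-1]:
        cumulative += fraction
        b = round(cumulative * n)
        if 0 < b < n and values[b - 1] == values[b]:
            # The boundary splits a run of equal values: find the run's ends.
            left = b
            while left > 0 and values[left - 1] == values[b]:
                left -= 1
            right = b
            while right < n and values[right] == values[b]:
                right += 1
            # Move to whichever end is closer (ties go left: an assumption).
            b = left if b - left <= right - b else right
        # Never let a boundary fall behind the previous one.
        bounds.append(max(b, bounds[-1]))
    bounds.append(n)
    return [values[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


print(naive_partition([1, 5, 5, 6, 7, 8, 8, 8, 8, 8], [0.25] * 4))
```

On the question's example this sketch yields [1], [5, 5, 6, 7], [8, 8, 8, 8, 8] and an empty fourth part, which illustrates the point made above: the naive adjustment produces one valid solution, not necessarily the best one.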