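What algorithm splits a sequence of numbers into n subsets so as to minimize the standard deviation of the sums of the numbers in each subset?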

Posted on 2024-08-19 12:04:55

I'm looking for an algorithm to segment a sequence of positive numbers into n subsequences, such that the standard deviation of the sums of the numbers in each subsequence is minimized.

The ordering of the numbers in each subsequence needs to be the same as the ordering in the original sequence.

For example:

Suppose I have a sequence {1,1,1,1,1,1,10,1} that I want to segment into 2 subsequences.
I believe the optimal solution would be {1,1,1,1,1,1}, {10,1}.

The sum of the 1st subsequence is 6, and the sum of the 2nd subsequence is 11.
The (sample) standard deviation of the two numbers is ~3.5, which I believe is the lowest possible.

Suppose I have a sequence {4,1,1,1,1,6} that I want to segment into 3 subsequences.
I believe the optimal solution would be {4}, {1,1,1,1}, {6}.
The sums of the subsequences are 4, 4, and 6.
The (sample) standard deviation of the 3 numbers is ~1.15, which I believe is the lowest possible.
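The two figures quoted above can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic: the helper name `chunk_sum_stdev` is made up for illustration, and it assumes the sample standard deviation (divisor N-1) is the intended one, since that is what matches ~3.5 and ~1.15.

```python
from statistics import stdev

def chunk_sum_stdev(chunks):
    """Sample standard deviation (divisor N-1) of the per-chunk sums."""
    return stdev([sum(chunk) for chunk in chunks])

# The two splits proposed in the question.
print(round(chunk_sum_stdev([[1, 1, 1, 1, 1, 1], [10, 1]]), 2))  # 3.54
print(round(chunk_sum_stdev([[4], [1, 1, 1, 1], [6]]), 2))       # 1.15
```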

The best algorithm I was able to come up with is to compute the cumulative sums of the sequence and to segment the sequence at each multiple of totalSum/numSubsequences.

For example, given the sequence {4,1,1,1,1,6}, the cumulative sums are {4,5,6,7,8,14}. The total of all numbers in the sequence is 14, so, given that I want 3 subsequences, I should segment the sequence when the cumulative total reaches 14/3 ≈ 4.67 and 2 * 14/3 ≈ 9.33.

However, there is no actual place in the sequence where the cumulative total equals 4.67: the first cumulative total is 4, and the next is 5. So should I round up or round down? In this case, rounding down to 4 gives the optimal solution, but that isn't always the case. The best I can think of is to try every combination of rounding up and down, but that results in O(2^numSubsequences) complexity.
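For small inputs, a candidate answer can be checked against plain exhaustive search over the cut positions. The sketch below is not the rounding scheme described above. It is a hypothetical baseline (the name `brute_force_split` is invented here) that tries every way of placing n-1 cuts, so it examines C(L-1, n-1) splits for a sequence of length L and is only suitable for testing. It uses the population standard deviation, which has the same minimizer as the sample version and is also defined for n = 1.

```python
from itertools import combinations
from statistics import pstdev

def brute_force_split(seq, n):
    """Try every placement of n-1 cuts; return the split whose chunk sums
    have the lowest standard deviation."""
    best = None
    for cuts in combinations(range(1, len(seq)), n - 1):
        bounds = (0, *cuts, len(seq))
        chunks = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
        score = pstdev([sum(chunk) for chunk in chunks])
        if best is None or score < best[0]:
            best = (score, chunks)
    return best[1]

print(brute_force_split([1, 1, 1, 1, 1, 1, 10, 1], 2))
print(brute_force_split([4, 1, 1, 1, 1, 6], 3))
```

On the two examples from the question this returns the splits the question proposes, which supports the claim that they are optimal.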

This seems to be the type of thing that would have a preexisting algorithm to apply; however, my Googling has failed me. I am aware of the Partition Problem, which is NP-complete, but that deals with unordered sets, not ordered sequences.

Any help would be appreciated.

追我者格杀勿论 2024-08-26 12:04:55

Suppose the length of the original sequence is L and the number of subsequences is N.

You may simplify the expression for standard deviation to get sqrt(E[X^2] - E[X]^2), where E denotes expectation/average and X denotes your random variable, which in your case is the sum of a subsequence. (A similar formula applies for the "sample standard deviation".) Note that E[X] does not depend on how you split your sequence, because it is always the total sum divided by N. Thus, we just want to minimize E[X^2] or, equivalently, the sum of the X^2 (the two differ by a factor of N, by the definition of average).
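Written out, the reduction looks as follows. The symbols X_1, ..., X_N (the subsequence sums) and S (the total of the whole sequence) are notation introduced here for illustration:

```latex
\sigma^2
  = \frac{1}{N}\sum_{j=1}^{N} X_j^2 - \Bigl(\frac{1}{N}\sum_{j=1}^{N} X_j\Bigr)^{2}
  = \frac{1}{N}\sum_{j=1}^{N} X_j^2 - \frac{S^2}{N^2},
\qquad
s^2 = \frac{1}{N-1}\Bigl(\sum_{j=1}^{N} X_j^2 - \frac{S^2}{N}\Bigr).
```

Since S and N are fixed by the input, both the population variance and the sample variance are increasing functions of the sum of the X_j^2, so minimizing that sum minimizes either standard deviation.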

At this point, we can see that this problem can be solved with dynamic programming. Let f(i,j), for i from 0 to L and j from 1 to N, be the minimal sum of squares of the subsequence sums over all splits of the first i elements of your sequence into j subsequences. Then f(i,j) may be computed in terms of the f(i',j') with i' <= i and j' < j. More specifically, if your sequence is a[k], indexed from 0 to L-1:

f(i,1) = sum( a[k] for 0 <= k < i )^2
f(i,j) = minimum of  f(l,j-1) + sum( a[k] for l <= k < i )^2  for l from 0 to i

Having computed f(L,N), you can use standard dynamic programming techniques to recover the splits. In particular, you can store the l that minimizes f(i,j).

The runtime of this solution is O(L^2 N) because you compute O(L N) different values of f and the minimum is over O(L) different values of l.

Here's a straightforward implementation in Perl:

#!/usr/bin/perl

use strict;
use warnings;

local $\ = $/;
print join ", ", map {"@$_"} best( 2, qw(1 1 1 1 1 1 10 1) );
# prints "1 1 1 1 1 1, 10 1"

print join ", ", map {"@$_"} best( 3, qw(4 1 1 1 1 6) );
# prints "4, 1 1 1 1, 6"

sub best {
    my( $N, @a ) = @_;

    my( @f, @g, $i, $j, $k, $sum );

    # DP base case
    $sum = 0;
    $f[0][1] = $g[0][1] = 0;
    for $i ( 1 .. @a ) {
        $sum += $a[$i-1];
        $f[$i][1] = $sum * $sum;
        $g[$i][1] = 0;
    }

    # DP recurrence
    for $j ( 2 .. $N ) {
        $f[0][$j] = $g[0][$j] = 0;
        for $i ( 1 .. @a ) {
            $sum = 0;
            $f[$i][$j] = $f[$i][$j-1];
            $g[$i][$j] = $i;
            for $k ( reverse 0 .. $i-1 ) {
                $sum += $a[$k];
                if( $f[$i][$j] > $f[$k][$j-1] + $sum * $sum ) {
                    $f[$i][$j] = $f[$k][$j-1] + $sum * $sum;
                    $g[$i][$j] = $k;
                }
            }
        }
    }

    # Extract best expansion
    my( @result );
    $i = @a; $j = $N;

    while( $j ) {
        $k = $g[$i][$j];
        unshift @result, [@a[$k .. $i-1]];
        $i = $k;
        $j--;
    }

    return @result;
}

凡尘雨 2024-08-26 12:04:55

One idea that comes to my mind is to use the A* search algorithm.

More about that:

http://en.wikipedia.org/wiki/A*_search_algorithm

Good book to read about that:

Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig

Some things you could use for the A*:

  • Initial State: Split the sequence into n subsequences that are as equal as possible
  • Next State: For each subset, add the number to its left or right (the last number of subset i-1 (if i != 0) or the first number of subset i+1 (if i != n)) to it (to create all descendant nodes of the current state node)
  • Heuristic: The optimal value would be the mean of all values. It's admissible, so it can be used with A*.

I'm not sure this will really help you with your problem, since I have not solved this problem before, but I think it might do pretty well. It may not be the most sophisticated solution for this specific problem, but it is surely better than any "try all combinations" approach. It is also sound and complete (because of the admissible heuristic).

If you have more questions about this, ask and I will try my best to help you.

豆芽 2024-08-26 12:04:55

I think you mean divide into contiguous chunks, or in other words find n-1 places at which to cut the sequence into pieces. (If you really mean to allow subsequences that interleave to create the main sequence, you could probably just sort the sequence, solve the chunk problem, and then track where the individual numbers came from to provide interleaved subsequences).

I think you can solve this with dynamic programming; a direct implementation of the recurrence below takes time proportional to n times the square of the sequence length, since each table entry is a minimum over the possible positions of the last cut. Work from left to right to fill in arrays bestCost[i][j] and lastCut[i][j], where i runs along the sequence and j counts chunks. bestCost[i][j] is the cost of the best way of cutting up the sequence from 0 to i into j chunks. lastCut[i][j] is the position of the most recent cut in the solution that produces bestCost[i][j]. The recurrence is bestCost[i + 1][j] = min_k cost(k to i + 1) + bestCost[k - 1][j - 1], where cost(k to i + 1) is what the chunk from k to i + 1 contributes to the objective, and lastCut[i + 1][j] is the minimizing k. At the end you work out the cost of the best answer for n chunks in the same way and then use lastCut[][] to trace your way back to find the other cuts.
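A table-filling scheme of this kind might look like the Python sketch below. Several things in it are assumptions rather than part of this answer: the function name `split_min_stdev` is invented, the cost of a chunk is taken to be the square of its sum (borrowing the sqrt(E[X^2] - E[X]^2) observation from the other answer in this thread), chunks are required to be non-empty, and the tables are indexed as `best_cost[j][i]` with half-open slices rather than `bestCost[i][j]`.

```python
from itertools import accumulate

def split_min_stdev(seq, n):
    """Split seq into n contiguous non-empty chunks, minimizing the sum of
    squared chunk sums (which also minimizes their standard deviation)."""
    length = len(seq)
    if not 1 <= n <= length:
        raise ValueError("need 1 <= n <= len(seq)")
    prefix = [0, *accumulate(seq)]  # prefix[i] == sum(seq[:i])
    inf = float("inf")
    # best_cost[j][i]: minimal cost of cutting seq[:i] into j chunks.
    # last_cut[j][i]: start index of the final chunk in that best split.
    best_cost = [[inf] * (length + 1) for _ in range(n + 1)]
    last_cut = [[0] * (length + 1) for _ in range(n + 1)]
    best_cost[0][0] = 0
    for j in range(1, n + 1):
        for i in range(j, length + 1):
            for k in range(j - 1, i):  # the final chunk is seq[k:i]
                cost = best_cost[j - 1][k] + (prefix[i] - prefix[k]) ** 2
                if cost < best_cost[j][i]:
                    best_cost[j][i] = cost
                    last_cut[j][i] = k
    # Walk the stored cut positions backwards to recover the chunks.
    chunks, i = [], length
    for j in range(n, 0, -1):
        k = last_cut[j][i]
        chunks.append(seq[k:i])
        i = k
    return chunks[::-1]

print(split_min_stdev([1, 1, 1, 1, 1, 1, 10, 1], 2))
print(split_min_stdev([4, 1, 1, 1, 1, 6], 3))
```

The three nested loops make the running time proportional to n times the square of the sequence length. The prefix sums keep each chunk-sum lookup constant time.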

把回忆走一遍 2024-08-26 12:04:55

I agree that dynamic programming may be the best approach. One technique that I would rule out is non-linear optimization. You have a non-linear objective function whether you're minimizing the square root or just the sum of the squared differences. You also have integer variables as part of your constraint set: assigning members to sets requires some integer variables regardless of your formulation. A non-linear optimization with integer variables is generally very difficult, if not impossible, to solve optimally. If you only need an approximate solution, a genetic algorithm might be a good approach, where the genetic string is a representation of the assignment of members to sets.

As far as doing all this in less than a second... Good luck!
