分段最小二乘动态规划算法
几天来我一直在尝试用 Python 实现这个算法。我不断地回到这个问题上,然后放弃并感到沮丧。我不知道发生了什么事。我没有可以向任何人求助,也没有地方可以寻求帮助,所以我来到了这里。
Mahesh Viswanathan - CS 473ug:算法 - 动态规划
PDF 警告:http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec10.pdf
我认为它没有明确解释,我肯定不明白。
我对发生的事情的理解是这样的:
我们有一组点 (x1,y1), (x2,y2).. 我们想要找到一些最适合这些数据的线。我们可以有多条直线,这些直线来自 a 和 b (y = ax +b) 的给定论坛。
现在,算法从末尾 (?) 开始,并假设点 p(x_i, y_i) 是线段的一部分。然后注释说最优解是“是 {p1,... 的最优解”。 。 。 pi−1} 加(最佳)线穿过 {pi , . 。 。 pn}'。这对我来说意味着我们到达点 p(x_i, y_i) 并向后移动并通过其余点找到另一条线段。现在最优解就是这两条线段。
然后它需要一个我无法遵循的逻辑跳转,并说“假设最后一个点 pn 是从 p_i 开始的段的一部分。如果 Opt(j) 表示前 j 点的成本,并且 e(j,k)这 通过点 j 到 k 的最佳线的误差,然后 Opt(n) = e(i, n) + C + Opt(i − 1)"
然后是该算法的伪代码,我不明白。我明白我们想要遍历点列表并找到最小化 OPT(n) 公式的点,但我只是不遵循它,这让我感觉很愚蠢,
我知道这个问题很痛苦 。回答这个问题并不容易,但我只是在寻找一些理解该算法的指导,我对 PDF 表示歉意,但我没有更好的方法来向读者提供关键信息,
感谢您花时间阅读本文。 ,我很欣赏。
I've been trying to implement this algorithm in Python for a few days now. I keep coming back to it and just giving up and getting frustrated. I don't know what's going on. I don't have anyone to ask or anywhere to go for help so I've come here.
Mahesh Viswanathan - CS 473ug: Algorithms - Dynamic Programming
PDF Warning: http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec10.pdf
I don't think its a clearly explained, I sure don't understand.
My understanding of what's happening is this:
We have a set of of points (x1,y1), (x2,y2).. and we want to find some lines that best fit this data. We can have multiple straight lines, and these lines come from the given forums for a and b (y = ax +b).
Now the algorithm starts at the end (?) and assumes that a point p(x_i, y_i) is part of the line segment. Then the notes say that the optimal solution is 'is optimal solution for {p1, . . . pi−1} plus (best) line through {pi , . . . pn}'. Which just means to me, that we go to the point p(x_i, y_i) and go backwards and find another line segment through the rest of the points. Now the optimal solution is both these line segments.
Then it takes a logical jump I can't follow, and says "Suppose the last point pn is part of a segment that starts at p_i. If Opt(j) denotes the cost of the first j points and e(j,k) the
error of the best line through points j to k then Opt(n) = e(i, n) + C + Opt(i − 1)"
Then there is the pseudocode for the algorithm, which I don't understand. I understand that we want to iterate through the list of points and find the points which minimize the OPT(n) formula, but I just don't follow it. It's making me feel stupid.
I know this question is a pain in the ass and that it's not easy to answer but I'm just looking for some guidance towards understanding this algorithm. I apologize for the PDF but I don't have a neater way of getting the crucial information to the reader.
Thank you for your time and reading this, I appreciate it.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
最小二乘问题要求找到一条最适合给定点的直线,并且有一个很好的封闭形式解决方案。该解用作求解分段最小二乘问题的原语。
在分段最小二乘问题中,我们可以有任意数量的线段来拟合给定的点,并且我们必须为使用的每个新线段支付成本。如果使用新线段的成本为 0,我们可以通过在每个点上传递一条单独的线来完美地拟合所有点。
现在假设我们试图找到最适合 n 个给定点的线段集。如果我们知道 n-1 个子问题的最佳解决方案:第一个点的最佳拟合,前 2 个点的最佳拟合,...,前 n-1 个点的最佳拟合,那么我们可以计算原始问题的最佳解决方案n 个点如下:
第 n 个点位于单个线段上,该线段从某个较早的点 i 开始,对于某些 i = 1, 2, ..., n。我们已经解决了所有较小的子问题,即少于 n 个点,因此我们可以找到 n 个点的最佳拟合,从而最小化:
前 i-1 点的最佳拟合成本(已计算)+
最适合点 i 到 n 的单线成本(使用最小二乘问题)+
使用新线段的成本
最小化上述数量的 i 值给我们提供了解决方案:使用子问题 i-1 和通过点 i 到 n 的线段的最佳拟合。
如果您需要更多帮助,我已经编写了该算法的解释并在此处提供了 C++ 实现:
The least squares problem asks to find a single line of best fit for the given points and has a nice closed form solution. This solution is used as a primitive to solve the segmented least squares problem.
In the segmented least squares problem, we can have any number of segments to fit the given points and we have to pay a cost for each new segment used. If the cost of using a new segment was 0, we could perfectly fit all points by passing a separate line through each point.
Now suppose we are trying to find the set of segments that is a best fit to the n given points. If we knew the best solutions for n-1 subproblems: best fit for first point, best fit for first 2 points, ..., best fit for first n-1 points, then we could compute the best solution for the original problem with n points as follows:
The nth point lies on a single segment and this segment starts at some earlier point i, for some i = 1, 2, ..., n. We have already solved all smaller subproblems i.e., having less than n points so we can find the best fit for n points that minimizes:
cost of best fit for first i-1 points (already computed) +
cost of single line that is best fit for points i to n (using least squares problem) +
cost of using a new segment
The value of i that minimizes the above quantity gives us the solution: use the best fit to subproblem i-1 and the segment through point i to n.
If you need more help, I've written an explanation of the algorithm and provided C++ implementation here: http://kartikkukreja.wordpress.com/2013/10/21/segmented-least-squares-problem/
棘手的部分,即动态编程部分,是
之前完成的 M[0] = 0 部分,n 是数据点的总数。另外,下划线意味着后面括号中的部分应该加下标。教授很可能使用 OPT 而不是 M,这是在其他一些大学关于此的讲座中完成的,您可以在网上找到(并且看起来几乎相同)。现在,如果您查看上面的代码块和讲座中的代码块,您会发现差异。我使用了
M[i-1]
,你的教授使用了M[j-1]
。这是你教授的伪代码中的一个拼写错误。您甚至可以回顾之前的幻灯片,看看他在错误函数中的正确性。基本上,对于每个 j,您要找到从点 i 绘制一条到 j 的线,这样它的误差加上添加这条额外线的成本 (C),加上使所有线段达到 i 的成本 (已经被最佳选择的)被最小化。
另外,请记住
e(i,i) = 0
以及e(i,i+1)
因为将一条线拟合到一个点不会产生错误,并且只剩下两点。The tricky part, the dynamic programming part, is the section
where
M[0] = 0
is done earlier and n is the total number of data points. Also, the underscore means that the section in parentheses afterwards should be subscripted. The professor could well have used OPT instead of M, and this is done in other some other universities' lectures about this which you can find on the web (and which look almost identical). Now, if you look at my code block above and that in the lecture, you will notice a difference. I usedM[i-1]
and your professor usedM[j-1]
. It's a typo in your professor's pseudocode. You can even look back to the slide before and see that he had it correct in the error function there.Basically, for each j, you are finding the point i to draw a line to j from such that the error of it, plus the cost of adding this extra line (C), plus the cost of making all line segments up to i (which has already been chosen optimally) is minimized.
Also, remember that
e(i,i) = 0
as well ase(i,i+1)
because fitting a line to a point gives no error as well as to just two points.从点 1 开始,直到点 j 的最佳解决方案必须在最后一条线段中包含新的端点 j,因此问题变成我必须在哪里放置最后一个分割以最小化最后一条线的成本 -部分?
幸运的是,成本是根据您试图解决的同一问题的子问题来计算的,幸运的是,当您移动到下一个点 j 时,您已经解决了这些较小的子问题。因此,在新点 j 处,您尝试在点 1 和 j 之间找到最佳点 i,以分割出包含 j 的新线段,并最小化成本:optimal_cost_up_to(i) + cost_of_split + cost_of_lsq_fit(i+1) ,j)。现在令人困惑的部分是,在任何时候,您似乎都只是找到一个分割,但实际上所有先前的分割都是由 optimization_cost_up_to(i) 确定的,并且因为您已经计算了所有这些点的最佳成本直到 j,那么您只需要记住答案,这样算法就不必在每次前进一点时重新计算这些成本。
在Python中,我可能会使用字典来存储结果,尽管对于这种动态编程算法,数组可能会更好...
无论如何...
它不漂亮,而且我还没有检查它(可能存在涉及以下错误)不拆分是最佳成本的情况),但希望它能帮助您走上正确的道路?请注意,lsqFitCost 必须在每次迭代中做很多工作,但是对于像这样的最小二乘线拟合,您可以通过维护计算中使用的运行总和来使成本大大降低......您应该 Google 最小二乘线拟合更多信息。这可以使 lsqFitCost 恒定,因此总时间将为 O(N^2)。
Starting from point 1, the best solution up until a point j, must include the new end-point j in the last line segment, so the problem becomes where do I have to place the last split to minimize the cost of this last line-segment?
Fortunately the cost is calculated in terms of subproblems of the same problem you are trying to solve, and fortunately you've already solved these smaller subproblems by the time you've moved to the next point j. So at the new point j you are trying to find an optimal point i, between points 1 and j, to split off a new line segment that includes j, and minimizes the cost: optimal_cost_up_to(i) + cost_of_split + cost_of_lsq_fit(i+1,j). Now the confusing part is that at any point it might seem like you are just finding a single split, but in reality all the previous splits are determined by optimal_cost_up_to(i), and since you've already calculated the optimal cost for all these points leading up to j, then you just need to memoize the answers so that the algorithm doesn't have to recalculate these costs each time it advances a point.
In python I'd probably use a dictionary to store the results, although for this dynamic programming algorithm an array might be better...
anyways...
it's not pretty, and I haven't checked it (there might be an error involving the case where no split is the best cost), but hopefully it'll help get you on the right path? Note that lsqFitCost has to do a lot of work on each iteration, but for a least squares line fit like this you can make this cost a lot less by maintaining running sums used in the calculation... you should Google least squares line fitting for more info. This could make lsqFitCost constant so the overall time would be O(N^2).
动态编程的基本前提是递归地优化问题(降低“成本”或在本例中为错误),同时存储子问题的成本,这样它们就不会被重新计算(这称为记忆化)。
有点晚了,所以我不会说得太详细,但似乎你遇到最多问题的部分是核心 DP 本身。由于“分段”质量,这里需要 DP。正如您的 PDF 所示,通过最小二乘法找到最佳拟合线很简单,并且不需要 DP。
算法
所以基本上,找到任意两个可能点之间的单行成本,存储在 e
对于 j 在 1 和 n 之间的情况,求 j 之前的最小成本。沿途的值将有助于以后的计算,因此请存储它们!请注意,i 也是 Opt 的参数。
M[n] 是总优化成本。
问您的问题 - 如何确定分段发生的位置?一切结束后,如何使用它和 M 来确定线段集?
希望这有帮助!
The basic premise of dynamic programming is to optimize (lower the 'cost' or in this case, error) of a problem recursively while storing the cost of subproblems as you go so they are not recomputed (this is called memoization).
It's a little late so I'm not gonna get too detailed, but it seems the part you have the most problem with is the core DP itself. DP is needed here because of the 'segmeneted' quality. As your PDF shows, finding the best-fit line by least squares is simple and does not require DP.
Now the algorithm
So basically, find the one-line cost between any two possibly points, store in e
Find the least cost up to j, for j between 1 and n. Values along the way will help with later computation, so store them! Note that i is a parameter to Opt as well.
M[n] is the total optimized cost.
Question for you - How can you determine where segmentation occurs? How can you use this, and M, to detetermine the set of line segments once its all over?
Hope this helps!
以下是分段最小二乘问题的动态规划的公式:
这里,
M[j]
表示在点上拟合的最小误差(回归)线i
到j
,我们可以通过保留后向指针数组和动态编程数组来以最小误差 (MSE) 跟踪起始点i
。另外,c
表示绘制一条线的成本(作为对适合的线数的惩罚)。最优子结构性质由贝尔曼方程确定。这是我在以下带有点
(xs, ys)
的 2D 数据集上对上述 DP 算法的python
实现(散点图如下):现在绘制通过动态规划公式获得的最小二乘线段:
正如预期的那样,DP 找到了 3 条最佳最小二乘线 (
L=3
) 拟合数据。Here is how the dynamic programming for the segmented least squares problem is formulated:
Here,
M[j]
represents the minimum error (regression) line fitted on the pointsi
toj
, we can track the starting pointi
with minimum error (MSE) by keeping a back pointer array along with the dynamic programming array. Also,c
denotes the cost to draw a line (acts as a penalty on number of lines fit). The optimal substructure property is by Bellman equation.Here is my
python
implementation for the above DP algorithm, on the following 2D dataset with points(xs, ys)
(scatter plotted below):Now plot the least square line segments obtained with the dynamic programming formulation:
As expected, the DP found the 3 optimal least-square lines (
L=3
) fitted on the data.