寻找曲线上的最佳权衡点
假设我有一些数据,我想为其拟合参数化模型。我的目标是找到该模型参数的最佳值。
我正在使用 AIC/BIC/MDL奖励低误差模型的标准类型,但也会惩罚高复杂性的模型(可以说,我们正在为这些数据寻找最简单但最令人信服的解释,例如 奥卡姆剃刀)。
按照上述内容,这是我根据三种不同标准得到的结果的示例(其中两个要最小化,一个要最大化):
在视觉上,您可以轻松地看到弯头形状,并且您可以在该区域的某个位置选择参数值。 问题是我正在做大量的实验,我需要一种方法来找到这个值而不需要干预。
我的第一直觉是尝试从拐角处画一条 45 度角的线,并不断移动它直到与曲线相交,但这说起来容易做起来难:)如果曲线有些倾斜,它也可能会错过感兴趣的区域。
关于如何实施这个或更好的想法有什么想法吗?
以下是重现上述图之一所需的示例:
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
plot(1:100, curve)
编辑
我接受了 乔纳斯。基本上,对于曲线上的每个点 p
,我们找到最大距离 d
的点,如下所示:
Say I had some data, for which I want to fit a parametrized model over it. My goal is to find the best value for this model parameter.
I'm doing model selection using a AIC/BIC/MDL type of criterion which rewards models with low error but also penalizes models with high complexity (we're seeking the simplest yet most convincing explanation for this data so to speak, a la Occam's razor).
Following the above, this is an example of the sort of things I get for three different criteria (two are to be minimized, and one to be maximized):
Visually you can easily see the elbow shape and you would pick a value for the parameter somewhere in that region.
The problem is that I'm doing do this for large number of experiments and I need a way to find this value without intervention.
My first intuition was to try to draw a line at 45 degrees angle from the corner and keep moving it until it intersect the curve, but that's easier said than done :) Also it can miss the region of interest if the curve is somewhat skewed.
Any thoughts on how to implement this, or better ideas?
Here's the samples needed to reproduce one of the plots above:
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
plot(1:100, curve)
EDIT
I accepted the solution given by Jonas. Basically, for each point p
on the curve, we find the one with the maximum distance d
given by:
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
找到弯头的快速方法是从曲线的第一个点到最后一个点画一条线,然后找到距离该线最远的数据点。
当然,这在某种程度上取决于直线平坦部分的点数,但如果每次测试相同数量的参数,结果应该相当不错。
A quick way of finding the elbow is to draw a line from the first to the last point of the curve and then find the data point that is farthest away from that line.
This is of course somewhat dependent on the number of points you have in the flat part of the line, but if you test the same number of parameters each time, it should come out reasonably ok.
如果有人需要 Jonas< 发布的 Matlab 代码的工作 Python 版本/a> 上面。
In case someone needs a working Python version of the Matlab code posted by Jonas above.
以下是 Jonas 给出的用 R 实现的解决方案:
Here is the solution given by Jonas implemented in R:
首先,快速回顾一下微积分:每个图的一阶导数
f'
表示所绘制的函数f
的变化率。二阶导数f''
表示f'
变化的速率。如果f''
很小,则意味着图形正在以适度的速度改变方向。但如果f''
很大,则意味着图形正在快速改变方向。您想要隔离图域中
f''
最大的点。这些将是为您的最佳模型选择的候选点。您选择哪一点必须取决于您,因为您还没有具体说明您对适应性与复杂性的重视程度。First, a quick calculus review: the first derivative
f'
of each graph represents the rate at which the functionf
being graphed is changing. The second derivativef''
represents the rate at whichf'
is changing. Iff''
is small, it means that the graph is changing direction at a modest pace. But iff''
is large, it means the graph is rapidly changing direction.You want to isolate the points at which
f''
is largest over the domain of the graph. These will be candidate points to select for your optimal model. Which point you pick will have to be up to you, since you haven't specified exactly how much you value fitness versus complexity.信息论模型选择的要点在于它已经考虑了参数的数量。因此,不需要找到弯头,只需要找到最小值即可。
仅当使用拟合时,查找曲线的肘部才有意义。即使这样,您选择选择弯头的方法在某种意义上也是对参数数量设置惩罚。要选择弯头,您需要最小化从原点到曲线的距离。距离计算中两个维度的相对权重将产生固有的惩罚项。信息论标准根据参数的数量和用于估计模型的数据样本的数量来设置此度量。
底线建议:使用 BIC 并取最小值。
The point of information theoretic model selection is that it already accounts for the number of parameters. Therefore, there is no need to find an elbow, you need only find the minimum.
Finding the elbow of the curve is only relevant when using fit. Even then the method that you choose to select the elbow is in a sense setting a penalty for the number of parameters. To select the elbow you will want to minimize the distance from the origin to the curve. The relative weighting of the two dimensions in the distance calculation will create an inherent penalty term. Information theoretic criterion set this metric based on the number of parameters and the number of data samples used to estimate the model.
Bottom line recommendation: Use BIC and take the minimum.
因此,解决此问题的一种方法是在肘部的 L 上两条线。但是,由于曲线的一部分只有几个点(正如我在评论中提到的),除非您检测到哪些点间隔开并在它们之间进行插值以产生更均匀的系列和,否则直线拟合会受到影响然后使用RANSAC找到两条线来拟合L - 有点复杂,但并非不可能。
因此,这里有一个更简单的解决方案 - 由于 MATLAB 的缩放功能(显然),您所绘制的图表看起来就像它们的样子。所以我所做的就是使用比例信息最小化从“原点”到您的点的距离。
请注意:原点估计可以显着改进,但我会将其留给您。
代码如下:
结果:
ALSO for the
Fit(maximize) 曲线,您必须将原点更改为
[x_axis(1) ticks(end)]
。So one way of solving this would be two fit two lines to the L of your elbow. But since there are only a few points in one portion of the curve (as I mentioned in the comment), line fitting takes a hit unless you detect which points are spaced out and interpolate between them to manufacture a more uniform series and then use RANSAC to find two lines to fit the L - a little convoluted but not impossible.
So here's a simpler solution - the graphs you've put up look the way they do thanks to MATLAB's scaling (obviously). So all I did was minimize the distance from the "origin" to your points using the scale information.
Please note: The origin estimation can be improved dramatically, but I'll leave that to you.
Here's the code:
Results:
ALSO for the
Fit(maximize)
curve you'll have to change to origin to[x_axis(1) ticks(end)]
.双导出法。然而,它似乎不适用于噪声数据。对于输出,您只需找到 d2 的最大值即可识别弯头。该实现是在 R 中实现的。
The double derived method. It does, however, not seem to work well for noisy data. For the output you simply find the maximum value of d2 to identify the elbow. This implementation is in R.
简单直观地说,
如果我们从曲线上的任意点到曲线的两个端点画两条线,则这两条线所成角度最小的点就是所需的点。
在这里,两条线可以想象为手臂,该点可以想象为肘点!
In a simple and intuitive way we can say that
If we draw two lines from any point on the curve to both of the end points of the curve, the point at which these two lines make the smallest angle in degrees is the desired point.
Here, the two lines can be visualized as the arms and the point as the elbow point!
我从事膝盖/肘点检测已有一段时间了。无论如何,我不是专家。
一些可能与此问题相关的方法。
DFDT 代表动态一阶导数阈值。它计算一阶导数并使用阈值算法来检测膝/肘点。 DSDT 类似,但使用二阶导数,我的评估表明它们具有相似的性能。
S-方法是L-方法的扩展。 L 法将两条直线拟合到曲线上,两条直线之间的交点就是膝点/肘点。通过循环整体点、拟合线条并评估 MSE(均方误差)来找到最佳拟合。 S方法拟合3条直线,这提高了精度,但也需要更多的计算。
我的所有代码都可在 GitHub 上公开获取。此外,这篇文章可以帮助您找到有关该主题的更多信息。它只有四页长,所以应该很容易阅读。您可以使用该代码,如果您想讨论任何方法,请随意这样做。
I have been working on Knee/Elbow point detection for some time. By no means, I am an expert.
Some methods that may relevant to this problem.
DFDT stands for Dynamic First Derivate Threshold. It computes the first derivative and uses a Thresholding algorithm to detect the knee/elbow point. DSDT is similar but uses the second derivative, my evaluation shows that they have similar performances.
S-method is an extension of the L-method. The L-method fits two straight lines to your curve, the interception between the two lines is the knee/elbow point. The best fit is found by looping overall points, fitting the lines and evaluate the MSE (Mean Square Error). The S-method fits 3 straight lines, this improves the accuracy but also requires some more computation.
All of my code is publicly available on GitHub. Furthermore, this article can help you found more information about the topic. It is only four pages long so it should be easy to read. You can use the code, and if you want to discuss any of the methods feel free to do so.
如果您愿意,我已将其翻译为 R 作为我自己的练习(请原谅我的非优化编码风格)。
*应用它来找到 k 均值上的最佳簇数 - 工作得很好。
函数的输入 x 应该是包含您的值的列表/向量
If you want, I have translated it to R as an exercise for myself (pardon my non-optimized coding style).
*Applied it to find the optimum clusters number on k-means - worked pretty fine.
the input x of the function should be a list/vector with your values
不要忽视模型选择的 k 折交叉验证,它是 AIC/BIC 的绝佳替代方案。还要考虑您正在建模的基本情况,并且您可以使用领域知识来帮助选择模型。
Don't neglect k-fold cross-validation for model selection, an excellent alternative to AIC/BIC. Also think about the underlying situation you are modeling and you are allowed to use domain knowledge to help select a model.