Simple multi-dimensional curve fitting

Posted 2024-07-12 23:47:34

I have a bunch of data, generally in the form
a, b, c, ..., y

where y = f(a, b, c...)

Most involve three or four variables, and have 10k - 10M records. My general assumption is that they are algebraic in nature, something like:

y = P1 a^E1 + P2 b^E2 + P3 c^E3

Unfortunately, my last statistical analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks!

Comments (6)

情愿 2024-07-19 23:47:34

In case it's useful, here's a NumPy/SciPy (Python) template to do what you want:

from numpy import array
from scipy.optimize import leastsq

def residual(params, y, a, b, c):
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# load a, b, c, y as arrays, then guess initial values for
# p0, e0, p1, e1, p2, e2
p_opt, ier = leastsq(residual, array([p0, e0, p1, e1, p2, e2]), args=(y, a, b, c))
print('y = %f a^%f + %f b^%f + %f c^%f' % tuple(p_opt))
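As a sanity check, here's the same template run end-to-end on synthetic data built from made-up parameter values (the coefficients and exponents below are hypothetical, chosen just to exercise the fit; the data is noise-free so the optimizer can recover them exactly):

```python
import numpy as np
from scipy.optimize import leastsq

def residual(params, y, a, b, c):
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# synthetic data from hypothetical parameters; positive inputs keep a**e real
rng = np.random.default_rng(0)
a = rng.uniform(1.0, 10.0, 5000)
b = rng.uniform(1.0, 10.0, 5000)
c = rng.uniform(1.0, 10.0, 5000)
true_params = np.array([2.0, 1.5, -3.0, 0.5, 4.0, 1.2])
y = 2.0 * a**1.5 - 3.0 * b**0.5 + 4.0 * c**1.2

# start from a guess in the neighbourhood of the truth
guess = true_params * 0.9
p_opt, ier = leastsq(residual, guess, args=(y, a, b, c))
```

Note that nonlinear least squares is sensitive to the starting point; with a guess far from the truth it can settle into a poor local minimum.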

If you really want to understand what's going on, though, you're going to have to invest the time to climb the learning curve of some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools exclusively for things like 3-term power regressions.

风吹雪碎 2024-07-19 23:47:34

I spent over a week trying to do essentially the same thing. I tried a whole bunch of optimization stuff to fine tune the coefficients with basically no success, then I found out that there is a closed form solution and it works really well.

Disclaimer: I was trying to fit data with a fixed maximum polynomial order. If there is no limit to your E1, E2, etc. values, then this won't work for you.

Now that I've taken the time to learn this stuff, I actually see that some of the answers would have given good hints if I understood them. It had also been a while since my last statistics and linear algebra class.

So if there are other people out there who are lacking the linear algebra knowledge, here's what I did.

Even though this is not a linear function you are trying to fit, it turns out that this is still a linear regression problem. Wikipedia has a really good article on linear regression; I recommend reading it slowly: https://en.wikipedia.org/wiki/Linear_regression. It also links a lot of other good related articles.

If you don't know how to do a simple (single variable) linear regression problem using matrices, take some time to learn how to do that.

Once you learn how to do simple linear regression, then try some multi-variable linear regression. Basically, to do multi-variable linear regression, you create an X matrix with a row for each of your input data items, where each row contains all of the variable values for that data entry, plus a 1 in the last column, which is used for the constant term at the end of your polynomial (called the intercept). Then you create a Y matrix that is a single column with a row for each data item. Then you solve B = (X^T X)^-1 X^T Y. B then holds all of the coefficients for your polynomial.
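A minimal NumPy sketch of that recipe, using made-up coefficients (3, -2, and intercept 7) and noise-free data so the closed form recovers them exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.uniform(-5.0, 5.0, n)
x2 = rng.uniform(-5.0, 5.0, n)
y = 3.0 * x1 - 2.0 * x2 + 7.0  # hypothetical true model

# design matrix: one row per observation, final column of 1s for the intercept
X = np.column_stack([x1, x2, np.ones(n)])

# normal equations B = (X^T X)^-1 X^T Y, via solve() rather than an
# explicit inverse
B = np.linalg.solve(X.T @ X, X.T @ y)
```

`B` comes back as [3, -2, 7], the coefficients followed by the intercept.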

For multi-variable polynomial regression, it's the same idea, just now you have a huge multi-variable linear regression where each regressor (each term you're regressing on) gets its own coefficient in your giant polynomial expression.

So if your input data looks like this:

Inputs      Output
a1, b1      y1
a2, b2      y2
...         ...
aN, bN      yN

And you want to fit a 2nd order polynomial of the form y = c1 a^2 b^2 + c2 a^2 b + c3 a^2 + c4 a b^2 + c5 a b + c6 a + c7 b^2 + c8 b + c9, then your X matrix will look like:

a1^2*b1^2   a1^2*b1   a1^2   a1*b1^2   a1*b1   a1   b1^2   b1   1
a2^2*b2^2   a2^2*b2   a2^2   a2*b2^2   a2*b2   a2   b2^2   b2   1
...
aN^2*bN^2   aN^2*bN   aN^2   aN*bN^2   aN*bN   aN   bN^2   bN   1

Your Y matrix is simply:

y1
y2
...
yN

Then you do B = (X^T X)^-1 X^T Y and then B will equal

c1
c2
c3
c4
c5
c6
c7
c8
c9

Note that the total number of coefficients will be (o + 1)^V, where o is the order of the polynomial and V is the number of variables, so it grows pretty quickly.
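Here's a hedged NumPy sketch of the whole procedure for the 2nd-order, two-variable case above, with hypothetical coefficient values; np.linalg.lstsq solves the same least-squares problem as the normal equations, but more stably than forming (X^T X)^-1 explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
a = rng.uniform(-2.0, 2.0, n)
b = rng.uniform(-2.0, 2.0, n)

# hypothetical true coefficients c1..c9 for
# y = c1 a^2 b^2 + c2 a^2 b + c3 a^2 + c4 a b^2 + c5 a b + c6 a + c7 b^2 + c8 b + c9
c_true = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.5, -0.5, 4.0, 1.0])

# columns in the same order as the X matrix above
X = np.column_stack([a**2 * b**2, a**2 * b, a**2, a * b**2,
                     a * b, a, b**2, b, np.ones(n)])
y = X @ c_true

# least-squares solve; equivalent to the closed form but better conditioned
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With noise-free data, `coeffs` reproduces `c_true` to machine precision.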

If you are using good matrix code, then I believe the runtime complexity will be O(((o+1)^V)^3 + ((o+1)^V)^2 * N), where V is the number of variables, o is the order of the polynomial, and N is the number of data inputs you have. Initially this sounds pretty terrible, but in most cases o and V are probably not going to be high, so this just becomes linear with respect to N.

Note that if you are writing your own matrix code, then it is important to make sure that your inversion code uses something like an LU decomposition. If you use a naïve inversion approach (like I did at first), then that ((o+1)^V)^3 becomes ((o+1)^V)!, which is way worse. Before I made that change, I predicted that my 5th-order 3-variable polynomial would take roughly 400 googol millennia to complete. After using LU decomposition, it takes about 7 seconds.
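If you'd rather not hand-roll the matrix code, SciPy exposes LU factorization directly; a small sketch (the matrix here is synthetic and deliberately well-conditioned):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(7)
# a synthetic, diagonally dominant (hence well-conditioned) system
A = rng.uniform(-1.0, 1.0, (50, 50)) + 50.0 * np.eye(50)
b = rng.uniform(-1.0, 1.0, 50)

# factor once in O(n^3); each subsequent solve is a cheap triangular sweep
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)
```

Factoring once and reusing the factors is the win when you have many right-hand sides; for a single solve, np.linalg.solve does the same thing internally.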

Another disclaimer

This approach requires that (X^T X) not be a singular matrix (in other words, that it can be inverted). My linear algebra is a little rough, so I don't know all of the cases where that would occur, but I know that it occurs when there is perfect multi-collinearity between input variables. That means one variable is just another variable multiplied by a constant (for example, one input is the number of hours to complete a project and another is the dollars to complete a project, but the dollars are just based on an hourly rate times the number of hours).

The good news is that when there is perfect multi-collinearity, you'll know. You'll end up with a divide by zero or something when you are inverting the matrix.

The bigger problem is when you have imperfect multi-collinearity. That's when you have two closely related but not perfectly related variables (such as temperature and altitude, or speed and mach number). In those cases, this approach still works in theory, but it becomes so sensitive that small floating point errors can cause the result to be WAY off.

In my observations, however, the result is either really good or really bad, so you could just set some threshold on your mean squared error and, if it's over that, say "couldn't compute a polynomial".
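One cheap way to catch the shaky cases up front is the condition number of the design matrix, which blows up as columns approach linear dependence. A sketch using the hours/dollars example above (the rate and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
hours = rng.uniform(10.0, 100.0, n)
dollars = hours * 50.0 + rng.normal(0.0, 1e-6, n)  # nearly collinear with hours
other = rng.uniform(0.0, 1.0, n)                   # unrelated regressor

X_near = np.column_stack([hours, dollars, np.ones(n)])
X_fine = np.column_stack([hours, other, np.ones(n)])

# ratio of largest to smallest singular value; explodes near collinearity
cond_near = np.linalg.cond(X_near)
cond_fine = np.linalg.cond(X_fine)
```

A huge condition number is exactly the "small floating point errors cause a WAY-off result" regime, so thresholding on it works as an early warning before you even fit.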

不美如何 2024-07-19 23:47:34

The basics of data fitting involve assuming a general form of a solution, guessing some initial values for constants, and then iterating to minimize the error of the guessed solution to find a specific solution, usually in the least-squares sense.

Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.

Edit: Octave code for estimating the coefficients for a 2nd order polynomial

x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;

% Add noise to y data
y = y + randn(size(y))*0.1;

% Estimate coefficients of polynomial
p = polyfit(x,y,2)

On my machine, I get:

p =

   5.0886   3.9050   2.9577
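For comparison, the equivalent fit in NumPy (same polynomial and noise level as the Octave snippet; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.arange(0.0, 10.1, 0.1)
y = 5.0 * x**2 + 4.0 * x + 3.0

# add noise to the y data, as in the Octave snippet
y = y + rng.normal(0.0, 0.1, y.size)

# estimate coefficients of a 2nd-order polynomial (highest power first,
# just like Octave's polyfit)
p = np.polyfit(x, y, 2)
```

`p` comes back close to [5, 4, 3], with small deviations from the added noise.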
年少掌心 2024-07-19 23:47:34

Do you know to what power you want to limit your polynomial?

If there is no limit, then you can always get an exact match for N points by matching them to a polynomial that has N coefficients. To do this, you plug the N different points into your equation, yielding N equations in N unknowns (the coefficients), which you can then solve with simple high-school algebra or with a matrix.
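A small NumPy illustration of that idea: four points determine a cubic exactly, via the Vandermonde system (the sample points come from a made-up quadratic, so the cubic coefficient solves to zero):

```python
import numpy as np

# four points, generated from the hypothetical quadratic y = 2x^2 - x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 7.0, 16.0])

# one equation per point: c0 + c1*x + c2*x^2 + c3*x^3 = y
V = np.vander(x, increasing=True)
c = np.linalg.solve(V, y)  # recovers [1, -1, 2, 0]
```

The caveat is that exact interpolation through noisy data overfits badly, which is why the other answers lean on least squares instead.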

薄暮涼年 2024-07-19 23:47:34

If you have a guess at the form of f,[*] you need a minimizer to find the optimal parameters. The tools Scottie T suggests would work, as would ROOT, and many others.

If you don't have a clue what form f might take you are in deep trouble indeed.


[*] That is, you know that

f = f(x,y,z,w,...;p1,p2,p3...)

where the ps are parameters and the coordinates are x, y...

薆情海 2024-07-19 23:47:34

Short Answer: it isn't so simple. Consider a non-parametric approach on data sub-sets.

There are 2 main issues you need to decide on: (1) do you actually care about the parameters of the function, i.e. your P1, E1, ..., or would you be okay with just estimating the mean function, and (2) do you really need to estimate the function on all of the data?

The first thing I'll mention is that your specified function is non-linear (in the parameters to be estimated), so ordinary least squares won't work. Let's pretend that you specified a linear function. You'd still have a problem with the 10M values. Linear regression can be performed in an efficient way using QR factorization, but you are still left with an O(n * p^2) algorithm, where p is the number of parameters you are trying to estimate and n the number of records. If you want to estimate the non-linear mean function, it gets much worse.
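For what it's worth, a QR-based fit is only a few lines in NumPy; a sketch on synthetic linear data (the coefficients and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
X = np.column_stack([rng.uniform(-1.0, 1.0, (n, 2)), np.ones(n)])
beta_true = np.array([2.0, -3.0, 0.5])  # hypothetical coefficients
y = X @ beta_true + rng.normal(0.0, 0.05, n)

# X = QR reduces least squares to the triangular system R beta = Q^T y,
# avoiding any explicit matrix inversion
Q, R = np.linalg.qr(X)
beta = np.linalg.solve(R, Q.T @ y)
```

This is essentially what np.linalg.lstsq does under the hood (via SVD rather than QR), so in practice you'd usually just call that.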

The only way you are going to be able to estimate anything in such a large data set is by using a subset to perform the estimation. Basically, you randomly select a subset and use that to estimate the function.
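A sketch of that subsetting idea (the model, noise level, and subset size are all made up):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000_000
x = rng.uniform(0.0, 10.0, n)
y = 4.0 * x + 1.0 + rng.normal(0.0, 0.5, n)  # hypothetical linear model

# fit on a random subset rather than all n rows
idx = rng.choice(n, size=50_000, replace=False)
X_sub = np.column_stack([x[idx], np.ones(idx.size)])
beta, *_ = np.linalg.lstsq(X_sub, y[idx], rcond=None)
```

With 50k of the 2M rows, the estimates are already very close to the true slope and intercept; you can repeat with different subsets to gauge how stable they are.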

If you don't care about your parameter values, and just want to estimate the mean function you will probably be better off using a non-parametric estimation technique.

Hopefully this helps.

leif
