Root mean square deviation on binned GAM results using R

Posted 2024-09-06 05:10:07

Background

A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:

cor( rank(x), rank(y) )
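
As a quick check, ranking both vectors and taking the Pearson correlation should give the same value as R's built-in Spearman option. The data below are made up for illustration and are not the measurements from the question.

```r
# Hypothetical data: a weak upward trend plus noise over 1900-2009.
set.seed(1)
x <- 1900:2009
y <- 0.01 * (x - 1900) + rnorm(length(x))

# Spearman's correlation computed two ways.
rho_manual  <- cor(rank(x), rank(y))            # Pearson correlation of the ranks
rho_builtin <- cor(x, y, method = "spearman")   # built-in equivalent

all.equal(rho_manual, rho_builtin)
```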

Also in R, a naïve calculation of a fitted generalized additive model (GAM):

data.frame( x, fitted( gam( y ~ s(x) ) ) )

Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.

Problem

The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:

[Image: fitted trend line]

The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.

Possible Solution

One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.

Questions

Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?

Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?

Attempted Solution 1

  1. Call the PL/R function using the observed amounts and the model (GAM) amounts:
    correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
  2. Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
    CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
    o double precision[], m double precision[])
    RETURNS double precision AS
    $BODY$
    sqrt( mean( o - m ) ^ 2 )
    $BODY$
    LANGUAGE 'plr' VOLATILE STRICT
    COST 100;
    

The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:

omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )

Then correct the RMSE calculation as:

sqrt( mean( omean - mmean ) ^ 2 )
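
One thing to watch in the expression above: `mean(d) ^ 2` squares the mean of the differences, so the whole expression reduces to the absolute mean error and positive and negative errors cancel. RMSE is normally computed by squaring each difference first. A minimal sketch, where `rmse` is a hypothetical helper name and the vectors are made up:

```r
# Hypothetical helper; o = observed values, m = modelled values.
rmse <- function(o, m) {
  stopifnot(length(o) == length(m))
  sqrt(mean((o - m)^2))   # square each error, then average, then take the root
}

# Made-up values whose errors alternate between -1 and +1.
o <- c(1, 2, 3, 4)
m <- c(2, 1, 4, 3)

abs_mean_error <- sqrt(mean(o - m)^2)  # 0: the errors cancel before squaring
true_rmse      <- rmse(o, m)           # 1: every error has magnitude 1
```

The same helper applies unchanged to binned vectors such as `omean` and `mmean`.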

How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for an arbitrary length vector in an appropriate number of bins (5, for example, might not be ideal for only 67 measurements)?

I don't think hist is suitable here, is it?

Attempted Solution 2

The following code solves the problem; however, it drops data points from the end of the vector (to make its length divisible by 5). The solution isn't ideal because the bin size of 5 is a magic number.

while( length(o) %% 5 != 0 ) {
  o <- o[-length(o)]
}

omean <- apply( matrix(o, 5), 2, mean )

What other options are available?

Thanks in advance.

Comments (1)

﹏雨一样淡蓝的深情 2024-09-13 05:10:07

You say that:

The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.

You could calculate the correlation between the fitted values and the measured values:

cor(y,fitted(gam(y ~ s(x))))
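
A fuller sketch of that call, assuming `gam()` and `s()` come from the mgcv package (the question does not say which package is in use) and using simulated data in place of the real measurements. It also computes the unbinned RMSE and reads the adjusted R-squared from the model summary, which are two other common ways to summarise fit:

```r
library(mgcv)  # assumption: gam() and s() are taken from mgcv

# Simulated stand-in for the yearly measurements; not the poster's data.
set.seed(123)
x <- 1900:2009
y <- 5 + 0.02 * (x - 1900) + sin((x - 1900) / 15) + rnorm(length(x), sd = 0.5)

fit <- gam(y ~ s(x))
yhat <- as.numeric(fitted(fit))

r_fit  <- cor(y, yhat)               # correlation of measured vs fitted values
rmse   <- sqrt(mean((y - yhat)^2))   # root mean square error, in the units of y
adj_r2 <- summary(fit)$r.sq          # adjusted R-squared reported by mgcv
```

Unlike a correlation, the RMSE is not bounded between 0 and 1; it is expressed in the units of `y`.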

I don't see why you want to bin your data, but you could do it as follows:

mean.binned <- function(y,n = 5){
  apply(matrix(c(y,rep(NA,(n - (length(y) %% n)) %% n)),n),
        2,
        function(x)mean(x,na.rm = TRUE))
}

It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
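
The function above can be tried on a vector of 67 values, the awkward length mentioned in the question. The sketch below repeats the function so that it runs on its own, adds an alternative built on `tapply` (`bin_means` is a hypothetical name), and computes an RMSE on the binned means. All data are made up.

```r
# The answer's function, reproduced so that this block is self-contained.
mean.binned <- function(y, n = 5) {
  pad <- (n - (length(y) %% n)) %% n      # number of NAs needed to fill the last bin
  apply(matrix(c(y, rep(NA, pad)), n), 2,
        function(x) mean(x, na.rm = TRUE))
}

# Alternative sketch (hypothetical name): label each element with its bin index.
bin_means <- function(y, n = 5) {
  as.vector(tapply(y, ceiling(seq_along(y) / n), mean))
}

# Made-up data: 67 observed values and a noisy stand-in for the model values.
set.seed(42)
o <- rnorm(67, mean = 10)
m <- o + rnorm(67, sd = 0.5)

omean <- mean.binned(o)   # 14 bins; the last one averages only 2 values
mmean <- mean.binned(m)
binned_rmse <- sqrt(mean((omean - mmean)^2))
```

Both versions keep every data point. The bin size `n` stays a free choice, and the final bin can contain fewer values than the others.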

You also say that:

One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.

I don't understand what you mean by this. The correlation is a factor in determining the mean squared error - for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
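
A commonly quoted form of this relationship, assumed here to be the one the citation refers to, is MSE = (mean(f) - mean(x))^2 + s_f^2 + s_x^2 - 2 * s_f * s_x * r, where f is the model output, x the observations, s the divide-by-n standard deviation, and r the correlation between f and x. The identity can be checked numerically on made-up vectors:

```r
# Made-up observation (x) and model (f) vectors for illustration only.
set.seed(7)
x <- rnorm(110, mean = 5, sd = 2)
f <- 0.8 * x + rnorm(110, mean = 1)

# Population (divide-by-n) standard deviation, so that the identity is exact.
sd_pop <- function(v) sqrt(mean((v - mean(v))^2))

mse <- mean((f - x)^2)
decomposed <- (mean(f) - mean(x))^2 +
  sd_pop(f)^2 + sd_pop(x)^2 -
  2 * sd_pop(f) * sd_pop(x) * cor(f, x)

all.equal(mse, decomposed)
```

Since the correlation r already enters the mean squared error as one term, an RMSE does not refine the correlation; it measures bias and spread as well.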
