My location-scale estimator function does not work with a polynomial mean

Posted on 2025-01-17 08:40:40

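I'm building my own maximum likelihood estimator that estimates the parameters associated with the mean and the standard deviation. On simulated data, my function works when the true mean is a linear function and the standard deviation is constant. However, if the mean structure is polynomial, my function cannot recover the true parameters. Can anybody point me to a solution?

I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them; I'm interested in why my function is not working.

Below is a reproducible example where my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123).

*Edit: added library()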

library(tidyverse)
my_poly_loglik <- function(pars, #parameters 
                           outcome, #outcome variable
                           poly_mean){ #matrix of polynomial terms, one column per power
  
  #modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]

  #modelling the standard deviation on exponential scale
  sd_mdl = exp(pars[length(pars)])
  
  #computing log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl, 
                                  log = TRUE), 
                            na.rm = TRUE)
  #since optim() is minimizing we want the -log likelihood
  return(-sum_log_likelihood)
}


#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2  #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean) 
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)


#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2) 
#Optimising my likelihood function
optim_res <- optim(pars, 
                   fn = my_poly_loglik, 
                   outcome = mydf$y,
                   poly_mean = polymean) 
if (optim_res$convergence != 0) stop("optim() did not converge (convergence code is not 0)")


#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
                     "truemean" = mean,
                     "z" = z)
#my mean (black) and true mean (red)
plot_df %>% 
  ggplot(aes(x = z, y = mymean)) +
  geom_line() +
  geom_line(aes(y = truemean), color = "red")
#Works!

#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd
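
As a reference point for the comparison above: for a normal model whose mean is linear in its coefficients and whose SD is constant, the maximum likelihood estimates have a closed form (the least-squares coefficients, and the square root of the mean squared residual for the SD). The sketch below computes them with `lm()` on data simulated the same way as in the example. The smaller `n` is an assumption made only to keep it quick, and the names `fit`, `beta_closed`, `sd_closed` and `true_sd` are illustrative and not part of the original post.

```r
# Simulate data as in the question (smaller n is an assumption, for speed)
set.seed(103)
n <- 10000
z <- runif(n, min = 0.1, max = 40)
true_sd <- exp(0.5)
y <- rnorm(n, mean = 10 + 0.2 * z + 0.4 * z^2, sd = true_sd)

# Least squares gives the ML estimates of the mean coefficients
fit <- lm(y ~ z + I(z^2))
beta_closed <- unname(coef(fit))

# The ML estimate of a constant SD divides by n, not by the residual df
sd_closed <- sqrt(mean(residuals(fit)^2))

print(beta_closed)
print(c(true_sd = true_sd, sd_closed = sd_closed))
```

Whatever `optim()` returns for the same likelihood can be compared against these values; a large gap in the SD suggests the optimizer stopped before reaching the maximum, rather than a mistake in the likelihood itself.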



Comments (1)

小伙你站住 2025-01-24 08:40:40

This is not a complete solution, but it might help others find the correct answer.

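The code looks fine overall, and the issue emerges only when the z values span a wide range. In fact, scaling them, or generating the data from a considerably narrower range, leads to the correct solution. Furthermore, checking the Hessian shows that the estimated covariance matrix of the estimates is not positive semidefinite, and slightly reducing the range gives correlations between the mean parameters close to 1. (This is a bit puzzling, since a normal linear model with the same parametrization does not suffer from the same issue. I know it does not optimize the likelihood directly, but it is still a bit unintuitive to me.)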
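
One way to make the sensitivity to the range of z concrete is to look at conditioning. For a fixed SD, the Hessian of the negative log-likelihood with respect to the mean coefficients is `t(X) %*% X / sd^2`, so its condition number equals that of `t(X) %*% X`. The sketch below compares that number for the raw design from the question, a centred and scaled version, and an orthogonal polynomial basis. The helper `cond_hessian` and the matrix names are illustrative assumptions, as is the smaller `n`.

```r
# Predictor simulated as in the question (smaller n is an assumption, for speed)
set.seed(103)
n <- 10000
z <- runif(n, min = 0.1, max = 40)

# Raw design, as built inside my_poly_loglik(): intercept, z, z^2
X_raw <- cbind(1, z, z^2)

# Centred and scaled predictor, then the same polynomial terms
z_s <- as.numeric(scale(z))
X_scaled <- cbind(1, z_s, z_s^2)

# Orthogonal polynomial basis; sqrt(n) puts its columns on the intercept's scale
X_orth <- cbind(1, sqrt(n) * poly(z, 2))

# 2-norm condition number of t(X) %*% X (hypothetical helper)
cond_hessian <- function(X) kappa(crossprod(X), exact = TRUE)

conds <- c(raw = cond_hessian(X_raw),
           scaled = cond_hessian(X_scaled),
           orthogonal = cond_hessian(X_orth))
print(conds)
```

A much larger value for the raw design would be consistent with the near-1 correlations reported here. It may also bear on the puzzle about the linear model: as far as I know, `lm()` solves the least-squares problem through a QR decomposition of `X` rather than by iterating on this surface, so it is less exposed to the same conditioning.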

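So a temporary workaround might be rescaling the predictors or using an orthogonal parametrization? But that does not really explain the core of the issue.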
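
A minimal sketch of the rescaling workaround follows, under several assumptions that differ from the original call: `optim()` defaults to the derivative-free Nelder-Mead method with a cap of 500 iterations, whereas this version switches to `method = "BFGS"`, supplies an analytic gradient, and starts from data-driven values instead of zeros. The function names `neg_loglik` and `neg_loglik_grad`, the starting vector `start`, and the smaller `n` are illustrative choices, not part of the post.

```r
# Negative log-likelihood: normal with mean X %*% beta and constant SD,
# with the SD parametrised on the log scale as in the question
neg_loglik <- function(pars, outcome, X) {
  k <- ncol(X)
  mu <- as.numeric(X %*% pars[seq_len(k)])
  -sum(dnorm(outcome, mean = mu, sd = exp(pars[k + 1]), log = TRUE))
}

# Analytic gradient of neg_loglik with respect to (beta, log(sd))
neg_loglik_grad <- function(pars, outcome, X) {
  k <- ncol(X)
  sigma <- exp(pars[k + 1])
  r <- outcome - as.numeric(X %*% pars[seq_len(k)])
  c(-as.numeric(crossprod(X, r)) / sigma^2,  # derivative in beta
    length(r) - sum(r^2) / sigma^2)          # derivative in log(sd)
}

# Data simulated as in the question (smaller n is an assumption, for speed)
set.seed(103)
n <- 10000
z <- runif(n, min = 0.1, max = 40)
true_sd <- exp(0.5)
y <- rnorm(n, mean = 10 + 0.2 * z + 0.4 * z^2, sd = true_sd)

# Centre and scale the predictor before building the polynomial terms
z_s <- as.numeric(scale(z))
X <- cbind(1, z_s, z_s^2)

# Start at the sample mean and the log of the sample SD rather than at zeros
start <- c(mean(y), 0, 0, log(sd(y)))
fit <- optim(start, fn = neg_loglik, gr = neg_loglik_grad,
             outcome = y, X = X,
             method = "BFGS", control = list(maxit = 1000))

sd_hat <- exp(fit$par[length(fit$par)])
print(c(convergence = fit$convergence, true_sd = true_sd, sd_hat = sd_hat))
```

The mean coefficients in `fit$par` refer to the scaled basis, so they are not directly comparable to 10, 0.2 and 0.4, but the SD estimate is unaffected by rescaling z. `poly(z, 2)` would be the orthogonal alternative to `scale()` here.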

