My location-scale estimator function does not work for polynomial means

Posted on 2025-01-17 08:40:40

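I'm building my own maximum likelihood estimator for the parameters associated with the mean and standard deviation. On simulated data, my function works when the true mean is a linear function and the standard deviation is constant. However, when the mean structure is polynomial, my function cannot recover the true parameters. Can anybody point me to a solution?

I'm aware there are plenty of existing functions for estimating means and SDs. I'm not interested in them; I'm interested in why my function is not working.

Below is a reproducible example in which my model does not recover the true standard deviation (true sd = 1.648, mysd = 4.184123).

*Edit: added library()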

library(tidyverse)
my_poly_loglik <- function(pars, #parameters 
                           outcome, #outcome variable
                           poly_mean){ #data frame of polynomials
  
  #modelling the mean - adding intercept column
  mean_mdl = cbind(1, poly_mean) %*% pars[1:(ncol(poly_mean) + 1)]

  #modelling the standard deviation on exponential scale
  sd_mdl = exp(pars[length(pars)])
  
  #computing log likelihood
  sum_log_likelihood <- sum(dnorm(outcome,
                                  mean = mean_mdl,
                                  sd = sd_mdl, 
                                  log = TRUE), 
                            na.rm = TRUE)
  #since optim() is minimizing we want the -log likelihood
  return(-sum_log_likelihood)
}


#Generate data
set.seed(103)
n <- 100000 #100k obs
z <- runif(n, min = 0.1, max = 40) #independent variable sampled uniformly
mean <- 10 + 0.2 * z + 0.4 * z^2  #mean structure
sd = exp(0.5) #constant SD
y <- rnorm(n,mean, sd)
#Visualizing simulated data
#plot(z,mean) 
#plot(z,sd)
#plot(z,y)
mydf = data.frame(z,y)


#Defining polynomials
polymean = cbind(z, z^2)
#Initial values. 2 extra for mean_intercept and SD
pars = rep(0, ncol(polymean) + 2) 
#Optimising my likelihood function
optim_res <- optim(pars, 
                   fn = my_poly_loglik, 
                   outcome = mydf$y,
                   poly_mean = polymean) 
if (optim_res$convergence != 0) stop("optim() did not converge (convergence code is not 0)!")


#comparing my function to the real parameter
plot_df = data.frame("mymean" = optim_res$par[1] + (polymean %*% optim_res$par[2:3]),
                     "truemean" = mean,
                     "z" = z)
#my mean (black) and true mean (red)
plot_df %>% 
  ggplot(aes(x = z, y = mymean)) +
  geom_line() +
  geom_line(aes(y = truemean), color = "red")
#Works!

#my SD and true SD - PROBLEM!
sd #true sd
exp(optim_res$par[length(optim_res$par)]) #my sd


小伙你站住 2025-01-24 08:40:40

This is not a complete solution, but it might help others find the correct answer.

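The code looks fine overall, and the problem only appears when the range of z values is large. In fact, rescaling z, or generating the data from a considerably narrower range, leads to the correct solution. Also, checking the Hessian shows that the estimated covariance matrix is not positive semidefinite, and slightly reducing the range gives correlations between the mean parameters that are close to 1. (This is a bit puzzling, because an ordinary normal linear model with the same parameterization does not run into the same problem. I know it does not optimize the likelihood directly, but it still seems unintuitive to me.)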
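To make the rescaling idea concrete, here is a minimal sketch, not a definitive fix. Several things in it are my own assumptions and not part of the question: the smaller sample (n = 5000), dividing z by 40 so the predictor lies in (0, 1], the data-driven starting values, and the switch to method = "BFGS". The name negloglik is hypothetical; it mirrors my_poly_loglik from the question.

```r
# Negative log-likelihood with the same structure as my_poly_loglik in the question
# (negloglik is a hypothetical name used only for this sketch)
negloglik <- function(pars, outcome, X) {
  mu <- cbind(1, X) %*% pars[1:(ncol(X) + 1)]   # intercept + polynomial terms
  sigma <- exp(pars[length(pars)])              # SD modelled on the log scale
  -sum(dnorm(outcome, mean = mu, sd = sigma, log = TRUE))
}

# Simulate data as in the question, with a smaller n (assumption) to keep it fast
set.seed(103)
n <- 5000
z <- runif(n, min = 0.1, max = 40)
y <- rnorm(n, mean = 10 + 0.2 * z + 0.4 * z^2, sd = exp(0.5))

# Rescale the predictor to (0, 1] before building the polynomial terms
z_scaled <- z / 40
X_scaled <- cbind(z_scaled, z_scaled^2)

# Data-driven starting values (assumption): mean(y) and log(sd(y))
start <- c(mean(y), 0, 0, log(sd(y)))
fit <- optim(start, negloglik, outcome = y, X = X_scaled,
             method = "BFGS", hessian = TRUE,
             control = list(maxit = 1000))

# Map the coefficients back to the original scale of z
coef_orig <- fit$par[1:3] / c(1, 40, 40^2)
sd_hat <- exp(fit$par[4])

# At a proper minimum all eigenvalues of the Hessian should be positive
hess_eigen <- eigen(fit$hessian, symmetric = TRUE, only.values = TRUE)$values

print(fit$convergence)  # 0 means optim() reports convergence
print(coef_orig)        # compare with the true values 10, 0.2, 0.4
print(sd_hat)           # compare with the true SD exp(0.5)
print(hess_eigen)
```

If the eigenvalues printed at the end are all positive, solve(fit$hessian) can be used as an estimate of the covariance matrix of the parameters on the rescaled parameterization.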

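So a temporary workaround might be to rescale the predictors or to use an orthogonal parameterization. But this does not really explain the core of the problem.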
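The orthogonal parameterization could be sketched as follows, again under stated assumptions. It uses poly() from base R to build orthogonal polynomial columns, then multiplies them by sqrt(n) so that they share the scale of the intercept column; that scaling, the smaller n, the starting values and method = "BFGS" are my choices, and negloglik is a hypothetical name mirroring my_poly_loglik.

```r
# Negative log-likelihood with the same structure as my_poly_loglik in the question
# (negloglik is a hypothetical name used only for this sketch)
negloglik <- function(pars, outcome, X) {
  mu <- cbind(1, X) %*% pars[1:(ncol(X) + 1)]   # intercept + polynomial terms
  sigma <- exp(pars[length(pars)])              # SD modelled on the log scale
  -sum(dnorm(outcome, mean = mu, sd = sigma, log = TRUE))
}

# Simulate data as in the question, with a smaller n (assumption) to keep it fast
set.seed(103)
n <- 5000
z <- runif(n, min = 0.1, max = 40)
true_mean <- 10 + 0.2 * z + 0.4 * z^2
y <- rnorm(n, mean = true_mean, sd = exp(0.5))

# poly() returns orthonormal columns that are also orthogonal to a constant column.
# matrix() drops the "poly" class and attributes; sqrt(n) rescaling is an assumption
# that gives every column the same mean square as the intercept column.
X_orth <- matrix(poly(z, degree = 2), nrow = n) * sqrt(n)

# With this design, crossprod(design) / n is (numerically) the identity matrix
design <- cbind(1, X_orth)
gram <- crossprod(design) / n

# Data-driven starting values (assumption): mean(y) and log(sd(y))
start <- c(mean(y), 0, 0, log(sd(y)))
fit <- optim(start, negloglik, outcome = y, X = X_orth,
             method = "BFGS", control = list(maxit = 1000))

# The coefficients now refer to the orthogonal basis, so compare fitted means
# and the SD instead of the raw polynomial coefficients
mean_hat <- as.vector(design %*% fit$par[1:3])
sd_hat <- exp(fit$par[4])

print(round(gram, 8))
print(max(abs(mean_hat - true_mean)))  # largest gap between fitted and true mean
print(sd_hat)                          # compare with the true SD exp(0.5)
```

Because the fitted coefficients belong to the orthogonal basis, they are not directly comparable with 10, 0.2 and 0.4. The fitted mean curve and the SD are the quantities to check.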
