Estimating a PDF whose tail density decreases monotonically

Posted 2025-01-11 14:21:41

tldr: I am numerically estimating a PDF from simulated data and I need the density to monotonically decrease outside of the 'main' density region (as x → ∞). What I have yields a density close to zero, but one that does not monotonically decrease.


Detailed Problem

I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function (PDF) of some random variable (whose distribution cannot be derived analytically) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them not to have spurious local maxima.

Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution function and applying some non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.

I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:

# some example simulation
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
# approximation at an observed value x (x <- 20 is illustrative)
x <- 20
approxfun(density(Simulated_stats_))(x)

This works well within the range of the simulated Simulated_stats_, see image:
Example PDF. The problem is I need to be able to evaluate the PDF far from the range of the simulated data.

So in the image above, I would need to evaluate the PDF at, say, x=50:

approxfun(density(Simulated_stats_))(50)
[1] NA

So instead I use the from and to arguments of density(), which correctly approximate the near-zero tails, e.g.

approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
)(50)
[1] 1.924343e-18

This is great, under one condition: I need the density to go to zero the further x is from the range. That is, if I evaluate at x = 51 the result must be strictly smaller than at x = 50. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic far from the 'main' density mass, i.e. in the extrapolated region.)
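
To make the concern concrete, here is a minimal sketch of the kind of objective involved (the parameterization is purely illustrative):

# minimal sketch of a simulated log-likelihood (parameterization illustrative):
# simulate the statistic under a candidate theta, estimate its density, and
# sum the log-density at the observed values x_obs
sim_loglik <- function(theta, x_obs, n_sim = 500) {
  sims <- runif(n_sim, 10, 15) + rnorm(n_sim, mean = theta, sd = 3)
  f <- approxfun(density(sims, from = 0, to = max(sims) * 10))
  sum(log(f(x_obs)))  # non-monotone tails can create spurious local maxima here
}
# e.g. optimize(sim_loglik, interval = c(5, 25), x_obs = Simulated_stats_, maximum = TRUE)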

To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the probability 'jumps' up and down, always very close to zero, but NOT monotonically decreasing.

    # build the interpolant once, then evaluate it on a grid
    f_tail <- approxfun(
      density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
    )
    a  <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = f_tail)
    aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
    plot(aa[, 1], log(aa[, 2]))

Result (figure): non-monotonic log density far from the density mass

My questions

Does this happen because of the kernel estimation in density(), or because of inaccuracies in approxfun()? (Or something else?)

What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
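
One candidate might be the logspline package, which models the log-density with a spline that is linear beyond the extreme knots, so the fitted density should decay exponentially (and hence monotonically) in the tails. A minimal sketch, assuming the CRAN logspline package:

library(logspline)
# the log-density is linear beyond the outer knots, so the density itself
# decays exponentially in the tails
fit <- logspline(Simulated_stats_)
dlogspline(c(50, 51), fit)  # expect strictly decreasing values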

Alternatively, how can I manually change the approximated PDF so that it monotonically declines the further I am from the density mass? I would happily tack on some linear trend that goes to zero...
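
For what it's worth, the linear-trend idea can be sketched roughly as follows (the boundary and slope choices are illustrative, the result is not renormalized, and the left tail would be handled analogously): keep the kernel estimate inside the data range and splice on an exponential tail, i.e. a linear trend in log-density, beyond the maximum.

f_in  <- approxfun(density(Simulated_stats_))
hi    <- max(Simulated_stats_)
eps   <- 1e-3
# finite-difference slope of the log-density at the upper boundary,
# capped below zero so the spliced tail always decays
slope <- min((log(f_in(hi)) - log(f_in(hi - eps))) / eps, -1e-6)
f_mono <- function(x) {
  ifelse(x > hi, f_in(hi) * exp(slope * (x - hi)), f_in(x))
}
f_mono(51) < f_mono(50)  # strictly smaller further out, as required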

Thanks!


Comments (1)

浮生未歇 2025-01-18 14:21:41

One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model could then be used to estimate the pdf at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.

  1. Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
  2. Generate your data
set.seed(123)
# note: the length-500 rnorm draw is recycled to match the length-5000 runif draw
Simulated_stats_ <- runif(n = 5000, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
  3. Function to estimate the CDF using a gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  qp = quantile(ss, probs = p)  # empirical quantiles at probabilities p
  # regress p on the quantiles with a smooth: an estimate of the CDF
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}

betamod <- get_mod(Simulated_stats_)
  4. Very basic estimate of the PDF at val, given the model that estimates the CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  # forward finite difference of the fitted CDF approximates the pdf
  as.numeric((yvals[1] - yvals[2]) / (xvals[1] - xvals[2]))
}
  5. Let's check that it is monotonically increasing below the min of Simulated_stats_
test_x = seq(0, min(Simulated_stats_), length.out = 1000)
pdf = sapply(test_x, est_pdf, betamod = betamod)
all(pdf == cummax(pdf))  # TRUE iff the estimated pdf is non-decreasing

[1] TRUE
  6. Let's check that it is monotonically decreasing above the max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out = 1000)
pdf = sapply(test_x, est_pdf, betamod = betamod)
all(pdf == cummin(pdf))  # TRUE iff the estimated pdf is non-increasing

[1] TRUE
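
As a quick sanity check (a sketch under the same setup), since est_pdf() differentiates an estimated CDF, it is worth confirming that the result integrates to roughly 1:

# rough normalization check (illustrative): differentiating an estimated
# CDF need not yield a density that integrates exactly to 1
integrate(function(x) sapply(x, est_pdf, betamod = betamod),
          lower = 0, upper = 60)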

Additional thoughts 3/5/22

As discussed in the comments, using the betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly (a rough sketch follows), there is another possible shortcut.
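
Writing your own predict might look roughly like the following sketch (using mgcv's lpmatrix interface; the evaluation grid is illustrative): build the linear-predictor matrix once for a fixed set of points, after which each prediction is just a matrix product plus the inverse link.

# sketch of manual prediction via the lpmatrix (grid illustrative)
xgrid <- seq(0, max(Simulated_stats_) * 10, length.out = 5000)
Xp <- predict(betamod, newdata = data.frame(qp = xgrid), type = "lpmatrix")
cdf_grid <- as.numeric(betamod$family$linkinv(Xp %*% coef(betamod)))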

  1. Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
  2. Use the approach you were initially using, i.e. a linear interpolation, but rather than interpolating the density() output, interpolate k (i.e. the estimates from the beta model above)
lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
  3. You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968

and it is very fast

microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)),
  times = 100
)

Unit: microseconds
    expr    min      lq     mean  median      uq    max neval
 betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8   100
 lin_int    1.7    2.25    3.503    4.35    4.50   10.5   100

Finally, let's check the same plot you made before, but using lin_int() instead of approxfun(density(...))

a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))

Figure: performance of lin_int() at the extremes
