Stan - Why does imputation of a Bernoulli response variable in Stan lead to biased inference?



I am trying to impute a binary response variable using a hierarchical model in Stan, but am getting biased estimates.

The data consist of a number of sessions, each with a different prevalence (here, the proportion of individuals shedding virus) that is estimated in the model.
These estimates are then used to impute the shedding status of individuals that were measured in a given session but whose shedding status could not be verified.
[E.g. individual 1 was sampled in session 1, and session 1 has a prevalence of 0.65, so the probability that individual 1 receives a positive shedding status is 0.65.]
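
For concreteness, that imputation step is just a single Bernoulli draw; a minimal R illustration (not part of the modelling code below):

# draw a shedding status for an individual in a session with prevalence 0.65
rbinom(n = 1, size = 1, prob = 0.65)   # returns 1 with probability 0.65, otherwise 0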

If I understand correctly, this imputation needs to be done by marginalizing over the imputed values, and the recommended approach for this is a mixture of the two possible outcomes (e.g. here and here).
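
For reference, the mixture construction referred to here is what Stan's log_mix(lambda, lp1, lp2) computes:

   log( lambda * exp(lp1) + (1 - lambda) * exp(lp2) )

where lambda is the mixing proportion and lp1, lp2 are the log densities of the two mixture components.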

When implementing this, however, I get biased prevalence estimates. I don't know whether this is because of my coding (I'm fairly new to Stan) or because of the approach itself (is the imputation somehow affecting the overall likelihood and biasing the prevalence estimates?), so I am looking for help with this.

(Note: the model setup is now extremely simple, but the goal is to set up this hierarchical model to allow regression of different covariates on individual shedding status, and this is one step towards testing the modeling framework).

Here are the simulated data and my code (I'm using R and rstan) to illustrate the problem:

Simulate data:

# number of sessions
n.sess = 10

# number of individuals per session   
n.ind = 20

# create dataframe to store simulated data
shed.sim = data.frame(session = rep(1:n.sess,each = n.ind),
                      ind = rep(1:n.ind,n.sess),
                      prev = NA,
                      shed = NA)

# generate UR prevalences, any value between 0 and 1    
set.seed(1982)  # set random seed so results can be duplicated   
q_t = runif(n = n.sess, min = 0, max = 1)
shed.sim$prev = rep(q_t, each = n.ind)   # each of the n.ind individuals in a session gets that session's prevalence

# for each session, generate a shedding status for each individual    

for(i in 1:nrow(shed.sim)){
       shed.sim$shed[i] = rbinom(n = 1, size = 1, prob = shed.sim$prev[i])
}

==> Session prevalences are 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.
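
As a quick sanity check (a minimal sketch, not part of the original workflow), the drawn prevalences can be compared with the realized per-session proportions of shedders; with only 20 individuals per session the two will differ somewhat:

# drawn ("true") prevalence for each session
round(q_t, 2)

# realized proportion of shedders per session in the simulated data
round(tapply(shed.sim$shed, shed.sim$session, mean), 2)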

Model code:

mod1 = 
       "
       data {
              int<lower=1> N;             // number of individuals
              int<lower=1> N_sess;        // number of sessions
              int<lower=1, upper=N_sess> S[N];      // session number for each individual
              int<lower=0> N_pos_sess[N_sess];   // number positive in each session
              int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
       }
       
       parameters {
              real<lower=0,upper=1> p_sess[N_sess];     // prevalence in each session
       }
       
       model {
              
              for(n_sess in 1:N_sess) {
                     p_sess[n_sess] ~ beta(1,1);   // uninformative prior for uniform distribution range (0,1)
                     N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]);   // estimate prevalence p_sess for each session, assuming a binomial distribution
              }
              

              for(n in 1:N) {
                     target += log_mix(p_sess[S[n]], bernoulli_lpmf(1 | p_sess[S[n]]), bernoulli_lpmf(0 | p_sess[S[n]]));   // marginalize over the unknown shedding status as a mixture of the two possible outcomes
              }
       }
       
       generated quantities {
              int y_miss[N];    // missing observations to be imputed

              for(n in 1:N){
                     y_miss[n] = bernoulli_rng(p_sess[S[n]]);  // bernoulli_rng(alpha) generates a bernoulli variate with chance of success alpha
              }
       }
       "

model.fit = stan(model_code = mod1,
                 data = list(N = nrow(shed.sim),
                             N_sess = n.sess,
                             S = shed.sim$session,
                             N_pos_sess = round(q_t * 30),   # using a sample size of 30 for each session
                             N_total_sess = rep(30, n.sess)),
                 chains = 2,
                 warmup = 5000,
                 iter = 10000,
                 cores = 1)
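
One way to pull point estimates (here, posterior means) for p_sess out of the fit and line them up against the simulated values, sketched with rstan's summary method:

# posterior means of the session prevalences
p_hat = summary(model.fit, pars = "p_sess")$summary[, "mean"]

# compare simulated and estimated prevalences side by side
round(cbind(simulated = q_t, estimated = p_hat), 2)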

This results in biased estimates of prevalence:

Simulated ("observed"): 0.75, 0.61, 0.50, 0.65, 0.22, 0.46, 0.97, 0.88, 0.35, 0.83.

Estimated: 0.88, 0.76, 0.50, 0.79, 0.12, 0.37, 0.97, 0.93, 0.21, 0.91.

Conditional on these (biased) estimates, the individual shedding statuses for each session are imputed correctly, so at least that part of the model works fine.

Could this bias be related to the issue raised here?

Any help would be greatly appreciated!

Thank you for reading and have a wonderful day.

Benny

PS: A similar approach without the mixture works well, but it doesn't allow adding a regression model, so it is not useful for the purposes of this project. I include it for completeness:

"
       data {
              int<lower=1> N;             // number of individuals
              int<lower=1> N_sess;        // number of sessions
              int<lower=1, upper=N_sess> S[N];      // session number for each individual
              int<lower=0> N_pos_sess[N_sess];   // number positive in each session
              int<lower=1> N_total_sess[N_sess]; // number of UR sheets in each session
       }
       
       parameters {
              real<lower=0,upper=1> p_sess[N_sess];     // prevalence in each session
       }
       
       model {
       
              for(n_sess in 1:N_sess) {
                     p_sess[n_sess] ~ beta(1,1);   // uninformative prior for uniform distribution range (0,1)
                     N_pos_sess[n_sess] ~ binomial(N_total_sess[n_sess],p_sess[n_sess]);   // estimate prevalence p_sess for each session, assuming a binomial distribution
                     
              }
       }
       
       generated quantities {
              int y_miss[N];

              for(n in 1:N){
                     y_miss[n] = bernoulli_rng(p_sess[S[n]]);  // bernoulli_rng(alpha) 'generates a bernoulli variate with chance of success alpha' (stan manual) 
              }
       }
       "
