R: Bayesian logistic regression for hierarchical data
This is a repost from stats.stackexchange, where I did not get a satisfactory response. I have two datasets: the first describes schools, and the second lists the students in each school who have *failed* a standardized test (emphasis intentional). Fake datasets can be generated as follows (thanks to Tharen):
# Random school data for 30 schools
schools.num <- 30
schools.data <- data.frame(
  school_id  = seq_len(schools.num),
  tot_white  = sample(100:300, schools.num, TRUE),
  tot_black  = sample(100:300, schools.num, TRUE),
  tot_asian  = sample(100:300, schools.num, TRUE),
  school_rev = sample(4e6:6e6, schools.num, TRUE)
)
# Total students in each school
schools.data$tot_students <- schools.data$tot_white + schools.data$tot_black + schools.data$tot_asian
# Sum of all students across all schools
tot_students <- sum(schools.data$tot_white, schools.data$tot_black, schools.data$tot_asian)
# Generate some random failing students (5% of all students)
fail.num <- as.integer(tot_students * 0.05)
students <- data.frame(
  student_id = sample(seq_len(tot_students), fail.num, FALSE),
  school_id  = sample(seq_len(schools.num), fail.num, TRUE),
  race       = sample(c('white', 'black', 'asian'), fail.num, TRUE)
)
I am trying to estimate P(Fail=1 | Student Race, School Revenue). If I run a multinomial discrete choice model on the student dataset, I would be estimating P(Race | Fail=1), so I have to estimate the inverse of this. Since all the pieces of information are available in the two datasets (P(Fail), P(Race), Revenue), I see no reason why this can't be done, but I am stumped as to how to actually implement it in R. Any pointers would be much appreciated. Thanks.
It will be easier if you have a single data.frame.
You can then look at the data
Or compute anything you want.
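A minimal sketch of what such a single data.frame could look like, assuming one row per school and race with the number of students and the number of failures. The names `totals`, `fails`, `d` and `by_race` are made up for this illustration, and the seed is only there to make the fake data reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

# Compact copy of the fake data from the question
schools.num <- 30
schools.data <- data.frame(
  school_id  = seq_len(schools.num),
  tot_white  = sample(100:300, schools.num, TRUE),
  tot_black  = sample(100:300, schools.num, TRUE),
  tot_asian  = sample(100:300, schools.num, TRUE),
  school_rev = sample(4e6:6e6, schools.num, TRUE)
)
tot_students <- with(schools.data, sum(tot_white, tot_black, tot_asian))
fail.num <- as.integer(tot_students * 0.05)
students <- data.frame(
  student_id = sample(seq_len(tot_students), fail.num, FALSE),
  school_id  = sample(seq_len(schools.num), fail.num, TRUE),
  race       = sample(c("white", "black", "asian"), fail.num, TRUE)
)

# Long format: one row per (school, race) with the head count
races <- c("white", "black", "asian")
totals <- data.frame(
  school_id  = rep(schools.data$school_id, times = length(races)),
  race       = rep(races, each = schools.num),
  total      = unlist(schools.data[paste0("tot_", races)], use.names = FALSE),
  school_rev = rep(schools.data$school_rev, times = length(races))
)

# Count the failed students per (school, race)
fails <- as.data.frame(
  table(school_id = students$school_id, race = students$race),
  responseName = "fail", stringsAsFactors = FALSE
)
fails$school_id <- as.integer(fails$school_id)

# Single data.frame: totals, failures and passes side by side
d <- merge(totals, fails, by = c("school_id", "race"), all.x = TRUE)
d$fail[is.na(d$fail)] <- 0   # cells with no failed student
d$pass <- d$total - d$fail

# Look at the data
head(d)

# Compute, for example, the observed failure rate by race
by_race <- aggregate(cbind(fail, total) ~ race, data = d, FUN = sum)
by_race$fail_rate <- by_race$fail / by_race$total
by_race
```

With the fake data the failure rates carry no signal, since race and school are assigned to failed students at random.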
(EDIT) If you have more information about the failed students,
but only aggregated data for the passed ones,
you can recreate a complete dataset as follows.
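One way this reconstruction might look, as a sketch on a tiny hypothetical example: `agg` holds the aggregated counts per school and race, `failed` holds the student-level records for failed students only, and the passed students are rebuilt by repeating each aggregated row once per passed student. All names and numbers below are invented for the illustration.

```r
# Hypothetical aggregated counts: one row per school and race
agg <- data.frame(
  school_id = c(1, 1, 2, 2),
  race      = c("white", "black", "white", "black"),
  total     = c(5, 4, 6, 3),
  fail      = c(1, 0, 2, 1)
)
agg$pass <- agg$total - agg$fail

# Hypothetical student-level records, available for failed students only
failed <- data.frame(
  school_id = c(1, 2, 2, 2),
  race      = c("white", "white", "white", "black"),
  fail      = 1
)

# One row per passed student: repeat each aggregated row `pass` times
passed <- agg[rep(seq_len(nrow(agg)), times = agg$pass), c("school_id", "race")]
passed$fail <- 0

# Complete dataset: failed and passed students together
full <- rbind(failed, passed)
rownames(full) <- NULL
table(full$race, full$fail)
```

Any extra columns known only for the failed students would have to be set to NA for the reconstructed passed students before the two parts are combined.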
You'll need a dataset with information on all students. Both failed and passed.
Then you can use glmer() from the lme4 package for a frequentist approach.
Have a look at the MCMCglmm package if you need Bayesian estimates.
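A rough sketch of the frequentist route, assuming the data have already been aggregated to one row per school and race with `fail` and `pass` counts. The data frame `d` below is simulated stand-in data with the same shape as the question's fake data, not the question's data itself, and the rescaled column `rev_m` is an assumption added to keep the revenue on a manageable scale. The plain `glm()` fit ignores the grouping by school; the `glmer()` fit adds a random intercept per school and is only run if lme4 is installed.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility

# Simulated stand-in data: 30 schools x 3 races, 5% failure probability
d <- expand.grid(school_id = 1:30,
                 race = c("white", "black", "asian"),
                 stringsAsFactors = FALSE)
d$total <- sample(100:300, nrow(d), TRUE)
school_rev <- sample(4e6:6e6, 30, TRUE)
d$school_rev <- school_rev[d$school_id]
d$fail <- rbinom(nrow(d), d$total, 0.05)
d$pass <- d$total - d$fail
d$rev_m <- d$school_rev / 1e6   # revenue in millions

# Logistic regression on aggregated counts: P(Fail = 1 | race, revenue)
fit_glm <- glm(cbind(fail, pass) ~ race + rev_m, family = binomial, data = d)
summary(fit_glm)

# Estimated failure probability for one race at a given revenue
p_hat <- predict(fit_glm,
                 newdata = data.frame(race = "black", rev_m = 5),
                 type = "response")
p_hat

# Hierarchical version with a random intercept per school (needs lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  fit_glmer <- lme4::glmer(cbind(fail, pass) ~ race + rev_m + (1 | school_id),
                           family = binomial, data = d)
  print(lme4::fixef(fit_glmer))
}
```

Because the stand-in data are generated with a constant failure probability, the race and revenue coefficients should come out close to zero here, and the school-level variance may be estimated at or near zero.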