Replicating a probit regression in SAS and R
I'm trying to replicate my SAS work in R, but I get slightly different results -- differences that can't be explained by rounding error.
Here's my SAS code:
proc qlim data=mydata;
model y = x1 x2 x3/ discrete(d=probit);
output out=outdata marginal;
title "just ran QLIM model";
run;
quit;
And here's my R code:
mymodel <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"), data=mydata)
I'm not really sure why I'd get different results, and would greatly appreciate an explanation.
EDIT:
Here's my data:
2.66 20 0 0
2.89 22 0 0
3.28 24 0 0
2.92 12 0 0
4.00 21 0 1
2.86 17 0 0
2.76 17 0 0
2.87 21 0 0
3.03 25 0 0
3.92 29 0 1
2.63 20 0 0
3.32 23 0 0
3.57 23 0 0
3.26 25 0 1
3.53 26 0 0
2.74 19 0 0
2.75 25 0 0
2.83 19 0 0
3.12 23 1 0
3.16 25 1 1
2.06 22 1 0
3.62 28 1 1
2.89 14 1 0
3.51 26 1 0
3.54 24 1 1
2.83 27 1 1
3.39 17 1 1
2.67 24 1 0
3.65 21 1 1
4.00 23 1 1
3.1 21 1 0
2.39 19 1 1
And here are my estimated coefficients (standard errors in parentheses), in the order intercept, x1, x2, x3:
            SAS                     R
Intercept   -7.452320 (2.542536)    -7.25319 (2.50977)
x1           1.625810 (0.693869)     1.64888 (0.69427)
x2           0.051729 (0.083891)     0.03989 (0.07961)
x3           1.426332 (0.595036)     1.42490 (0.58347)
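For anyone reproducing the R side, here is a minimal sketch that reads the data above inline and refits the model. It assumes the four columns are x1, x2, x3 and y in that order, which the question does not state explicitly.

```r
# Assumption: columns are x1, x2, x3, y (in that order); the question does not name them.
mydata <- read.table(header = FALSE, col.names = c("x1", "x2", "x3", "y"), text = "
2.66 20 0 0
2.89 22 0 0
3.28 24 0 0
2.92 12 0 0
4.00 21 0 1
2.86 17 0 0
2.76 17 0 0
2.87 21 0 0
3.03 25 0 0
3.92 29 0 1
2.63 20 0 0
3.32 23 0 0
3.57 23 0 0
3.26 25 0 1
3.53 26 0 0
2.74 19 0 0
2.75 25 0 0
2.83 19 0 0
3.12 23 1 0
3.16 25 1 1
2.06 22 1 0
3.62 28 1 1
2.89 14 1 0
3.51 26 1 0
3.54 24 1 1
2.83 27 1 1
3.39 17 1 1
2.67 24 1 0
3.65 21 1 1
4.00 23 1 1
3.1 21 1 0
2.39 19 1 1
")

# Same call as in the question
mymodel <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"), data = mydata)

# Estimates and standard errors, plus the log-likelihood for comparison with SAS
est <- coef(summary(mymodel))[, c("Estimate", "Std. Error")]
print(est)
print(logLik(mymodel))
```

Printing the log-likelihood alongside the estimates makes it easier to compare the fit with what PROC QLIM reports.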
Comments (4)
The difference is possibly in the contrast matrix used by default. R uses treatment contrasts, while SAS uses its own. Look up contrasts and contr.SAS in the help. If you use SAS contrasts a lot, you might want to set the contrasts option to them.
To get an idea of how this affects things, compare the treatment and SAS contrast matrices.
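The two coding schemes can be printed with base R's contr.treatment and contr.SAS. The sketch below uses a hypothetical three-level factor purely for illustration. Contrasts only apply to factor predictors, so they would matter for the question's model only if one of x1, x2 or x3 were stored as a factor.

```r
# Hypothetical three-level factor, used only to show the two coding schemes
trt <- contr.treatment(3)  # R default for unordered factors: level 1 is the reference
sas <- contr.SAS(3)        # SAS-style coding: the last level is the reference
print(trt)
print(sas)

# To use SAS-style contrasts for the rest of the session
# (the second entry applies to ordered factors):
# options(contrasts = c("contr.SAS", "contr.poly"))
```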
When I run it in R with your data and code, I get answers close to what you show for the SAS results.
The standard errors are off by a few percent, but that's less surprising.
I also ran it in glmmADMB (available on R-forge), which is a completely different implementation, and got estimates slightly farther from SAS but standard errors closer to it. In any case, the differences were much smaller than the ones you originally reported.
What version of R were you using? (It's possible that something changed between versions, although glm is very stable code.) Are you sure you didn't mess something up?
You should compare which program reports the higher log-likelihood. The numbers may differ simply because the two algorithms use different termination criteria. For example, many algorithms use the norm of the gradient as a stopping rule (e.g., stop when it is less than 0.0005), but each program uses its own specification.
Depending on where the algorithm stops, the estimated variances will also differ, since they are obtained by inverting the Hessian evaluated at the stopping point.
To be sure, check whether R or SAS reports the higher log-likelihood, or compute the log-likelihood by hand from each set of estimates.
If you are required to report exactly the same values from R and SAS, tighten the convergence criteria of both algorithms. With a very tight tolerance (say, below 0.00000005) in both, the two programs should report the same values.
(Unless your likelihood has multiple maxima, which doesn't seem to be the problem here; in that case the final estimates will depend on your initial values.)
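The by-hand check can be sketched in R: evaluate the probit log-likelihood at each set of reported estimates, then refit with a tighter tolerance. The column order (x1, x2, x3, y) and the coefficient order (intercept, x1, x2, x3) are assumptions about the question's data. Note that glm's stopping rule is based on the relative change in deviance (the epsilon argument of glm.control), not on the gradient norm.

```r
# Data from the question; assumed column order x1, x2, x3, y
x1 <- c(2.66, 2.89, 3.28, 2.92, 4.00, 2.86, 2.76, 2.87, 3.03, 3.92, 2.63,
        3.32, 3.57, 3.26, 3.53, 2.74, 2.75, 2.83, 3.12, 3.16, 2.06, 3.62,
        2.89, 3.51, 3.54, 2.83, 3.39, 2.67, 3.65, 4.00, 3.10, 2.39)
x2 <- c(20, 22, 24, 12, 21, 17, 17, 21, 25, 29, 20, 23, 23, 25, 26, 19,
        25, 19, 23, 25, 22, 28, 14, 26, 24, 27, 17, 24, 21, 23, 21, 19)
x3 <- rep(c(0, 1), times = c(18, 14))
y  <- c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1)

# Design matrix with an intercept column
X <- cbind(1, x1, x2, x3)

# Probit log-likelihood evaluated at coefficient vector b
probit_loglik <- function(b) {
  eta <- drop(X %*% b)
  sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}

# Estimates as reported in the question (assumed order: intercept, x1, x2, x3)
b_sas <- c(-7.452320, 1.625810, 0.051729, 1.426332)
b_r   <- c(-7.25319, 1.64888, 0.03989, 1.42490)
ll_sas <- probit_loglik(b_sas)
ll_r   <- probit_loglik(b_r)
print(c(SAS = ll_sas, R = ll_r))

# Refit with a tighter tolerance than the glm default (epsilon = 1e-8)
fit <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"),
           control = glm.control(epsilon = 1e-10, maxit = 100))
print(coef(fit))
print(logLik(fit))
```

Whichever reported set gives the larger log-likelihood is closer to the maximum, and the tightly converged fit gives a reference value to compare both against.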
I'm an R newbie, but I have a suggestion.
Try running the probit with another R package, such as Zelig.
Are the regression coefficients different in this model?