Specifying a formula for glm in R without explicitly declaring each covariate
I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find samples of this in my online searching thus far.
For example (with just 3 variables):
n = 200
set.seed(39)
samp = data.frame(W1 = runif(n, min = 0, max = 1), W2 = runif(n, min = 0, max = 5))
samp = transform(samp, # add A
  A = rbinom(n, 1, 1 / (1 + exp(-(W1^2 - 4 * W1 + 1)))))
samp = transform(samp, # add Y
  Y = rbinom(n, 1, 1 / (1 + exp(-(A - sin(W1^2) + sin(W2^2) * A + 10 * log(W1) * A + 15 * log(W2) - 1 + rnorm(1, mean = 0, sd = .25))))))
If I want to include all main terms, this has an easy shortcut:
glm(Y~., family=binomial, data=samp)
But say I want to include all main terms (W1, W2, and A) plus W2^2:
glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)
Is there a shortcut for this?
[Edit before posting:] This works:
glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
Okay, so what about this one?
I want to omit one main-term variable (W1) and include only two main terms (A, W2), plus W2^2 and the W2^2:A interaction:
glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)
Obviously with just a few variables no shortcut is really needed, but I work with high dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.
Comments (2)
Your creative use of the dot (.) to build a formula containing all or almost all variables is a good, clean approach. Another option that is sometimes useful is to build the formula programmatically as a string (stored in an object such as fla) and then convert it to a formula using as.formula.
Of course, you can make the fla object far more complicated.
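The code from this answer did not survive in this copy of the thread. The following is a minimal sketch of the string-building idea, not the answerer's original code. It assumes a hypothetical data frame `dat` with an outcome `y` and predictors `Var1` to `Var10`; all of these names are illustrative.

```r
# Hypothetical data: a binary outcome y plus ten predictors Var1..Var10
set.seed(1)
dat <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
names(dat) <- paste0("Var", 1:10)
dat$y <- rbinom(100, 1, 0.5)

# Build the formula as a string: "y ~ Var1 + Var2 + ... + Var10"
vars <- paste0("Var", 1:10)
fla <- paste("y ~", paste(vars, collapse = " + "))

# Convert the string to a formula and fit the model
fit <- glm(as.formula(fla), family = binomial, data = dat)
```

Because `fla` is an ordinary character string, extra terms such as `"+ I(Var1^2)"` can be pasted onto it before the conversion. Base R's `reformulate(vars, response = "y")` builds the same main-terms formula without manual pasting.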
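Aniko answered your question. To extend it a bit:
You can also exclude variables from a formula with the minus operator (-), for example to drop one variable from the set that the dot expands to.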
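As a sketch of how exclusion could apply to the second case in the question, the example below rebuilds the question's `samp` data frame, drops `W1` from the dot expansion, and adds the squared term and its interaction with `A`. The exact call is an illustration, since the answer's own code is missing from this copy.

```r
# Rebuild the example data from the question
n <- 200
set.seed(39)
samp <- data.frame(W1 = runif(n, min = 0, max = 1), W2 = runif(n, min = 0, max = 5))
samp <- transform(samp, A = rbinom(n, 1, 1 / (1 + exp(-(W1^2 - 4 * W1 + 1)))))
samp <- transform(samp, Y = rbinom(n, 1, 1 / (1 + exp(-(A - sin(W1^2) + sin(W2^2) * A +
  10 * log(W1) * A + 15 * log(W2) - 1 + rnorm(1, mean = 0, sd = .25))))))

# "." expands to every column except Y; "- W1" then removes that main term.
# I(W2^2) and A:I(W2^2) add the squared term and its interaction with A.
fit <- glm(Y ~ . - W1 + I(W2^2) + A:I(W2^2), family = binomial, data = samp)
names(coef(fit))
```

With hundreds of columns, only the excluded and added terms have to be written out; everything else comes from the dot.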
For large groups of variables, I often make a data frame that assigns each variable to a group, which lets you pick out a whole group at once when building the formula.
Filling that data frame using all kinds of conditions (on name, on structure, whatever) lets me quickly select groups of variables in large datasets.
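The grouping-frame idea might look like the sketch below. The data set `dat`, its column names, the `var_groups` lookup frame, and the rule of grouping by name prefix are all assumptions made for illustration; the answer's original code is not available here.

```r
# Hypothetical wide data set: outcome Y plus demographic and lab variables
set.seed(2)
dat <- data.frame(
  Y = rbinom(50, 1, 0.5),
  demo_age = rnorm(50), demo_bmi = rnorm(50),
  lab_glucose = rnorm(50), lab_crp = rnorm(50)
)

# Lookup frame: one row per candidate predictor with a group label.
# Here the label is the name prefix, but any condition could fill it.
predictors <- setdiff(names(dat), "Y")
var_groups <- data.frame(
  name = predictors,
  group = sub("_.*$", "", predictors),
  stringsAsFactors = FALSE
)

# Select one group and build the formula from its variable names
lab_vars <- var_groups$name[var_groups$group == "lab"]
fla <- as.formula(paste("Y ~", paste(lab_vars, collapse = " + ")))
fit <- glm(fla, family = binomial, data = dat)
```

Keeping the grouping in a separate frame means the selection logic is written once and can be reused across models on the same data.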