Specifying a formula in R with glm without explicitly declaring each covariate

Posted on 2024-09-16 11:24:21

I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my online searching thus far.

For example (with just 3 variables):

n <- 200
set.seed(39)
samp <- data.frame(W1 = runif(n, min = 0, max = 1), W2 = runif(n, min = 0, max = 5))
# add A
samp <- transform(samp, A = rbinom(n, 1, 1/(1 + exp(-(W1^2 - 4*W1 + 1)))))
# add Y
samp <- transform(samp, Y = rbinom(n, 1, 1/(1 + exp(-(A - sin(W1^2) + sin(W2^2)*A + 10*log(W1)*A + 15*log(W2) - 1 + rnorm(1, mean = 0, sd = .25))))))

If I want to include all main terms, there is an easy shortcut:

glm(Y~., family=binomial, data=samp)

But say I want to include all main terms (W1, W2, and A) plus W2^2:

glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)

Is there a shortcut for this?

[Edit before posting:] This works: glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
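
One way to check what `.` expands to, without fitting a model, is to pass the formula and the data to terms() and read the term labels. This is a minimal sketch; `samp_demo` is a hypothetical stand-in data frame with the same column names as samp, and its values are arbitrary.

```r
# Stand-in data with the same column names as `samp` (values are arbitrary).
samp_demo <- data.frame(W1 = c(0.2, 0.5, 0.9), W2 = c(1, 2.5, 4),
                        A = c(0, 1, 1), Y = c(1, 0, 1))

# terms() expands `.` to every column of `data` that is not on the left-hand side.
tt <- terms(Y ~ . + I(W2^2), data = samp_demo)
labels_all <- attr(tt, "term.labels")
print(labels_all)
```

The printed labels should list the three main terms plus the squared term, which is the same set of terms as the fully written-out formula.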

Okay, so what about this one!

I want to omit one main-term variable (W1) and include only two main terms (A, W2) plus W2^2 and the W2^2:A interaction:

glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)

Obviously with just a few variables no shortcut is really needed, but I work with high dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.

Comments (2)

长伴 2024-09-23 11:24:22

Your creative use of . to build a formula containing all or almost all variables is a good and clean approach. Another option that is sometimes useful is to build the formula programmatically as a string, and then convert it to a formula using as.formula:

vars <- paste("Var",1:10,sep="")
fla <- paste("y ~", paste(vars, collapse="+"))
as.formula(fla)

Of course, you can make the fla object way more complicated.
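
As a related sketch, base R's reformulate() builds a formula from a character vector of term labels, which avoids pasting the "+" signs by hand. The names `all_vars`, `main_terms`, and `fla2` below are made up for illustration; in practice `all_vars` would come from names() of the real data.

```r
# Stand-in column names; in practice use names(samp) or the names of the real data.
all_vars <- c("W1", "W2", "A", "Y")

# Main terms: everything except the response and the variable to leave out.
main_terms <- setdiff(all_vars, c("Y", "W1"))

# reformulate() joins the term labels with "+" and attaches the response.
fla2 <- reformulate(c(main_terms, "I(W2^2)", "A:I(W2^2)"), response = "Y")
print(fla2)
```

The result is an ordinary formula object, so it can be passed straight to glm() in place of a hand-written one.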

云裳 2024-09-23 11:24:22

Aniko answered your question. To extend a bit:

You can also exclude variables using -:

glm(Y~.-W1+A*I(W2^2), family=binomial, data=samp)
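
To see which terms survive the exclusion, the same formula can be expanded with terms() before fitting. This is a small sketch on a hypothetical stand-in data frame (`samp_excl`, arbitrary values, same column names as samp).

```r
# Stand-in data with the same column names as `samp` (values are arbitrary).
samp_excl <- data.frame(W1 = c(0.1, 0.4, 0.8), W2 = c(0.5, 2, 4.5),
                        A = c(1, 0, 1), Y = c(0, 1, 1))

# `.` expands to W1 + W2 + A, `- W1` removes W1, and A * I(W2^2)
# adds A, I(W2^2) and their interaction (duplicate terms are merged).
tt_excl <- terms(Y ~ . - W1 + A * I(W2^2), data = samp_excl)
labels_excl <- attr(tt_excl, "term.labels")
print(labels_excl)
```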

For large groups of variables, I often make a data frame that flags which group each variable belongs to, which allows you to do something like:

vars <- data.frame(
    names = names(samp),
    main = c(TRUE, FALSE, TRUE, FALSE),
    quadratic = c(FALSE, TRUE, TRUE, FALSE),
    main2 = c(TRUE, TRUE, FALSE, FALSE),
    stringsAsFactors = FALSE
)


regform <- paste(
    "Y ~",
    paste(
      paste(vars[vars$main,1],collapse="+"),
      paste(vars[1,1],paste("*I(",vars[vars$quadratic,1],"^2)"),collapse="+"),
      sep="+"
    )
)
> regform
[1] "Y ~ W1+A+W1 *I( W2 ^2)+W1 *I( A ^2)"

> glm(as.formula(regform),data=samp,family=binomial)

Filling this data frame using all kinds of conditions (on name, on structure, whatever) lets me quickly select groups of variables in large datasets.
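
For wider data, the flags in such a lookup table can be filled from name patterns instead of being typed by hand. The sketch below assumes hypothetical column names (W1 to W6, A, Y) and arbitrary grouping rules; only the pattern of building the table and then the formula is the point.

```r
# Hypothetical wide data set: only the column names matter here.
col_names <- c(paste0("W", 1:6), "A", "Y")

# Lookup table: one row per column, one logical flag per variable group.
groups <- data.frame(
  names     = col_names,
  main      = grepl("^W[1-4]$", col_names) | col_names == "A",
  quadratic = col_names %in% c("W2", "W5"),
  stringsAsFactors = FALSE
)

# Assemble the term labels for each group, then build the formula.
main_part <- groups$names[groups$main]
quad_part <- paste0("I(", groups$names[groups$quadratic], "^2)")
regform2 <- reformulate(c(main_part, quad_part), response = "Y")
print(regform2)
```

Changing a selection rule then only means editing one line of the table, not rewriting a formula with hundreds of terms.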
