在模型列表上使用stepAIC

发布于 2025-01-02 11:53:34 字数 5581 浏览 0 评论 0原文

我想在线性模型列表上使用 AIC 进行逐步回归。想法是使用线性模型的 ea 列表，然后对每个列表元素应用 stepAIC。它失败了。

我试图找出问题所在。我想我发现了问题。但是，我不明白原因。尝试代码来看看三种情况之间的区别：

require(MASS)
n<-30 
x1<-rnorm(n, mean=0, sd=1) #create rv x1 
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable 
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10)) 
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon 
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d)) 
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())-  works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d))) 
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k]) 
stepAIC(lin.model.id1) # check stepAIC - works!

我很确定stepAIC()需要来自data.frame“dat”的原始数据。这就是我之前的想法。（希望我是对的）但是 stepAIC() 中没有参数可以传递原始数据帧。显然，对于未包含在列表中的普通模型，传递模型就足够了。（代码中的最后三行）所以我想知道：

Q1：stepAIC如何知道在哪里可以找到原始数据“dat”（不仅仅是作为参数传递的模型数据）？
问题 2：我怎么可能知道 stepAIC() 中还有另一个参数在帮助页面中没有明确说明？（也许我的英文太差了，找不到）
Q3：如何将该参数传递给stepAIC()？

它必须位于 apply 函数环境中的某个位置并传递数据。 lm() 或 stepAIC() 以及指向原始数据的指针/链接一定会在某个地方丢失。我不太明白 R 中的环境是做什么的。对我来说，这有点将局部变量与全局变量隔离开来。但也许情况更复杂。任何人都可以向我解释一下上述问题吗？老实说，我没有从 R 中读到太多内容文档。任何更好的理解都会对我有帮助。

旧：我的数据框 df 中有数据，可以分为几个子组。为此，我创建了一个名为 df$id 的 groupID。 lm() 返回第一个子组的预期系数。我想使用 AIC 作为每个子组的标准分别进行逐步回归。我使用 lmList {lme4} 为每个子组（id）生成一个模型。但如果我对列表元素使用 stepAIC{MASS} ，则会引发错误。见下文。

所以问题是：我的程序/语法中有什么错误？我得到了单个模型的结果，但没有得到使用 lmList 创建的模型的结果。 lmList() 存储的模型信息是否与 lm() 不同？
但在帮助中它指出：类“lmList”：具有公共模型的 lm 类对象列表。

>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept)          Gap.um     Standoff.um  Voidflaeche.px  
  62.306133       -0.009878        0.026317       -0.015048  

>stepAIC(lme4.list.lm[[1]], direction="backward") 
#stepAIC on first element on the list of linear models
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Error in terms.formula(formula, data = data) : 
'data' argument is of the wrong type

显然有些东西不适用于该列表。但我不知道它可能是什么。因为我尝试对创建相同模型（至少相同系数）的基础包执行相同的操作。结果如下：

>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),]) 
# id is in order, so should be the same subgroup as for the first list element in lmList

Coefficients:  
(Intercept)    Gap.um  Standoff.um  Voidflaeche.px  
  62.306133 -0.009878     0.026317       -0.015048

嗯，这是我在 Linear.model 上使用 stepAIC 得到的结果。据我所知，akaike 信息准则可用于估计在给定一些数据的情况下哪个模型在拟合和泛化之间更好的平衡。

>stepAIC(lin.model,direction="backward")
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14  
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Step:  AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Gap.um          1     28.51 7215.8 291.38
 <none>                        7187.3 293.14
- Voidflaeche.px  1    717.63 7904.9 296.85

Step:  AIC=291.38
Scherkraft.N ~ Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
<none>                        7215.8 291.38
- Voidflaeche.px  1    795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])

Coefficients:
(Intercept)  Voidflaeche.px  
   71.7183         -0.0151

我从输出中读到我应该使用模型：Scherkraft.N ~ Voidflaeche.px 因为这是最小的 AIC。好吧，如果有人能简短地描述一下输出，那就太好了。我对逐步回归（假设向后消除）的理解是所有回归量都包含在初始模型中。然后最不重要的一项就被淘汰了。决定的标准是AIC。等等......不知怎的，我在正确解释表格时遇到了问题。如果有人能证实我的解释，那就太好了。 “-”（减号）代表消除的回归量。顶部是“起始”模型，下面的表格中计算了可能消除的 RSS 和 AIC。因此，第一个表中的第一行表示模型 Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px - Standoff.um 将导致 AIC 293.14 。选择没有 Standoff.um 的：Scherkraft.N~Gap.um+Voidflaeche.px

编辑：
我用 dlply() 替换 lmList{lme4} 以创建模型列表。 stepAIC 仍然无法处理该列表。它抛出另一个错误。实际上，我认为这是AIC需要运行的数据步骤的问题。我想知道它如何仅根据模型数据计算每个步骤的 AIC 值。我会利用原始数据构建模型，每次都会留下一个回归量。我会计算 AIC 并进行比较。那么如果stepAIC无法访问原始数据，它是如何工作的呢？（我看不到将原始数据传递给stepAIC的参数）。不过，我不知道为什么它适用于普通模型，但不适用于包装在列表中的模型。

>model.list.all <- dlply(df, .id, function(x) 
  {return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found

原文

I want to do stepwise regression using AIC on a list of linear models. idea is to use e a list of linear models and then apply stepAIC on each list element. It fails.

I tried to track the problem down. I think I found the problem. However, I don't understand the cause. Try the code to see the difference between three cases:

require(MASS)
n<-30 
x1<-rnorm(n, mean=0, sd=1) #create rv x1 
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable 
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10)) 
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon 
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d)) 
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())-  works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d))) 
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k]) 
stepAIC(lin.model.id1) # check stepAIC - works!

I am pretty sure that stepAIC() needs the original data from data.frame "dat". That is what I was thinking of before. (Hope I am right on that)
But there is no parameter in stepAIC() where I can pass the original data frame. Obviously, for plain models not wrapped in a list it's enough to pass the model. (last three lines in code) So I am wondering:

Q1: How does stepAIC knows where to find the original data "dat" (not only the model data which is passed as parameter)?
Q2: How can I possibly know that there is another parameter in stepAIC() which is not explicitly stated in the help pages? (maybe my English is just too bad to find)
Q3: How can I pass that parameter to stepAIC()?

It must be somewhere in the environment of the apply function and passing on the data. Either lm() or stepAIC() and the pointer/link to the raw data must get lost somewhere. I have not a good understanding what an environment in R does. For me it was kind of isolating local from global variables. But maybe its more complicated. Anyone who can explain that to me in regard to the problem above? Honestly, I dont read much out of the R documentation. Any better understanding would help me.

OLD:
I have data in a dataframe df that can be split into several subgroups. For that purpose I created a groupID called df$id. lm() returns the coefficent as expected for the first subgroup. I want to do a stepwise regression using AIC as criterion for each subgroup separately. I use lmList {lme4} which results in a model for each subgroup (id). But if I use stepAIC{MASS} for the list elements it throws an error. see below.

So the question is: What mistake is in my procedure/syntax? I get results for single models but not the ones created with lmList. Does lmList() store different information on the model than lm() does?
But in the help it states:
class "lmList": A list of objects of class lm with a common model.

>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept)          Gap.um     Standoff.um  Voidflaeche.px  
  62.306133       -0.009878        0.026317       -0.015048  

>stepAIC(lme4.list.lm[[1]], direction="backward") 
#stepAIC on first element on the list of linear models
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Error in terms.formula(formula, data = data) : 
'data' argument is of the wrong type

Obviously something does not work with the list. But I have not an idea what it might be.
Since I tried to do the same with the base package which creates the same model (at least the same coefficients). Results are below:

>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),]) 
# id is in order, so should be the same subgroup as for the first list element in lmList

Coefficients:  
(Intercept)    Gap.um  Standoff.um  Voidflaeche.px  
  62.306133 -0.009878     0.026317       -0.015048

Well, this is what I get returned using stepAIC on my linear.model .
As far as I know the akaike information criterion can be used to estimate which model better balances between fit and generalization given some data.

>stepAIC(lin.model,direction="backward")
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14  
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Step:  AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Gap.um          1     28.51 7215.8 291.38
 <none>                        7187.3 293.14
- Voidflaeche.px  1    717.63 7904.9 296.85

Step:  AIC=291.38
Scherkraft.N ~ Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
<none>                        7215.8 291.38
- Voidflaeche.px  1    795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])

Coefficients:
(Intercept)  Voidflaeche.px  
   71.7183         -0.0151

I read from the output I should use the model: Scherkraft.N ~ Voidflaeche.px because this is the minimal AIC. Well, it would be nice if someone could shortly describe the output. My understanding of the stepwise regression (assuming backwards elimination) is all regressors are included in the initial model. Then the least important one is eliminated. The criterion to decide is the AIC. and so forth... Somehow I have problems to get the tables interpreted right. It would be nice if someone could confirm my interpretation. The "-"(minus) stands for the eliminated regressor. On top is the "start" model and in the table table below the RSS and AIC are calculated for possible eliminations. So the first row in the first table says a model Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px - Standoff.um would result in an AIC 293.14. Choose the one without Standoff.um: Scherkraft.N~Gap.um+Voidflaeche.px

EDIT:
I replaced the lmList{lme4} with dlply() to create the list of models.
Still stepAIC is not coping with the list. It throws another error. Actually, I believe it is a problem with the data stepAIC needs to run through. I was wondering how it calculates the AIC-value for each step just from the model data. I would take the original data to construct the models leaving one regressor out each time. Thereof I would calculate the AIC and compare. So how stepAIC is working if it has not access to the original data. (I cant see a parameter where I pass the original data to stepAIC). Still, I have no clue why it works with a plain model but not with the model wrapped in a list.

>model.list.all <- dlply(df, .id, function(x) 
  {return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

心是晴朗的。 2025-01-09 11:53:34

我不确定版本控制中可能发生了什么变化，导致调试变得如此困难，但一种解决方案是使用 do.call，它在执行调用之前评估调用中的表达式。这意味着，update 和 stepAIC 需要去查找 d，而不是在调用中仅存储 d为了完成他们的工作，它存储数据帧本身的完整表示。

也就是说，使用 do

do.call("lm", list(y~x1+x2+x3, data=d))

而不是

lm(y~x1+x2+x3, data=d)

您可以通过查看模型的 call 元素来了解它正在尝试执行的操作，可能如下所示：

dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
                            do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call

还可以在全局环境中创建数据框列表，并且然后构建调用，以便 update 和 stepAIC 依次查找每个数据帧，因为它们的环境链总是返回到全局环境；像这样：

dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
            do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )

要查看更改的内容，请再次运行 dat.lin.model.lst[[1]]$call。

I'm not sure what may have changed in the versioning to make the debugging so difficult, but one solution would be to use do.call, which evaluates the expressions in the call before executing it. This means that instead of storing just d in the call, so that update and stepAIC need to go find d in order to do their work, it stores a full representation of the data frame itself.

That is, do

do.call("lm", list(y~x1+x2+x3, data=d))

instead of

lm(y~x1+x2+x3, data=d)

You can see what it's trying to do by looking at the call element of the model, perhaps like this:

dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
                            do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call

It's also possible to make your list of data frames in the global environment and then construct the call so that update and stepAIC look for each data frame in turn, because their environment chains always lead back to the global environment; like this:

dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
            do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )

To see what's changed, run dat.lin.model.lst[[1]]$call again.

回复收藏 0 原文

ゞ记忆︶ㄣ 2025-01-09 11:53:34

由于stepAIC似乎脱离循环环境（即在全局环境中）来查找它需要的数据，我使用分配函数来欺骗它：

    results <- do.call(rbind, lapply(response, function (i) { 
    assign("i", response, envir = .GlobalEnv)
            mdl <- gls(as.formula(paste0(i,"~",paste(expvar, collapse = "+")), data= parevt, correlation = corARMA(p=1,q=1,form= ~as.integer(Year)), weights= varIdent(~1/Linf_var), method="ML")
            mdl <- stepAIC(mdl, direction ="backward")
}))

As it seems that stepAIC goes out of loop environment (that is in global environment) to look for the data it needs, I trick it using the assign function:

    results <- do.call(rbind, lapply(response, function (i) { 
    assign("i", response, envir = .GlobalEnv)
            mdl <- gls(as.formula(paste0(i,"~",paste(expvar, collapse = "+")), data= parevt, correlation = corARMA(p=1,q=1,form= ~as.integer(Year)), weights= varIdent(~1/Linf_var), method="ML")
            mdl <- stepAIC(mdl, direction ="backward")
}))

回复收藏 0 原文

~没有更多了~