Converting coefficient names to a formula in R

Posted on 2024-10-04 02:37:50

When using formulas that have factors, the fitted models name the coefficients XY, where X is the name of the factor and Y is a particular level of it. I want to be able to create a formula from the names of these coefficients.

The reason: If I fit a lasso to a sparse design matrix (as I do below) I would like to create a new formula object that only contains terms for the nonzero coefficients.

library(MatrixModels)
library(glmnet)
set.seed(1)
n <- 200
# A 26-level factor and an integer covariate
Z <- data.frame(letter = factor(sample(letters, n, replace = TRUE), letters),
                x = sample(1:20, n, replace = TRUE))
f <- ~ letter + x:letter + I(x > 5):letter
X <- sparse.model.matrix(f, Z)
beta <- matrix(rnorm(ncol(X), 0, 5), ncol(X), 1)
y <- X %*% beta + rnorm(n)

# Lasso fit; keep the names of the nonzero coefficients
myfit <- glmnet(X, as.vector(y), lambda = .05)
fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]
fnew
 [1] "letterb"              "letterc"              "lettere"             
 [4] "letterf"              "letterg"              "letterh"             
 [7] "letterj"              "letterm"              "lettern"             
[10] "lettero"              "letterp"              "letterr"             
[13] "letters"              "lettert"              "letteru"             
[16] "letterw"              "lettery"              "letterz"             
[19] "lettera:x"            "letterb:x"            "letterc:x"           
[22] "letterd:x"            "lettere:x"            "letterf:x"           
[25] "letterg:x"            "letterh:x"            "letteri:x"           
[28] "letterj:x"            "letterk:x"            "letterl:x"           
[31] "letterm:x"            "lettern:x"            "lettero:x"           
[34] "letterp:x"            "letterq:x"            "letterr:x"           
[37] "letters:x"            "lettert:x"            "letteru:x"           
[40] "letterv:x"            "letterw:x"            "letterx:x"           
[43] "lettery:x"            "letterz:x"            "letterb:I(x > 5)TRUE"
[46] "letterc:I(x > 5)TRUE" "letterd:I(x > 5)TRUE" "lettere:I(x > 5)TRUE"
[49] "letteri:I(x > 5)TRUE" "letterj:I(x > 5)TRUE" "letterl:I(x > 5)TRUE"
[52] "letterm:I(x > 5)TRUE" "letterp:I(x > 5)TRUE" "letterq:I(x > 5)TRUE"
[55] "letterr:I(x > 5)TRUE" "letteru:I(x > 5)TRUE" "letterv:I(x > 5)TRUE"
[58] "letterx:I(x > 5)TRUE" "lettery:I(x > 5)TRUE" "letterz:I(x > 5)TRUE"

From this I would like to have a formula

~ I(letter=="d") + I(letter=="e") + ...(etc)

I checked out formula() and all.vars() to no avail. Also, writing a function to parse these names is a bit of a pain because of the different types of terms that can arise: for example, x:letter when x is numeric and letter is a factor, or I(x>5):letter as another annoying case.

So is there some function I am not aware of that converts between a formula and its character representation and back again?

尝蛊 2024-10-11 02:37:51

When I ran the code, I got something a bit different, since set.seed() had not been specified. Instead of using the variable name "letter", I used "letter_" as a convenient splitting marker:

> fnew <- rownames(myfit$beta)[which(myfit$beta != 0)]

> fnew
 [1] "letter_c" "letter_d" "letter_e" "letter_f" "letter_h" "letter_k" "letter_l"
 [8] "letter_o" "letter_q" "letter_r" "letter_s" "letter_t" "letter_u" "letter_v"
[15] "letter_w"

Then I split the names and packaged the pieces into a character matrix:

> fnewmtx <- cbind( lapply(sapply(fnew, strsplit, split="_"), "[[", 2),
+ lapply(sapply(fnew, strsplit, split="_"), "[[", 1))

> fnewmtx
         [,1] [,2]
letter_c "c"  "letter"
letter_d "d"  "letter"
letter_e "e"  "letter"
letter_f "f"  "letter"
(snipped the rest)

I then wrapped the output of the paste() calls in as.formula(), which is half of the answer to how to "convert between formula and its character representation and back." The other half is as.character().

form <- as.formula(
  paste("~",
        paste(paste(" I(", fnewmtx[,2], "_ ==", "'", fnewmtx[,1], "') ", sep=""),
              sep="", collapse="+"))
)  # edit: needed to add back the underscore

And the output is now an appropriate class object:

> class(form)
[1] "formula"
> form
~I(letter_ == "c") + I(letter_ == "d") + I(letter_ == "e") + 
    I(letter_ == "f") + I(letter_ == "h") + I(letter_ == "k") + 
    I(letter_ == "l") + I(letter_ == "o") + I(letter_ == "q") + 
    I(letter_ == "r") + I(letter_ == "s") + I(letter_ == "t") + 
    I(letter_ == "u") + I(letter_ == "v") + I(letter_ == "w")

I find it interesting that the as.formula conversion made the single-quotes around the letters into double-quotes.
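
The round trip between a formula and its character form can be sketched with base R alone. This is only an illustration using the formula from the question; the object names (f_chr, f_back, parts, labs, f_sub) are made up for the example, and reformulate() is shown as one possible way to rebuild a formula from a character vector of term labels, not as something the original answer used.

```r
# Formula -> character: deparse() returns the formula as text
# (paste the pieces together in case a long formula is split over lines).
f <- ~ letter + x:letter + I(x > 5):letter
f_chr <- paste(deparse(f), collapse = " ")

# Character -> formula: as.formula() parses the string again.
f_back <- as.formula(f_chr)

# as.character() on a one-sided formula gives the operator and the
# right-hand side as two strings.
parts <- as.character(f)

# The individual terms are available as strings via terms(), and
# reformulate() builds a formula from such a character vector.
labs <- attr(terms(f), "term.labels")
f_sub <- reformulate(labs[1:2])   # keep only the first two terms
```

Because the string is parsed and then printed again by R, quoting style and spacing in the printed formula follow R's own deparser rather than the text that was pasted together.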

Edit: Now that the problem has an additional dimension or two, my suggestion is to skip the recreation of the formula. Note that the rownames of myfit$beta are exactly the same as the column names of X, so instead use the non-zero rownames as indices to select columns in the X matrix:

> str(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] )
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:429] 9 54 91 157 166 37 55 68 117 131 ...
  ..@ p       : int [1:61] 0 5 13 20 28 36 42 50 60 68 ...
  ..@ Dim     : int [1:2] 200 60
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:200] "1" "2" "3" "4" ...
  .. ..$ : chr [1:60] "letter_b" "letter_c" "letter_e" "letter_f" ...
  ..@ x       : num [1:429] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()

> myfit2 <- glmnet(X[ , which( colnames(X) %in% rownames(myfit$beta)[which(myfit$beta != 0)] )] ,as.vector(y),lambda=.05)
> myfit2

Call:  glmnet(x = X[, which(colnames(X) %in% rownames(myfit$beta)[
                                           which(myfit$beta != 0)])], 
              y = as.vector(y), lambda = 0.05) 

     Df   %Dev Lambda
[1,] 60 0.9996   0.05
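
The same column-selection idea can be tried without glmnet. The sketch below is an assumption-laden toy: Z, X, beta, keep and X_small are hypothetical names, base model.matrix() stands in for sparse.model.matrix(), and the named vector beta stands in for the fitted coefficients.

```r
# Toy data: a small factor and a numeric covariate (both hypothetical).
Z <- data.frame(g = factor(c("a", "b", "c", "a", "b", "c")),
                x = c(1, 2, 3, 4, 5, 6))
X <- model.matrix(~ g + x:g, Z)

# Stand-in for the fitted coefficients: a named vector with exact zeros,
# named after the columns of X just as the rows of myfit$beta are.
beta <- setNames(numeric(ncol(X)), colnames(X))
beta[c("gb", "ga:x", "gc:x")] <- c(1.5, -0.3, 2)

# Select the design-matrix columns by name; drop = FALSE keeps a matrix
# even if only one column survives.
keep <- names(beta)[beta != 0]
X_small <- X[, keep, drop = FALSE]
```

Indexing by name avoids any parsing of the coefficient labels, which is why it sidesteps the mixed term types (factor levels, numeric interactions, I() terms) entirely.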

寄意 2024-10-11 02:37:51

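Christopher, after some thought and a look at sparse.model.matrix and related functions, what you are asking for appears to be somewhat involved. You haven't explained why you don't want to form the full sparse model matrix for X_test, so it is difficult to suggest a way forward other than the two options below.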

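If you have a large number of observations in X_test and so, for computational reasons, don't want to produce the full sparse matrix for use in predict(), it may be more convenient to split X_test into two or more chunks of samples and form the sparse model matrix for each chunk in turn, discarding it after use.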
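
A minimal sketch of the chunking idea, under stated assumptions: Z_test, beta, chunk_size and pred are hypothetical names, base model.matrix() is used so the example runs without extra packages (sparse.model.matrix() would take its place in practice), and a plain matrix product stands in for predict().

```r
# Hypothetical stand-ins: Z_test plays the role of the test data and
# beta the role of the fitted coefficients (one per model-matrix column).
f <- ~ g + x:g
Z_test <- data.frame(g = factor(rep(c("a", "b", "c"), length.out = 10),
                                levels = c("a", "b", "c")),
                     x = 1:10)
beta <- c(0.5, 1, -1, 0.2, 0, 0.3)   # illustrative values only

# Split the row indices into blocks of at most chunk_size rows.
chunk_size <- 4
idx <- seq_len(nrow(Z_test))
chunks <- split(idx, ceiling(idx / chunk_size))

# Build the model matrix for one block at a time, use it, then let it go.
# Subsetting a data frame keeps the factor levels, so every block
# produces the same columns in the same order.
pred <- unlist(lapply(chunks, function(i) {
  X_i <- model.matrix(f, Z_test[i, , drop = FALSE])
  as.vector(X_i %*% beta)
}), use.names = FALSE)
```

The point to watch is that every chunk must see the full set of factor levels; if chunks were read from separate files instead, the levels would need to be set explicitly before building each matrix.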

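Failing that, you will need to study the code in the Matrix package in detail. Start with sparse.model.matrix, note that it then calls Matrix:::model.spmatrix, and locate the calls to Matrix:::fac2Sparse in that function. You will probably need to borrow code from these functions, but use a modified fac2Sparse to achieve what you want.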

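Sorry I can't provide an off-the-shelf script to do this, but it is a substantial coding task. If you go down this route, check out the Sparse Model Matrices vignette in the Matrix package and get the package sources (from CRAN) to see whether the functions I mention are better documented in the source code (there is no Rd file for fac2Sparse, for example). You could also ask the authors of Matrix (Martin Maechler and Doug Bates) for advice, but note that both of them have a particularly heavy teaching load this term.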

Good luck!