All levels of a factor in a model matrix in R
I have a data.frame consisting of numeric and factor variables, as seen below.
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
    Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
    Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
    Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"), 4),
    stringsAsFactors=TRUE)  # needed on R >= 4.0 so that Fourth and Fifth are factors
I want to build out a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.
model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)
As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet, so I am not worried about multicollinearity.
Is there a way to have model.matrix create the dummy for every level of the factor?
Answers (11)
(Trying to redeem myself...) In response to Jared's comment on @fabians' answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. We can then use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided, and the resulting list slots nicely into @fabians' answer.
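The lapply() step described above can be sketched as follows. This is a reconstruction under assumptions rather than the answerer's original code: the seed, the is_fac and full_contrasts names, and stringsAsFactors = TRUE are additions for illustration (on R 4.0 or later strings are not converted to factors by default, and contrasts() only accepts factors).

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Find the factor columns (is_fac is a hypothetical helper name)
is_fac <- vapply(testFrame, is.factor, logical(1))

# contrasts = FALSE asks for an identity matrix: one column per level
full_contrasts <- lapply(testFrame[is_fac], contrasts, contrasts = FALSE)

# Pass the named list to model.matrix() so no level is dropped
X <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)
colnames(X)
```

Because the list is built from whichever columns are factors, nothing has to be retyped when factor columns are added or removed.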
You need to reset the contrasts for the factor variables by passing contrasts.arg to model.matrix. A variant with a little less typing, but without the proper column names, is also possible.
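A minimal sketch of both variants, assuming the same testFrame as in the question. The seed, the stringsAsFactors = TRUE argument, and the X_named and X_diag names are illustrative additions, not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Variant 1: full (identity) contrast matrices that keep the level names
X_named <- model.matrix(
  ~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(
    Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
    Fifth  = contrasts(testFrame$Fifth,  contrasts = FALSE)
  )
)

# Variant 2: plain identity matrices; less typing, but the resulting
# column names no longer carry the level labels
X_diag <- model.matrix(
  ~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(
    Fourth = diag(nlevels(testFrame$Fourth)),
    Fifth  = diag(nlevels(testFrame$Fifth))
  )
)
```

Both matrices hold the same indicator values; only the column labels differ.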
caret implements a nice function, dummyVars, that achieves this in two lines:
library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))
The final columns of testFrame2 hold the new dummy variables.
The nicest point here is that you get the original data frame back with the dummy variables added, and without the original variables that were used for the transformation.
More info: http://amunategui.github.io/dummyVar-Walkthrough/
dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

Ok. Just reading the above and putting it all together. Suppose you wanted the matrix, e.g. 'X.factors', that multiplies by your coefficient vector to get your linear predictor. There are still a couple of extra steps.
(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)
Then say you get a result in which the first column of each factor, its reference level, is marked with **.
We want to get rid of the **'d reference levels of each factor.
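The extra steps can be sketched in base R as below. This is a reconstruction under assumptions: only the name X.factors comes from the answer, while the seed, is_fac, att, factor_terms, ref_cols and X.reduced are hypothetical names. The "assign" attribute of a model matrix maps each column to its term, so the first column of every term that spans more than one column is that factor's reference level.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Single-bracket indexing keeps a data frame even with one factor column
is_fac <- vapply(testFrame, is.factor, logical(1))
X.factors <- model.matrix(
  ~ ., data = testFrame,
  contrasts.arg = lapply(testFrame[is_fac], contrasts, contrasts = FALSE)
)

# Terms that occupy more than one column are the factors
att <- attr(X.factors, "assign")
factor_terms <- unique(att[duplicated(att)])

# match() returns the first column of each such term: the reference level
ref_cols <- match(factor_terms, att)
X.reduced <- X.factors[, -ref_cols, drop = FALSE]
colnames(X.reduced)
```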
A tidyverse answer yields the desired result (the same as @Gavin Simpson's answer).
Using the R package 'CatEncoders'.
I am currently learning the Lasso model along with glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will kill our time, as suggested by the author of glmnet). Just sharing that there is a tidy bit of code that gets the same result as @fabians' and @Gavin's answers. Meanwhile, @asdf123 introduced another package, library('CatEncoders'), as well.
Source: R for Everyone: Advanced Analytics and Graphics (page 273)
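For the high-dimensional case mentioned above, a hedged sketch with Matrix::sparse.model.matrix() follows. It is not the code from the cited book; it assumes that the contrasts.arg idiom from the earlier answers carries over to the sparse version, and the seed and the is_fac, full_contrasts and X_sparse names are illustrative.

```r
library(Matrix)

set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Identity contrast matrices: one indicator column per factor level
is_fac <- vapply(testFrame, is.factor, logical(1))
full_contrasts <- lapply(testFrame[is_fac], contrasts, contrasts = FALSE)

# Same formula interface as model.matrix(), but the result is stored sparsely
X_sparse <- sparse.model.matrix(~ ., data = testFrame,
                                contrasts.arg = full_contrasts)
dim(X_sparse)
```

glmnet accepts sparse matrices of this kind as its x argument, which avoids materialising the many zero entries that indicator columns produce.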
or
should be the most straightforward
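One short base R idiom in this spirit, offered only as an assumption about what may have been meant, is to remove the intercept from the formula with - 1 (or, equivalently, + 0). The seed and the X_noint name are illustrative. Note the limitation: only the first factor in the formula gets a column for every level, while later factors still lose their reference level.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept, so Fourth keeps all four levels;
# Fifth is still coded against its first level (Edward)
X_noint <- model.matrix(~ First + Second + Third + Fourth + Fifth - 1,
                        data = testFrame)
colnames(X_noint)
```

So this shortcut fully answers the question only when the data contain a single factor.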
I wrote a package called ModelMatrixModel to improve on the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns an object containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned object also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also provides several other options, such as handling invalid factor levels and scaling the output.
You can use tidyverse to achieve this without specifying each column manually. The trick is to make a "long" data frame. Then munge a few things, and spread it back to wide to create the indicator/dummy variables.
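The long-then-wide trick can be sketched with dplyr and tidyr as below. This is an illustration under assumptions, not the answerer's code: it uses pivot_longer() and pivot_wider() in place of the older gather() and spread(), it selects the character columns automatically with where(is.character), and the seed and the id, feature, level, indicator and wide names are hypothetical.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = FALSE  # keep names as character for pivoting
)

wide <- testFrame %>%
  # a row id keeps every original row distinct while the data are long
  mutate(id = row_number()) %>%
  # long format: one row per (original row, categorical column)
  pivot_longer(where(is.character), names_to = "feature", values_to = "level") %>%
  mutate(indicator = 1) %>%
  # build names such as "Fourth_Alice"
  unite("feature_level", feature, level, sep = "_") %>%
  # back to wide: one indicator column per level, 0 where absent
  pivot_wider(names_from = feature_level, values_from = indicator,
              values_fill = 0) %>%
  select(-id)

names(wide)
```

The numeric columns pass through unchanged because they stay out of the pivot.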