Why does sparse.model.matrix() ignore some columns?
Can someone help me? I want to prepare data for an XGBoost prediction, so I need to convert my factor variables. I use sparse.model.matrix(), but there is a problem: I don't understand why the function drops some of the columns. Let me try to explain. My dataset has many variables, but right now these three are important:
- Tsunami.Event.Validity - a factor with 6 levels: -1, 0, 1, 2, 3, 4
- Tsunami.Cause.Code - a factor with 6 levels: 0, 1, 2, 3, 4, 5
- Total.Death.Description - a factor with 5 levels: 0, 1, 2, 3, 4
But when I use sparse.model.matrix() I only get a matrix with 15 columns, not 6 + 6 + 5 = 17. Can someone help me?
sp_matrix = sparse.model.matrix(Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description -1, data = datas)
str(sp_matrix)
Output:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2510] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:16] 0 749 757 779 823 892 1495 2191 2239 2241 ...
..@ Dim : int [1:2] 749 15
..@ Dimnames:List of 2
.. ..$ : chr [1:749] "1" "2" "3" "4" ...
.. ..$ : chr [1:15] "Tsunami.Event.Validity-1" "Tsunami.Event.Validity0" "Tsunami.Event.Validity1" "Tsunami.Event.Validity2" ...
..@ x : num [1:2510] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
..$ assign : int [1:15] 0 1 1 1 1 1 2 2 2 2 ...
..$ contrasts:List of 3
.. ..$ Tsunami.Event.Validity : chr "contr.treatment"
.. ..$ Tsunami.Cause.Code : chr "contr.treatment"
.. ..$ Total.Death.Description: chr "contr.treatment"
1 Answer
This question is a duplicate of "In R, for categorical data with N unique categories, why does sparse.model.matrix() not produce a one-hot encoding with N columns?" ... but that question was never answered.
The answers to this question explain how you could get the full model matrix you're looking for, but don't explain why you might not want to. (For what it's worth, unlike regular linear models, regression trees are robust to multicollinearity, so a full model matrix would actually work in this case, but it's worth understanding why R gives you the answer it does, and why this won't hurt your predictive accuracy ...)
This is a fundamental property of the way that linear models based (additively) on more than one categorical predictor work (and hence of the way that R constructs model matrices). When you construct a model matrix based on factors f1, ..., fn with numbers of levels n1, ..., nn, the number of predictor variables is 1 + sum(ni - 1), not sum(ni). (In your case that is 1 + (6-1) + (6-1) + (5-1) = 15, exactly the number of columns you got.) Let's see how this works with a slightly simpler example, sketched below: with three two-level factors we have a total of (1 + 3*(2-1) =) 4 parameters.
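As a minimal sketch (the data frame d and its construction with expand.grid() are assumed here for illustration; the factors A, B, C with two levels each are named to match the parameters discussed next):

library(Matrix)

## Three illustrative two-level factors, fully crossed (2*2*2 = 8 rows)
d <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))

## Same "- 1" formula style as in the question: the first factor keeps a
## column for every level, the remaining factors drop their baseline level
m <- sparse.model.matrix(~ A + B + C - 1, data = d)

dim(m)       ## 8 rows x 4 columns
colnames(m)  ## "A1" "A2" "B2" "C2"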
The first parameter (A1) describes the expected mean in the baseline level of all factors (A=1, B=1, C=1). The second parameter (A2) describes the expected difference between an observation with A=1 and one with A=2 (independent of the other factors). Parameters 3 and 4 (B2, C2) describe the analogous differences between B=1 and B=2, and between C=1 and C=2. You might be thinking "but I want predictor variables for all the levels of all the factors", e.g. something like the sketch below.
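One way to get that full dummy (one-hot) encoding, shown here as an illustration of mine rather than as the original answer's code, is to switch off the treatment contrasts through the contrasts.arg argument:

## Force a full one-hot encoding by disabling treatment contrasts
m_full <- sparse.model.matrix(~ A + B + C - 1, data = d,
                              contrasts.arg = lapply(d, contrasts, contrasts = FALSE))

dim(m_full)       ## 8 rows x 6 columns
colnames(m_full)  ## "A1" "A2" "B1" "B2" "C1" "C2"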
This has all six expected columns, not just 4. But if you examine this matrix, or call rankMatrix() or caret::findLinearCombos() on it, you'll discover that it is multicollinear. In a typical (fixed-effect) additive linear model, you can only estimate an intercept plus the differences between levels, not a value associated with every level. In a regression tree model, the multicollinearity will make your computations slightly less efficient, and will make results about variable importance confusing, but shouldn't hurt your predictions.
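To see that redundancy concretely (using the m_full object from the sketch above, which is my naming, not the original answer's), the checks would look roughly like this:

## The six one-hot columns only span a four-dimensional space,
## so two of them are linearly redundant
Matrix::rankMatrix(m_full)                  ## 4, not 6
caret::findLinearCombos(as.matrix(m_full))  ## identifies the redundant columns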