Why does sparse.model.matrix() ignore some columns?
Can someone help me? I want to prepare data for an XGBoost prediction, so I need to convert my factor variables. I use sparse.model.matrix(), but there is a problem: I don't understand why the function drops some of the columns. Let me try to explain. My dataset has many variables, but right now these three are important:
- Tsunami.Event.Validity - a factor with 6 levels: -1, 0, 1, 2, 3, 4
- Tsunami.Cause.Code - a factor with 6 levels: 0, 1, 2, 3, 4, 5
- Total.Death.Description - a factor with 5 levels: 0, 1, 2, 3, 4
But when I use sparse.model.matrix() I only get a matrix with 15 columns, not 6 + 6 + 5 = 17. Can someone help me?
sp_matrix = sparse.model.matrix(Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description -1, data = datas)
str(sp_matrix)
Output:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2510] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:16] 0 749 757 779 823 892 1495 2191 2239 2241 ...
..@ Dim : int [1:2] 749 15
..@ Dimnames:List of 2
.. ..$ : chr [1:749] "1" "2" "3" "4" ...
.. ..$ : chr [1:15] "Tsunami.Event.Validity-1" "Tsunami.Event.Validity0" "Tsunami.Event.Validity1" "Tsunami.Event.Validity2" ...
..@ x : num [1:2510] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
..$ assign : int [1:15] 0 1 1 1 1 1 2 2 2 2 ...
..$ contrasts:List of 3
.. ..$ Tsunami.Event.Validity : chr "contr.treatment"
.. ..$ Tsunami.Cause.Code : chr "contr.treatment"
.. ..$ Total.Death.Description: chr "contr.treatment"
1 Answer
This question is a duplicate of "In R, for categorical data with N unique categories, why does sparse.model.matrix() not produce a one-hot encoding with N columns?" ... but that question was never answered.
The answers to this question explain how you could get the full model matrix you're looking for, but don't explain why you might not want to. (For what it's worth, unlike regular linear models, regression trees are robust to multicollinearity, so a full model matrix would actually work in this case, but it's worth understanding why R gives you the answer it does, and why this won't hurt your predictive accuracy ...)
This is a fundamental property of the way that linear models based (additively) on more than one categorical predictor work (and hence of the way that R constructs model matrices). When you construct a model matrix based on factors f1, ..., fn with numbers of levels n1, ..., nn, the number of predictor variables is 1 + sum(ni - 1), not sum(ni). (In your case that is 1 + (6-1) + (6-1) + (5-1) = 15, exactly the number of columns you got.) Let's see how this works with a slightly simpler example, sketched below: with three two-level factors we have a total of (1 + 3*(2-1) =) 4 parameters.
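As a minimal sketch (the data frame d and its construction with expand.grid() are assumed here for illustration; the factors A, B, C with two levels each are named to match the parameters discussed next):

library(Matrix)

## Three illustrative two-level factors, fully crossed (2*2*2 = 8 rows)
d <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))

## Same "- 1" formula style as in the question: the first factor keeps a
## column for every level, the remaining factors drop their baseline level
m <- sparse.model.matrix(~ A + B + C - 1, data = d)

dim(m)       ## 8 rows x 4 columns
colnames(m)  ## "A1" "A2" "B2" "C2"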
The first parameter (A1) describes the expected mean in the baseline level of all factors (A=1, B=1, C=1). The second parameter (A2) describes the expected difference between an observation with A=1 and one with A=2 (independent of the other factors). Parameters 3 and 4 (B2, C2) describe the analogous differences between B=1 and B=2, and between C=1 and C=2. You might be thinking "but I want predictor variables for all the levels of all the factors", e.g. something like the sketch below.
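One way to get that full dummy (one-hot) encoding, shown here as an illustration of mine rather than as the original answer's code, is to switch off the treatment contrasts through the contrasts.arg argument:

## Force a full one-hot encoding by disabling treatment contrasts
m_full <- sparse.model.matrix(~ A + B + C - 1, data = d,
                              contrasts.arg = lapply(d, contrasts, contrasts = FALSE))

dim(m_full)       ## 8 rows x 6 columns
colnames(m_full)  ## "A1" "A2" "B1" "B2" "C1" "C2"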
This has all six expected columns, not just 4. But if you examine this matrix, or call rankMatrix() or caret::findLinearCombos() on it, you'll discover that it is multicollinear. In a typical (fixed-effect) additive linear model, you can only estimate an intercept plus the differences between levels, not a value associated with every level. In a regression tree model, the multicollinearity will make your computations slightly less efficient, and will make results about variable importance confusing, but shouldn't hurt your predictions.
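To see that redundancy concretely (using the m_full object from the sketch above, which is my naming, not the original answer's), the checks would look roughly like this:

## The six one-hot columns only span a four-dimensional space,
## so two of them are linearly redundant
Matrix::rankMatrix(m_full)                  ## 4, not 6
caret::findLinearCombos(as.matrix(m_full))  ## identifies the redundant columns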