Large-scale regression in R with a sparse feature matrix
I'd like to do large-scale regression (linear/logistic) in R with many (e.g. 100k) features, where each example is relatively sparse in the feature space---e.g., ~1k non-zero features per example.
It seems like slm from the SparseM package should do this, but I'm having difficulty converting from the sparseMatrix format to an slm-friendly format.
I have a numeric vector of labels y and a sparseMatrix of features X ∈ {0,1}. When I try
model <- slm(y ~ X)
I get the following error:
Error in model.frame.default(formula = y ~ X) :
invalid type (S4) for variable 'X'
presumably because slm wants a SparseM object instead of a sparseMatrix.
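For concreteness, a minimal sketch of the setup described (dimensions and density here are arbitrary) that runs into the same error:

library(Matrix)
library(SparseM)

set.seed(1)
X <- rsparsematrix(100, 20, density = 0.1,
                   rand.x = function(k) rep(1, k))   # sparseMatrix of 0/1 features (dgCMatrix)
y <- rnorm(100)

try(slm(y ~ X))   # Error: invalid type (S4) for variable 'X'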
Is there an easy way to either a) populate a SparseM object directly or b) convert a sparseMatrix to a SparseM object? Or perhaps there's a better/simpler way to do this?
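Continuing from the sketch above, one possible route for (b) is to coerce the dgCMatrix to Matrix's row-compressed form and build a SparseM matrix.csr from its slots (SparseM uses 1-based indices, Matrix 0-based), then hand the result to a fitter such as SparseM's slm.fit, which takes the design matrix and response directly. This is only a sketch, not a tested recipe:

Xr <- as(X, "RsparseMatrix")        # CSR view of the dgCMatrix
X.csr <- new("matrix.csr",
             ra = Xr@x,             # non-zero values
             ja = Xr@j + 1L,        # column indices, shifted to 1-based
             ia = Xr@p + 1L,        # row pointers, shifted to 1-based
             dimension = Xr@Dim)

fit <- slm.fit(X.csr, y)            # bypasses the formula/model.frame machinery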
(I suppose I could explicitly code the solutions for linear regression using X and y, but it would be nice to have slm working.)
Don't know about SparseM, but the MatrixModels package has an unexported lm.fit.sparse function that you can use. See ?MatrixModels:::lm.fit.sparse. Here is an example.

Create the data:
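(A minimal sketch; the sizes, density, and seed below are arbitrary.)

library(Matrix)

set.seed(1)
n <- 1000; p <- 100
X <- rsparsematrix(n, p, density = 0.1,
                   rand.x = function(k) rep(1, k))   # sparse 0/1 design matrix (dgCMatrix)
beta <- rnorm(p)
y <- as.vector(X %*% beta) + rnorm(n)                # synthetic response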
Run the regression:
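Assuming the X and y built above (lm.fit.sparse takes the design matrix and the response directly; there is no formula interface):

# not exported, hence the ':::'
fit <- MatrixModels:::lm.fit.sparse(X, y)
str(fit)   # inspect the returned fit; see ?MatrixModels:::lm.fit.sparse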
For comparison:
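The coefficients can be checked against an ordinary dense fit on the same design:

fit_dense <- lm(y ~ as.matrix(X) - 1)   # dense design, no intercept
head(coef(fit_dense))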
A belated answer: glmnet will also support sparse matrices and both of the regression models requested. This can use the sparse matrices produced by the Matrix package. I advise looking into regularized models via this package. As sparse data often involves very sparse support for some variables, L1 regularization is useful for knocking these out of the model. It's often safer than getting some very spurious parameter estimates for variables with very low support.
glmnet is a good choice. It supports L1 and L2 regularization for linear, logistic, and multinomial regression, among other options. The only detail is that it doesn't have a formula interface, so you have to create your model matrix. But this is where the gain is.

Here is a pseudo-example:
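A sketch of what such a pseudo-example might look like, with an arbitrary sparse 0/1 design built via the Matrix package (for logistic regression, y would be a 0/1 vector or factor and family = "binomial"):

library(glmnet)
library(Matrix)

set.seed(1)
X <- rsparsematrix(1000, 100, density = 0.05,
                   rand.x = function(k) rep(1, k))   # sparse model matrix (dgCMatrix)
y <- rnorm(1000)

# glmnet accepts the dgCMatrix directly; alpha = 1 is the lasso (pure L1)
fit <- glmnet(X, y, family = "gaussian", alpha = 1)

# cross-validate to pick lambda, then extract the (sparse) coefficient vector
cvfit <- cv.glmnet(X, y, alpha = 1)
coef(cvfit, s = "lambda.min")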
You might also get some mileage by looking here: