Large-scale regression in R with a sparse feature matrix

Posted on 2024-09-08 01:20:41

I'd like to do large-scale regression (linear/logistic) in R with many (e.g. 100k) features, where each example is relatively sparse in the feature space, e.g. ~1k non-zero features per example.

It seems like slm from the SparseM package should do this, but I'm having difficulty converting from the sparseMatrix format to an slm-friendly format.

I have a numeric vector of labels y and a sparseMatrix of features X ∈ {0,1}. When I try

model <- slm(y ~ X)

I get the following error:

Error in model.frame.default(formula = y ~ X) : 
invalid type (S4) for variable 'X'

presumably because slm wants a SparseM object instead of a sparseMatrix.

Is there an easy way to either a) populate a SparseM object directly or b) convert a sparseMatrix to a SparseM object? Or perhaps there's a better/simpler way to do this?
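For (b), one route I'm considering is to rebuild SparseM's matrix.csr directly from the slots of the transposed dgCMatrix, since dgCMatrix is compressed by column while matrix.csr is compressed by row. A minimal sketch, assuming X really is a dgCMatrix (not a pattern ngCMatrix, which has no @x slot) and that SparseM's slm.fit(x, y) entry point accepts the result; dgc_to_csr is a hypothetical helper name:

library(Matrix)
library(SparseM)

# Hypothetical helper: CSC (dgCMatrix) -> CSR (matrix.csr).
# The CSC slots of t(X) are exactly the CSR slots of X;
# note Matrix uses 0-based indices, SparseM uses 1-based.
dgc_to_csr <- function(X) {
  Xt <- t(X)
  new("matrix.csr",
      ra = Xt@x,          # non-zero values
      ja = Xt@i + 1L,     # column indices of the non-zeros
      ia = Xt@p + 1L,     # row pointers
      dimension = dim(X))
}

# model <- slm.fit(dgc_to_csr(X), y)   # bypasses the formula interface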

(I suppose I could explicitly code the solutions for linear regression using X and y, but it would be nice to have slm working.)
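A minimal sketch of that explicit fallback with Matrix's sparse solver, assuming X'X is non-singular (i.e. X has full column rank):

library(Matrix)

# Least squares via the normal equations (X'X) beta = X'y.
# crossprod() keeps the products sparse, and solve() then uses
# a sparse Cholesky factorization of the symmetric X'X.
beta <- solve(crossprod(X), crossprod(X, y))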

4 Answers

以往的大感动 2024-09-15 01:20:41

Don't know about SparseM, but the MatrixModels package has an unexported lm.fit.sparse function that you can use. See ?MatrixModels:::lm.fit.sparse. Here is an example:

Create the data:

library(Matrix)  # provides the sparse matrix classes used below

y <- rnorm(30)
x <- factor(sample(letters, 30, replace=TRUE))

# Coercing a factor yields its sparse indicator matrix:
# one row per observed level, one column per observation
X <- as(x, "sparseMatrix")
class(X)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
dim(X)
# [1] 18 30

Run the regression:

# Transpose so rows are observations and columns are predictors
MatrixModels:::lm.fit.sparse(t(X), y)
#  [1] -0.17499968 -0.89293312 -0.43585172  0.17233007 -0.11899582  0.56610302
#  [7]  1.19654666 -1.66783581 -0.28511569 -0.11859264 -0.04037503  0.04826549
# [13] -0.06039113 -0.46127034 -1.22106064 -0.48729092 -0.28524498  1.81681527

For comparison:

lm(y~x-1)

# Call:
# lm(formula = y ~ x - 1)
# 
# Coefficients:
#       xa        xb        xd        xe        xf        xg        xh        xj  
# -0.17500  -0.89293  -0.43585   0.17233  -0.11900   0.56610   1.19655  -1.66784  
#       xm        xq        xr        xt        xu        xv        xw        xx  
# -0.28512  -0.11859  -0.04038   0.04827  -0.06039  -0.46127  -1.22106  -0.48729  
#       xy        xz  
# -0.28524   1.81682  
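Note that, as the output shows, lm.fit.sparse returns a bare coefficient vector rather than a full lm-style fit object, so the usual summary methods and standard errors are not available.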
ㄟ。诗瑗 2024-09-15 01:20:41

A belated answer: glmnet also supports sparse matrices and both of the regression models requested, and it accepts the sparse matrices produced by the Matrix package. I advise looking into regularized models via this package. Since sparse data often means very sparse support for some variables, L1 regularization is useful for knocking those variables out of the model; it is often safer than getting some very spurious parameter estimates for variables with very low support.
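A minimal sketch of the idea, with a toy design fabricated by Matrix::rsparsematrix (swap in your own X and y):

library(glmnet)
library(Matrix)

# Toy sparse design: 200 examples, 1000 features, ~1% non-zero
set.seed(1)
X <- rsparsematrix(200, 1000, density = 0.01)
y <- rnorm(200)

# L1-regularized (lasso) linear regression, fit directly on the dgCMatrix;
# family = "binomial" would give the logistic variant
fit <- glmnet(X, y, family = "gaussian", alpha = 1)
coef(fit, s = 0.1)   # sparse coefficient vector at lambda = 0.1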

征﹌骨岁月お 2024-09-15 01:20:41

glmnet is a good choice. It supports L1 and L2 regularization for linear, logistic, and multinomial regression, among other options.

The only detail is that it doesn't have a formula interface, so you have to create the model matrix yourself. But that is exactly where the gain is: sparse.model.matrix() builds the design matrix in sparse form from the start.

Here is a pseudo-example (with toy stand-in data):

library(glmnet)
library(Matrix)  # for sparse.model.matrix()
library(doMC)    # parallel backend used by cv.glmnet(parallel=TRUE)
registerDoMC(cores=4)

# Toy stand-in data: two categorical predictors and a binary response
# (replace these with your own data frame and label vector)
dat <- data.frame(f1 = factor(sample(letters, 200, replace=TRUE)),
                  f2 = factor(sample(letters, 200, replace=TRUE)))
y_train <- factor(sample(0:1, 200, replace=TRUE))

# Build the design matrix in sparse form from the start
# (-1 drops the intercept column; glmnet adds its own)
x_train <- sparse.model.matrix(~ . -1, data=dat)

# For example, logistic regression with the L1 norm (lasso),
# tuned by 5-fold cross-validation on AUC
cv.fit <- cv.glmnet(x=x_train, y=y_train, family='binomial', alpha=1,
                    type.logistic="modified.Newton", type.measure = "auc",
                    nfolds=5, parallel=TRUE)

plot(cv.fit)
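After cross-validation you can query the fit at the selected penalty, e.g. coef(cv.fit, s = "lambda.min") for the coefficients, or predict(cv.fit, newx = x_test, s = "lambda.1se") for predictions on new data (x_test standing in for a held-out sparse design matrix).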
行雁书 2024-09-15 01:20:41

You might also get some mileage by looking here:
