Large-scale regression in R with a sparse feature matrix

Posted on 2024-09-08 01:20:41

I'd like to do large-scale regression (linear/logistic) in R with many (e.g. 100k) features, where each example is relatively sparse in the feature space, e.g. ~1k non-zero features per example.

It seems like slm from the SparseM package should do this, but I'm having difficulty converting from the sparseMatrix format to an slm-friendly format.

I have a numeric vector of labels y and a sparseMatrix of features X ∈ {0,1}. When I try

model <- slm(y ~ X)

I get the following error:

Error in model.frame.default(formula = y ~ X) : 
invalid type (S4) for variable 'X'

presumably because slm wants a SparseM object instead of a sparseMatrix.

Is there an easy way to either a) populate a SparseM object directly or b) convert a sparseMatrix to a SparseM object? Or perhaps there's a better/simpler way to do this?
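For (b), one route I'm considering is to rebuild SparseM's matrix.csr directly from the slots of the transposed dgCMatrix, since dgCMatrix is compressed by column while matrix.csr is compressed by row. A minimal sketch, assuming X really is a dgCMatrix (not a pattern ngCMatrix, which has no @x slot) and that SparseM's slm.fit(x, y) entry point accepts the result; dgc_to_csr is a hypothetical helper name:

library(Matrix)
library(SparseM)

# Hypothetical helper: CSC (dgCMatrix) -> CSR (matrix.csr).
# The CSC slots of t(X) are exactly the CSR slots of X;
# note Matrix uses 0-based indices, SparseM uses 1-based.
dgc_to_csr <- function(X) {
  Xt <- t(X)
  new("matrix.csr",
      ra = Xt@x,          # non-zero values
      ja = Xt@i + 1L,     # column indices of the non-zeros
      ia = Xt@p + 1L,     # row pointers
      dimension = dim(X))
}

# model <- slm.fit(dgc_to_csr(X), y)   # bypasses the formula interface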

(I suppose I could explicitly code the solutions for linear regression using X and y, but it would be nice to have slm working.)
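A minimal sketch of that explicit fallback with Matrix's sparse solver, assuming X'X is non-singular (i.e. X has full column rank):

library(Matrix)

# Least squares via the normal equations (X'X) beta = X'y.
# crossprod() keeps the products sparse, and solve() then uses
# a sparse Cholesky factorization of the symmetric X'X.
beta <- solve(crossprod(X), crossprod(X, y))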

4 Answers

以往的大感动 2024-09-15 01:20:41

Don't know about SparseM, but the MatrixModels package has an unexported lm.fit.sparse function that you can use. See ?MatrixModels:::lm.fit.sparse. Here is an example:

Create the data:

library(Matrix)  # provides the sparse matrix classes used below

y <- rnorm(30)
x <- factor(sample(letters, 30, replace=TRUE))

# Coercing a factor yields its sparse indicator matrix:
# one row per observed level, one column per observation
X <- as(x, "sparseMatrix")
class(X)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
dim(X)
# [1] 18 30

Run the regression:

# Transpose so rows are observations and columns are predictors
MatrixModels:::lm.fit.sparse(t(X), y)
#  [1] -0.17499968 -0.89293312 -0.43585172  0.17233007 -0.11899582  0.56610302
#  [7]  1.19654666 -1.66783581 -0.28511569 -0.11859264 -0.04037503  0.04826549
# [13] -0.06039113 -0.46127034 -1.22106064 -0.48729092 -0.28524498  1.81681527

For comparison:

lm(y~x-1)

# Call:
# lm(formula = y ~ x - 1)
# 
# Coefficients:
#       xa        xb        xd        xe        xf        xg        xh        xj  
# -0.17500  -0.89293  -0.43585   0.17233  -0.11900   0.56610   1.19655  -1.66784  
#       xm        xq        xr        xt        xu        xv        xw        xx  
# -0.28512  -0.11859  -0.04038   0.04827  -0.06039  -0.46127  -1.22106  -0.48729  
#       xy        xz  
# -0.28524   1.81682  
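Note that, as the output shows, lm.fit.sparse returns a bare coefficient vector rather than a full lm-style fit object, so the usual summary methods and standard errors are not available.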
ㄟ。诗瑗 2024-09-15 01:20:41

A belated answer: glmnet also supports sparse matrices and both of the regression models requested, and it accepts the sparse matrices produced by the Matrix package. I advise looking into regularized models via this package. Since sparse data often means very sparse support for some variables, L1 regularization is useful for knocking those variables out of the model; it is often safer than getting some very spurious parameter estimates for variables with very low support.
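A minimal sketch of the idea, with a toy design fabricated by Matrix::rsparsematrix (swap in your own X and y):

library(glmnet)
library(Matrix)

# Toy sparse design: 200 examples, 1000 features, ~1% non-zero
set.seed(1)
X <- rsparsematrix(200, 1000, density = 0.01)
y <- rnorm(200)

# L1-regularized (lasso) linear regression, fit directly on the dgCMatrix;
# family = "binomial" would give the logistic variant
fit <- glmnet(X, y, family = "gaussian", alpha = 1)
coef(fit, s = 0.1)   # sparse coefficient vector at lambda = 0.1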

征﹌骨岁月お 2024-09-15 01:20:41

glmnet is a good choice. It supports L1 and L2 regularization for linear, logistic, and multinomial regression, among other options.

The only detail is that it doesn't have a formula interface, so you have to create the model matrix yourself. But that is exactly where the gain is: sparse.model.matrix() builds the design matrix in sparse form from the start.

Here is a pseudo-example (with toy stand-in data):

library(glmnet)
library(Matrix)  # for sparse.model.matrix()
library(doMC)    # parallel backend used by cv.glmnet(parallel=TRUE)
registerDoMC(cores=4)

# Toy stand-in data: two categorical predictors and a binary response
# (replace these with your own data frame and label vector)
dat <- data.frame(f1 = factor(sample(letters, 200, replace=TRUE)),
                  f2 = factor(sample(letters, 200, replace=TRUE)))
y_train <- factor(sample(0:1, 200, replace=TRUE))

# Build the design matrix in sparse form from the start
# (-1 drops the intercept column; glmnet adds its own)
x_train <- sparse.model.matrix(~ . -1, data=dat)

# For example, logistic regression with the L1 norm (lasso),
# tuned by 5-fold cross-validation on AUC
cv.fit <- cv.glmnet(x=x_train, y=y_train, family='binomial', alpha=1,
                    type.logistic="modified.Newton", type.measure = "auc",
                    nfolds=5, parallel=TRUE)

plot(cv.fit)
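After cross-validation you can query the fit at the selected penalty, e.g. coef(cv.fit, s = "lambda.min") for the coefficients, or predict(cv.fit, newx = x_test, s = "lambda.1se") for predictions on new data (x_test standing in for a held-out sparse design matrix).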
行雁书 2024-09-15 01:20:41

You might also get some mileage by looking here:
