Lasso running slowly?

Posted on 2025-01-22 23:18:04


I have a large dataset that I've been trying to run a lasso regression on. Categorical variables are re-coded to dummies. After receiving several messages regarding limited memory, I converted my data into a sparse matrix using Matrix.

The issue is that my code has been running for a long time (several hours without completion), and I'm not sure why.

Here is a sample of 2000 rows of data (~0.3% of data) that produces the same issue:
https://drive.google.com/file/d/1ZhyFIoxJSRHrC_eIe58C5zXFKJW-13Lm/view?usp=sharing

This is the code I've been using:

    library(tidyverse)
    library(Matrix)
    install.packages('glmnet')
    library(glmnet)
    pacman::p_load(methods,utils,foreach,shape,survival,Rcpp,RcppEigen)

    data_sample_matrix = as.matrix(data_sample) %>% Matrix(.,sparse = TRUE)

    set.seed(879)

    split <- sample(nrow(data_sample_matrix), floor(0.8*nrow(data_sample_matrix)))
    
    train <- data_sample_matrix[split,]
    test <- data_sample_matrix[-split,]
    
    train_s <- train[,-28]
    test_s <- test[,-28]
    
    cv_model = cv.glmnet(train_s, train[,28], alpha=1, family = "binomial", nlambda=10, 
                         trace.it = TRUE)

Note: I've explicitly loaded all the packages that glmnet depends on according to CRAN, because I noticed that they weren't being loaded when I ran library(glmnet).

Note: [,28] represents my outcome variable.
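
For anyone trying to reproduce this, the inputs can be sanity-checked before the cv.glmnet call. The sketch below is illustrative only. It does not use the linked sample: `toy` is a made-up stand-in with 27 random dummy columns and a binary outcome in column 28, and the names `toy`, `x`, and `y` are hypothetical. It shows the predictor dimensions, the fraction of non-zero entries, and the class balance of the outcome, which are the first things worth confirming when a fit takes unexpectedly long.

```r
library(Matrix)

# Hypothetical stand-in for data_sample: 27 dummy predictors
# plus a binary outcome stored in column 28 (an assumption, not the real data).
set.seed(879)
n <- 200
dummy_cols <- matrix(rbinom(n * 27, 1, 0.1), nrow = n)
outcome <- rbinom(n, 1, 0.5)
toy <- Matrix(cbind(dummy_cols, outcome), sparse = TRUE)

# Same split of predictors and outcome as in the code above.
x <- toy[, -28]
y <- toy[, 28]

dim(x)                     # number of rows and number of predictors
nnzero(x) / prod(dim(x))   # fraction of non-zero entries in the predictors
table(y)                   # class balance of the outcome
```

Running the same three checks on `train_s` and `train[,28]` should reveal whether the matrix is much wider or denser than expected.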

Can anyone point to what I'm doing wrong?
