Compressing large data in R into a CSV without NULLs or lists

Posted on 2025-02-07 07:53:08

First time posting:

I'm preparing data for read.transactions() from the arules package and need to compress the invoice data (500k+ cases) so that each unique invoice and its associated info fits on a single line, like this:

Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123

Invoice002...etc

However, the data reads in with the invoice repeated for each StockCode, like this:

Invoice001,CustomerID,Country,StockCodeXYZ

Invoice001,CustomerID,Country,StockCode123

Invoice002....etc

I've been trying pivot_wider() followed by unite(), but that generates 285M+ mostly NULL cells in a list column, which I'm having a hard time resolving and can't write to CSV or read into arules. I've also tried keep(~!is.null(.)), discard(is.null), and compact() without success, and I'm open to any method that achieves the desired outcome above.

However, I feel like I should be able to solve this with the arules package's built-in read.transactions() function, but I'm getting various errors as I try different things there too.

The data is openly available from the University of California, Irvine, and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx

Any help would be greatly appreciated.

library(readxl)  # read_excel()
library(arules)  # read.transactions()
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)

trans <- read.transactions(????????????)


Comments (1)

归途 2025-02-14 07:53:08

One invoice, "573585", has over 1,000 items, so a wide layout will generate a corresponding number of columns. Even if you keep only the stock codes from the invoice items, you still end up with a bit over 1,000 columns. Collapsing each invoice into a single string avoids that:

library(dplyr)


Online_20Retail %>% 
    dplyr::transmute(new = paste0(InvoiceNo, ", ", 
                                  CustomerID, ", ", 
                                  Country, ", "), 
                     StockCode) %>% 
    dplyr::group_by(new) %>% 
    dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
    dplyr::transmute(mystring = paste0(new, output)) 
    # Append "%>% dplyr::pull(mystring)" to the line above to get a character vector instead of a tibble/data frame.


# A tibble: 25,900 x 1
   mystring                                                                                                                                         
   <chr>                                                                                                                                            
 1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730                                                               
 2 536366, 17850, United Kingdom, 22633, 22632                                                                                                      
 3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187                                
 4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914                                                                                        
 5 536369, 13047, United Kingdom, 21756                                                                                                             
 6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
 7 536371, 13748, United Kingdom, 22086                                                                                                             
 8 536372, 17850, United Kingdom, 22632, 22633                                                                                                      
 9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258                                                                                                             
# ... with 25,890 more rows
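
The collapsed strings above still have to reach a file before read.transactions() can use them. The following is a minimal sketch of that step in base R, not a tested solution for the full dataset: `toy` is a made-up stand-in for Online_20Retail, `basket_file` is a hypothetical temporary path, and the code assumes every invoice has exactly one CustomerID and Country. The final read.transactions() call only runs if arules is installed.

```r
# Toy stand-in for Online_20Retail (values are invented for illustration).
toy <- data.frame(
  InvoiceNo  = c("A1", "A1", "A2", "A2", "A2"),
  CustomerID = c(1, 1, 2, 2, 2),
  Country    = c("France", "France", "Spain", "Spain", "Spain"),
  StockCode  = c("X1", "X2", "X1", "X3", "X4"),
  stringsAsFactors = FALSE
)

# Collapse the stock codes of each invoice into one comma-separated string.
items <- tapply(toy$StockCode, toy$InvoiceNo, paste, collapse = ",")

# One row of invoice-level info per invoice, aligned with the order of `items`.
# Assumption: CustomerID and Country are constant within an invoice.
info <- unique(toy[, c("InvoiceNo", "CustomerID", "Country")])
info <- info[match(names(items), info$InvoiceNo), ]

# One line per invoice: InvoiceNo,CustomerID,Country,StockCode,StockCode,...
basket_lines <- paste(info$InvoiceNo, info$CustomerID, info$Country, items, sep = ",")

# Write ragged lines directly; no wide table (and no padding cells) is created.
basket_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, basket_file)

# Optional: read the file as transactions if arules is available.
# cols = 1 treats the first field (InvoiceNo) as the transaction ID;
# CustomerID and Country are then read as ordinary items.
if (requireNamespace("arules", quietly = TRUE)) {
  trans <- arules::read.transactions(basket_file, format = "basket",
                                     sep = ",", cols = 1)
}
```

Because each line is written as plain text, invoices with many items simply produce longer lines rather than extra columns. read.transactions() also accepts format = "single", with cols naming the transaction and item columns, which reads the original long layout (one row per invoice and StockCode) without any reshaping.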