R 中最快的高宽旋转

发布于 2024-10-22 13:32:34 字数 869 浏览 14 评论 0原文

我正在处理一个简单的表格，其形式为

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

（日期，变量）对是唯一的。我想把这张桌子改造成一张宽形式的桌子。

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   0.000      NA         NA         NA         312e6

我想以尽可能最快的方式完成此操作，因为我必须对具有 1e6 条记录的表重复重复该操作。在 R 原生模式下，我相信 tapply()、reshape() 和 d*ply() 在速度方面均由 <代码>数据.表。我想针对基于 sqlite 的解决方案（或其他数据库）测试后者的性能。这以前做过吗？有性能提升吗？并且，当“宽”字段（日期）的数量是可变的并且事先不知道时，如何在 sqlite 中将高转换为宽？

原文

I am dealing with a simple table of the form

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   0.000      NA         NA         NA         312e6

And I would like to do it in the fastest possible way, since I have to repeat the operation repeatedly over tables with 1e6 records. In R native mode, I believe that both tapply(), reshape() and d*ply() are dominated speed-wise by data.table. I would like to test the performance of the latter against a sqlite-based solution (or other DB). Has this been done before? Are there performance gains? And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is variable and not known in advance?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

感性不性感 2024-10-29 13:32:34

我使用的方法基于 tapply 的做法，但速度快了一个数量级（主要是因为没有每个单元格的函数调用）。

使用 Prasad 帖子中的 tall 的计时：

pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9

关于可能出现的排序问题，应该不会有任何问题：

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE

I use an approach that is based on what tapply does, but is about an order of magnitude faster (primarily as there is no per-cell function call).

Timings using tall from Prasad's post:

pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9

Regarding possible issues with ordering, there shouldn't be any problem:

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE

回复收藏 0 原文

半城柳色半声笛 2024-10-29 13:32:34

几点说明。有几个问题解决了如何在 Sql(ite) 中进行从高到宽的旋转：此处和此处。我没有太深入地研究过这些，但我的印象是，在 SQL 中执行此操作很丑陋，例如：您的 sql 查询需要显式提及查询中所有可能的键！（如果我错了，请有人纠正我）。至于data.table，您绝对可以非常快地执行分组操作，但我不知道如何将结果实际转换为宽格式。

如果你想纯粹用 R 来做，我认为 tapply 是这里的速度冠军，比 reshape2 中的 acast 快得多：

创建一些高的数据，其中有一些漏洞只是为了确保代码做正确的事情：

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98

A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in Sql(ite): here and here. I haven't looked at those too deeply but my impression is that doing it in SQL is ugly, as in: your sql query needs to explicitly mention all possible keys in the query! (someone please correct me if I'm wrong). As for data.table, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.

If you want to do it purely in R, I think tapply is the speed champ here, much faster than acast from reshape2:

Create some tall data, with some holes in it just to make sure the code is doing the right thing:

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98

回复收藏 0 原文

~没有更多了~