R 中最快的高宽旋转

发布于 2024-10-22 13:32:34 字数 869 浏览 3 评论 0原文

我正在处理一个简单的表格,其形式为

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

(日期,变量)对是唯一的。我想把这张桌子改造成一张宽形式的桌子。

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   0.000      NA         NA         NA         312e6

我想以尽可能最快的方式完成此操作,因为我必须对具有 1e6 条记录的表重复重复该操作。在 R 原生模式下,我相信 tapply()reshape()d*ply() 在速度方面均由 <代码>数据.表。我想针对基于 sqlite 的解决方案(或其他数据库)测试后者的性能。这以前做过吗?有性能提升吗?并且,当“宽”字段(日期)的数量是可变的并且事先不知道时,如何在 sqlite 中将高转换为宽?

I am dealing with a simple table of the form

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   0.000      NA         NA         NA         312e6

And I would like to do it in the fastest possible way, since I have to repeat the operation repeatedly over tables with 1e6 records. In R native mode, I believe that both tapply(), reshape() and d*ply() are dominated speed-wise by data.table. I would like to test the performance of the latter against a sqlite-based solution (or other DB). Has this been done before? Are there performance gains? And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is variable and not known in advance?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

感性不性感 2024-10-29 13:32:34

我使用的方法基于 tapply 的做法,但速度快了一个数量级(主要是因为没有每个单元格的函数调用)。

使用 Prasad 帖子中的 tall 的计时:

pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9 

关于可能出现的排序问题,应该不会有任何问题:

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE

I use an approach that is based on what tapply does, but is about an order of magnitude faster (primarily as there is no per-cell function call).

Timings using tall from Prasad's post:

pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9 

Regarding possible issues with ordering, there shouldn't be any problem:

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE
半城柳色半声笛 2024-10-29 13:32:34

几点说明。有几个问题解决了如何在 Sql(ite) 中进行从高到宽的旋转:此处此处。我没有太深入地研究过这些,但我的印象是,在 SQL 中执行此操作很丑陋,例如:您的 sql 查询需要显式提及查询中所有可能的键! (如果我错了,请有人纠正我)。至于data.table,您绝对可以非常快地执行分组操作,但我不知道如何将结果实际转换为宽格式。

如果你想纯粹用 R 来做,我认为 tapply 是这里的速度冠军,比 reshape2 中的 acast 快得多:

创建一些高的数据,其中有一些漏洞只是为了确保代码做正确的事情:

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98 

A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in Sql(ite): here and here. I haven't looked at those too deeply but my impression is that doing it in SQL is ugly, as in: your sql query needs to explicitly mention all possible keys in the query! (someone please correct me if I'm wrong). As for data.table, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.

If you want to do it purely in R, I think tapply is the speed champ here, much faster than acast from reshape2:

Create some tall data, with some holes in it just to make sure the code is doing the right thing:

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98 
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文