R 中最快的高宽旋转
我正在处理一个简单的表格,其形式为
date variable value
1970-01-01 V1 0.434
1970-01-01 V2 12.12
1970-01-01 V3 921.1
1970-01-02 V1 -1.10
1970-01-03 V3 0.000
1970-01-03 V5 312e6
... ... ...
(日期,变量)对是唯一的。我想把这张桌子改造成一张宽形式的桌子。
date V1 V2 V3 V4 V5
1970-01-01 0.434 12.12 921.1 NA NA
1970-01-02 -1.10 NA NA NA NA
1970-01-03 0.000 NA NA NA 312e6
我想以尽可能最快的方式完成此操作,因为我必须对具有 1e6 条记录的表重复重复该操作。在 R 原生模式下,我相信 tapply()
、reshape()
和 d*ply()
在速度方面均由 <代码>数据.表。我想针对基于 sqlite 的解决方案(或其他数据库)测试后者的性能。这以前做过吗?有性能提升吗?并且,当“宽”字段(日期)的数量是可变的并且事先不知道时,如何在 sqlite 中将高转换为宽?
I am dealing with a simple table of the form
date variable value
1970-01-01 V1 0.434
1970-01-01 V2 12.12
1970-01-01 V3 921.1
1970-01-02 V1 -1.10
1970-01-03 V3 0.000
1970-01-03 V5 312e6
... ... ...
The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.
date V1 V2 V3 V4 V5
1970-01-01 0.434 12.12 921.1 NA NA
1970-01-02 -1.10 NA NA NA NA
1970-01-03 0.000 NA NA NA 312e6
And I would like to do it in the fastest possible way, since I have to repeat the operation repeatedly over tables with 1e6 records. In R native mode, I believe that both tapply()
, reshape()
and d*ply()
are dominated speed-wise by data.table
. I would like to test the performance of the latter against a sqlite-based solution (or other DB). Has this been done before? Are there performance gains? And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is variable and not known in advance?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
![扫码二维码加入Web技术交流群](/public/img/jiaqun_03.jpg)
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我使用的方法基于
tapply
的做法,但速度快了一个数量级(主要是因为没有每个单元格的函数调用)。使用 Prasad 帖子中的
tall
的计时:关于可能出现的排序问题,应该不会有任何问题:
I use an approach that is based on what
tapply
does, but is about an order of magnitude faster (primarily as there is no per-cell function call).Timings using
tall
from Prasad's post:Regarding possible issues with ordering, there shouldn't be any problem:
几点说明。有几个问题解决了如何在 Sql(ite) 中进行从高到宽的旋转:此处 和此处。我没有太深入地研究过这些,但我的印象是,在 SQL 中执行此操作很丑陋,例如:您的 sql 查询需要显式提及查询中所有可能的键! (如果我错了,请有人纠正我)。至于
data.table
,您绝对可以非常快地执行分组操作,但我不知道如何将结果实际转换为宽格式。如果你想纯粹用 R 来做,我认为
tapply
是这里的速度冠军,比reshape2
中的acast
快得多:创建一些高的数据,其中有一些漏洞只是为了确保代码做正确的事情:
A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in Sql(ite): here and here. I haven't looked at those too deeply but my impression is that doing it in SQL is ugly, as in: your sql query needs to explicitly mention all possible keys in the query! (someone please correct me if I'm wrong). As for
data.table
, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.If you want to do it purely in R, I think
tapply
is the speed champ here, much faster thanacast
fromreshape2
:Create some tall data, with some holes in it just to make sure the code is doing the right thing: