R 中最快的高宽旋转
我正在处理一个简单的表格,其形式为
date variable value
1970-01-01 V1 0.434
1970-01-01 V2 12.12
1970-01-01 V3 921.1
1970-01-02 V1 -1.10
1970-01-03 V3 0.000
1970-01-03 V5 312e6
... ... ...
(日期,变量)对是唯一的。我想把这张桌子改造成一张宽形式的桌子。
date V1 V2 V3 V4 V5
1970-01-01 0.434 12.12 921.1 NA NA
1970-01-02 -1.10 NA NA NA NA
1970-01-03 0.000 NA NA NA 312e6
我想以尽可能最快的方式完成此操作,因为我必须对具有 1e6 条记录的表重复重复该操作。在 R 原生模式下,我相信 tapply()
、reshape()
和 d*ply()
在速度方面均由 <代码>数据.表。我想针对基于 sqlite 的解决方案(或其他数据库)测试后者的性能。这以前做过吗?有性能提升吗?并且,当“宽”字段(日期)的数量是可变的并且事先不知道时,如何在 sqlite 中将高转换为宽?
I am dealing with a simple table of the form
date variable value
1970-01-01 V1 0.434
1970-01-01 V2 12.12
1970-01-01 V3 921.1
1970-01-02 V1 -1.10
1970-01-03 V3 0.000
1970-01-03 V5 312e6
... ... ...
The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.
date V1 V2 V3 V4 V5
1970-01-01 0.434 12.12 921.1 NA NA
1970-01-02 -1.10 NA NA NA NA
1970-01-03 0.000 NA NA NA 312e6
And I would like to do it in the fastest possible way, since I have to repeat the operation repeatedly over tables with 1e6 records. In R native mode, I believe that both tapply()
, reshape()
and d*ply()
are dominated speed-wise by data.table
. I would like to test the performance of the latter against a sqlite-based solution (or other DB). Has this been done before? Are there performance gains? And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is variable and not known in advance?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我使用的方法基于
tapply
的做法,但速度快了一个数量级(主要是因为没有每个单元格的函数调用)。使用 Prasad 帖子中的
tall
的计时:关于可能出现的排序问题,应该不会有任何问题:
I use an approach that is based on what
tapply
does, but is about an order of magnitude faster (primarily as there is no per-cell function call).Timings using
tall
from Prasad's post:Regarding possible issues with ordering, there shouldn't be any problem:
几点说明。有几个问题解决了如何在 Sql(ite) 中进行从高到宽的旋转:此处 和此处。我没有太深入地研究过这些,但我的印象是,在 SQL 中执行此操作很丑陋,例如:您的 sql 查询需要显式提及查询中所有可能的键! (如果我错了,请有人纠正我)。至于
data.table
,您绝对可以非常快地执行分组操作,但我不知道如何将结果实际转换为宽格式。如果你想纯粹用 R 来做,我认为
tapply
是这里的速度冠军,比reshape2
中的acast
快得多:创建一些高的数据,其中有一些漏洞只是为了确保代码做正确的事情:
A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in Sql(ite): here and here. I haven't looked at those too deeply but my impression is that doing it in SQL is ugly, as in: your sql query needs to explicitly mention all possible keys in the query! (someone please correct me if I'm wrong). As for
data.table
, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.If you want to do it purely in R, I think
tapply
is the speed champ here, much faster thanacast
fromreshape2
:Create some tall data, with some holes in it just to make sure the code is doing the right thing: