Reshaping a data frame in R
I'm running into difficulties reshaping a large data frame. I've been relatively fortunate in avoiding reshaping problems in the past, which also means I'm terrible at it.
My current dataframe looks something like this:
unique_id  seq  response  detailed.name  treatment
a          N1   123.23    descr. of N1   T1
a          N2   231.12    descr. of N2   T1
a          N3   231.23    descr. of N3   T1
...
b          N1   343.23    descr. of N1   T2
b          N2   281.13    descr. of N2   T2
b          N3   901.23    descr. of N3   T2
...
And I'd like:
seq  detailed.name  T1      T2
N1   descr. of N1   123.23  343.23
N2   descr. of N2   231.12  281.13
N3   descr. of N3   231.23  901.23
I've looked into the reshape package, but I'm not sure how I can convert the treatment factors into individual column names.
Thanks!
Edit: I tried running this on my local machine (4 GB dual-core iMac, 3.06 GHz) and it keeps failing with:
> d.tmp.2 <- cast(d.tmp, `SEQ_ID` + `GENE_INFO` ~ treatments)
Aggregation requires fun.aggregate: length used as default
R(5751) malloc: *** mmap(size=647168) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
I'll try running this on one of our bigger machines when I get a chance.
reshape always seems tricky to me too, but it usually works with a little trial and error. Here's what I ended up finding:
Your original data was already in long format, but not in the long format that melt/cast uses. So I re-melted it. The second argument (id.vars) is the list of things not to melt. The third argument (measure.vars) is the list of things that vary.
Then, the cast uses a formula. Left of the tilde are the things that stay as they are, and right of the tilde are the columns that are used to condition the value column.
More or less...!
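The melt and cast steps described above could look roughly like the following. This is only a sketch: the data frame d is a toy copy of the sample rows in the question, and the names d, molten, and wide are made up for illustration. It assumes the reshape package is installed.

```r
library(reshape)  # the original reshape package, which provides melt() and cast()

# Toy data copied from the sample rows in the question (hypothetical name: d)
d <- data.frame(
  unique_id     = rep(c("a", "b"), each = 3),
  seq           = rep(c("N1", "N2", "N3"), times = 2),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = rep(paste("descr. of", c("N1", "N2", "N3")), times = 2),
  treatment     = rep(c("T1", "T2"), each = 3),
  stringsAsFactors = FALSE
)

# Re-melt: id.vars are kept as they are, measure.vars go into variable/value
molten <- melt(d,
               id.vars      = c("seq", "detailed.name", "treatment"),
               measure.vars = "response")

# Left of the tilde: columns that stay as rows.
# Right of the tilde: the column whose levels become new column names.
wide <- cast(molten, seq + detailed.name ~ treatment)
wide
```

Because unique_id is listed in neither id.vars nor measure.vars, it is dropped during the melt, which is what lets the T1 and T2 rows for the same seq line up.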
Building on Harlan's answer: the remelting step can be avoided if the data is already in long format and the column holding the values is specified in the cast call.
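A minimal sketch of casting without the remelt step, assuming the reshape package and a toy data frame d built from the question's sample rows (d and wide are illustrative names). The value argument tells cast which column holds the measurements.

```r
library(reshape)  # cast() accepts a `value` argument naming the value column

# Toy data copied from the sample rows in the question (hypothetical name: d)
d <- data.frame(
  unique_id     = rep(c("a", "b"), each = 3),
  seq           = rep(c("N1", "N2", "N3"), times = 2),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = rep(paste("descr. of", c("N1", "N2", "N3")), times = 2),
  treatment     = rep(c("T1", "T2"), each = 3),
  stringsAsFactors = FALSE
)

# No melt(): point cast() directly at the response column
wide <- cast(d, seq + detailed.name ~ treatment, value = "response")
wide
```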
Another option is to use spread from tidyr. The opposite operation is performed by gather. There is also dcast.data.table from data.table.
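As a rough illustration of the tidyr route, assuming tidyr is installed and using a toy data frame d built from the question's sample rows (d, wide, and long are illustrative names). The unique_id column is dropped first, since otherwise it would keep the T1 and T2 rows on separate lines.

```r
library(tidyr)

# Toy data copied from the sample rows in the question (hypothetical name: d)
d <- data.frame(
  unique_id     = rep(c("a", "b"), each = 3),
  seq           = rep(c("N1", "N2", "N3"), times = 2),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = rep(paste("descr. of", c("N1", "N2", "N3")), times = 2),
  treatment     = rep(c("T1", "T2"), each = 3),
  stringsAsFactors = FALSE
)

# Keep only the columns that identify a row or carry the value
d_small <- d[, c("seq", "detailed.name", "treatment", "response")]

# Long -> wide: one new column per treatment level, filled from response
wide <- spread(d_small, key = treatment, value = response)

# Wide -> long again: gather T1 and T2 back into treatment/response
long <- gather(wide, key = "treatment", value = "response", T1, T2)
```

In current tidyr releases, spread and gather are marked as superseded by pivot_wider and pivot_longer, though they remain available.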
You can also use the reshape function in the stats package. I don't have your sample dataset, but it will look something like this:
If you want to get the same results using reshape2, which is a faster and more memory-efficient rewrite of the reshape package, then the following will work.
The main change is the use of the dcast function when you want to cast with a data.frame as output. This replaces the cast function of reshape.
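A minimal reshape2 sketch along those lines, assuming reshape2 is installed and using a toy data frame d built from the question's sample rows (d and wide are illustrative names).

```r
library(reshape2)

# Toy data copied from the sample rows in the question (hypothetical name: d)
d <- data.frame(
  unique_id     = rep(c("a", "b"), each = 3),
  seq           = rep(c("N1", "N2", "N3"), times = 2),
  response      = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23),
  detailed.name = rep(paste("descr. of", c("N1", "N2", "N3")), times = 2),
  treatment     = rep(c("T1", "T2"), each = 3),
  stringsAsFactors = FALSE
)

# dcast returns a plain data.frame; value.var names the column to spread out
wide <- dcast(d, seq + detailed.name ~ treatment, value.var = "response")
wide
```

If some combination of seq, detailed.name, and treatment occurs more than once, dcast needs an aggregation function and falls back to length when none is given, much like the "length used as default" message in the question. Passing fun.aggregate explicitly (for example mean) or de-duplicating the rows first avoids getting counts instead of values.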