在 R 中重塑数据框

发布于 2024-08-06 19:44:29 字数 1126 浏览 5 评论 0原文

我在重塑大型数据框时遇到了困难。过去我在避免重塑问题方面相对幸运，但这也意味着我在这方面很糟糕。

我当前的数据框看起来像这样：

unique_id    seq   response    detailed.name    treatment 
a            N1     123.23     descr. of N1     T1
a            N2     231.12     descr. of N2     T1
a            N3     231.23     descr. of N3     T1
...
b            N1     343.23     descr. of N1     T2
b            N2     281.13     descr. of N2     T2
b            N3     901.23     descr. of N3     T2
...

我想：

seq    detailed.name   T1           T2
N1     descr. of N1    123.23       343.23
N2     descr. of N2    231.12       281.13
N3     descr. of N3    231.23       901.23

我已经研究了重塑包，但我不确定如何将处理因子转换为单独的列名称。

谢谢！

编辑：我尝试在本地计算机（4GB 双核 iMac 3.06Ghz）上运行它，但它一直失败：

> d.tmp.2 <- cast(d.tmp, `SEQ_ID` + `GENE_INFO` ~ treatments)
Aggregation requires fun.aggregate: length used as default
R(5751) malloc: *** mmap(size=647168) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

当我有机会时，我会尝试在我们的一台更大的计算机上运行它。

原文

I'm running into difficulties reshaping a large dataframe. And I've been relatively fortunate in avoiding reshaping problems in the past, which also means I'm terrible at it.

My current dataframe looks something like this:

unique_id    seq   response    detailed.name    treatment 
a            N1     123.23     descr. of N1     T1
a            N2     231.12     descr. of N2     T1
a            N3     231.23     descr. of N3     T1
...
b            N1     343.23     descr. of N1     T2
b            N2     281.13     descr. of N2     T2
b            N3     901.23     descr. of N3     T2
...

And I'd like:

seq    detailed.name   T1           T2
N1     descr. of N1    123.23       343.23
N2     descr. of N2    231.12       281.13
N3     descr. of N3    231.23       901.23

I've looked into the reshape package, but I'm not sure how I can convert the treatment factors into individual column names.

Thanks!

Edit: I tried running this on my local machine (4GB dual-core iMac 3.06Ghz) and it keeps failing with:

> d.tmp.2 <- cast(d.tmp, `SEQ_ID` + `GENE_INFO` ~ treatments)
Aggregation requires fun.aggregate: length used as default
R(5751) malloc: *** mmap(size=647168) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

I'll try running this on one of our bigger machines when I get a chance.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

攀登最高峰 2024-08-13 19:44:29

重塑对我来说也总是很棘手，但它似乎总是需要一些尝试和错误才能发挥作用。这就是我最终发现的结果：

> x
  unique_id seq response detailed.name treatment
1         a  N1   123.23           dN1        T1
2         a  N2   231.12           dN2        T1
3         a  N3   231.23           dN3        T1
4         b  N1   343.23           dN1        T2
5         b  N2   281.13           dN2        T2
6         b  N3   901.23           dN3        T2

> x2 <- melt(x, c("seq", "detailed.name", "treatment"), "response")
> x2
  seq detailed.name treatment variable  value
1  N1           dN1        T1 response 123.23
2  N2           dN2        T1 response 231.12
3  N3           dN3        T1 response 231.23
4  N1           dN1        T2 response 343.23
5  N2           dN2        T2 response 281.13
6  N3           dN3        T2 response 901.23

> cast(x2, seq + detailed.name ~ treatment)
  seq detailed.name     T1     T2
1  N1           dN1 123.23 343.23
2  N2           dN2 231.12 281.13
3  N3           dN3 231.23 901.23

您的原始数据已经是长格式，但不是熔化/铸造使用的长格式。所以我又把它融化了。第二个参数（id.vars）是不要融化的东西的列表。第三个参数 (measure.vars) 是变化的事物的列表。

然后，演员阵容使用一个公式。波浪线左侧是保持原样的内容，波浪线右侧是用于条件值列的列。

或多或少...！

reshape always seems tricky to me too, but it always seems to work with a little trial and error. Here's what I ended up finding:

> x
  unique_id seq response detailed.name treatment
1         a  N1   123.23           dN1        T1
2         a  N2   231.12           dN2        T1
3         a  N3   231.23           dN3        T1
4         b  N1   343.23           dN1        T2
5         b  N2   281.13           dN2        T2
6         b  N3   901.23           dN3        T2

> x2 <- melt(x, c("seq", "detailed.name", "treatment"), "response")
> x2
  seq detailed.name treatment variable  value
1  N1           dN1        T1 response 123.23
2  N2           dN2        T1 response 231.12
3  N3           dN3        T1 response 231.23
4  N1           dN1        T2 response 343.23
5  N2           dN2        T2 response 281.13
6  N3           dN3        T2 response 901.23

> cast(x2, seq + detailed.name ~ treatment)
  seq detailed.name     T1     T2
1  N1           dN1 123.23 343.23
2  N2           dN2 231.12 281.13
3  N3           dN3 231.23 901.23

Your original data was already in long format, but not in the long format that melt/cast uses. So I re-melted it. The second argument (id.vars) is list of things not to melt. The third argument (measure.vars) is the list of things that vary.

Then, the cast uses a formula. Left of the tilde are the things that stay as they are, and right of the tilde are the columns that are used to condition the value column.

More or less...!

回复收藏 0 原文

兔姬 2024-08-13 19:44:29

基于 Harlan 的答案 - 如果数据已经是长格式，并且在 cast 调用中指定了保存值的列，则可以避免重熔步骤。

> x <- read.table(textConnection("  unique_id seq response detailed.name treatment
+ 1         a  N1   123.23           dN1        T1
+ 2         a  N2   231.12           dN2        T1
+ 3         a  N3   231.23           dN3        T1
+ 4         b  N1   343.23           dN1        T2
+ 5         b  N2   281.13           dN2        T2
+ 6         b  N3   901.23           dN3        T2"))
> 
> cast(x, seq + detailed.name ~ treatment, value = "response")
  seq detailed.name     T1     T2
1  N1           dN1 123.23 343.23
2  N2           dN2 231.12 281.13
3  N3           dN3 231.23 901.23

Building on Harlan's answer - the remelting step can be avoided if the data is already in the long format, and the column holding values is specified in the cast call.

> x <- read.table(textConnection("  unique_id seq response detailed.name treatment
+ 1         a  N1   123.23           dN1        T1
+ 2         a  N2   231.12           dN2        T1
+ 3         a  N3   231.23           dN3        T1
+ 4         b  N1   343.23           dN1        T2
+ 5         b  N2   281.13           dN2        T2
+ 6         b  N3   901.23           dN3        T2"))
> 
> cast(x, seq + detailed.name ~ treatment, value = "response")
  seq detailed.name     T1     T2
1  N1           dN1 123.23 343.23
2  N2           dN2 231.12 281.13
3  N3           dN3 231.23 901.23

回复收藏 0 原文

想你的星星会说话 2024-08-13 19:44:29

另一种选择是使用 tidyr 中的 spread ，

library(tidyr) 
Wide1 <- spread(x[-1], treatment, response)
Wide1
#  seq detailed.name     T1     T2
#1  N1           dN1 123.23 343.23
#2  N2           dN2 231.12 281.13
#3  N3           dN3 231.23 901.23

相反的操作由 gather 执行

gather(Wide1, detailed.name, response, T1:T2)
#  seq detailed.name detailed.name response
#1  N1           dN1            T1   123.23
#2  N2           dN2            T1   231.12
#3  N3           dN3            T1   231.23
#4  N1           dN1            T2   343.23
#5  N2           dN2            T2   281.13
#6  N3           dN3            T2   901.23

此外，还有 dcast.data.table< /code> 来自 data.table

library(data.table)
dcast.data.table(setDT(x), seq + detailed.name~treatment,
                                          value.var='response')
#   seq detailed.name     T1     T2
#1:  N1           dN1 123.23 343.23
#2:  N2           dN2 231.12 281.13
#3:  N3           dN3 231.23 901.23

数据

x <- structure(list(unique_id = structure(c(1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), seq = structure(c(1L, 
2L, 3L, 1L, 2L, 3L), .Label = c("N1", "N2", "N3"), class = "factor"), 
response = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23
), detailed.name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("dN1", 
"dN2", "dN3"), class = "factor"), treatment = structure(c(1L, 
1L, 1L, 2L, 2L, 2L), .Label = c("T1", "T2"), class = "factor")), .Names =
c("unique_id", "seq", "response", "detailed.name", "treatment"), class = 
"data.frame", row.names = c(NA, -6L))

Another option would be to use spread from tidyr

library(tidyr) 
Wide1 <- spread(x[-1], treatment, response)
Wide1
#  seq detailed.name     T1     T2
#1  N1           dN1 123.23 343.23
#2  N2           dN2 231.12 281.13
#3  N3           dN3 231.23 901.23

The opposite action is performed by gather

gather(Wide1, detailed.name, response, T1:T2)
#  seq detailed.name detailed.name response
#1  N1           dN1            T1   123.23
#2  N2           dN2            T1   231.12
#3  N3           dN3            T1   231.23
#4  N1           dN1            T2   343.23
#5  N2           dN2            T2   281.13
#6  N3           dN3            T2   901.23

Also, there is dcast.data.table from data.table

library(data.table)
dcast.data.table(setDT(x), seq + detailed.name~treatment,
                                          value.var='response')
#   seq detailed.name     T1     T2
#1:  N1           dN1 123.23 343.23
#2:  N2           dN2 231.12 281.13
#3:  N3           dN3 231.23 901.23

data

x <- structure(list(unique_id = structure(c(1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), seq = structure(c(1L, 
2L, 3L, 1L, 2L, 3L), .Label = c("N1", "N2", "N3"), class = "factor"), 
response = c(123.23, 231.12, 231.23, 343.23, 281.13, 901.23
), detailed.name = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("dN1", 
"dN2", "dN3"), class = "factor"), treatment = structure(c(1L, 
1L, 1L, 2L, 2L, 2L), .Label = c("T1", "T2"), class = "factor")), .Names =
c("unique_id", "seq", "response", "detailed.name", "treatment"), class = 
"data.frame", row.names = c(NA, -6L))

回复收藏 0 原文

薄凉少年不暖心 2024-08-13 19:44:29

您还可以使用 stats 包中的 reshape 函数。我没有您的示例数据集，但它看起来像这样：

reshape(x, idvar=c("seq","detailed.name"), timevar="treatment", direction="wide")

You can also use the reshape function in the stats package. I don't have your sample dataset, but it will look something like this:

reshape(x, idvar=c("seq","detailed.name"), timevar="treatment", direction="wide")

回复收藏 0 原文

穿越时光隧道 2024-08-13 19:44:29

如果您想使用 reshape2 获得相同的结果，这是对 reshape 包的更快且更内存高效的重写，那么以下内容将起作用。

主要变化是当您想要使用 data.frame 作为输出进行 cast 时使用 dcast 函数。这取代了 reshape 的 cast 函数

library(reshape2)

x = read.table(text = "unique_id seq   response  detailed.name treatment
                           a      N1    123.23         dN1        T1
                           a      N2    231.12         dN2        T1
                           a      N3    231.23         dN3        T1
                           b      N1    343.23         dN1        T2
                           b      N2    281.13         dN2        T2
                           b      N3    901.23         dN3        T2", 
sep = "", header = TRUE)

x

y <- dcast(x, seq + detailed.name ~ treatment, value.var = "response")
y
#   seq detailed.name     T1     T2
# 1  N1           dN1 123.23 343.23
# 2  N2           dN2 231.12 281.13
# 3  N3           dN3 231.23 901.23

# EDIT to show how to return to the original data set:

melt(y, id.vars=c('seq', 'detailed.name'), variable.name='T', value.name='response')

#   seq detailed.name  T response
# 1  N1           dN1 T1   123.23
# 2  N2           dN2 T1   231.12
# 3  N3           dN3 T1   231.23
# 4  N1           dN1 T2   343.23
# 5  N2           dN2 T2   281.13
# 6  N3           dN3 T2   901.23

If you want to get the same results using reshape2, which is a faster and more memory efficient rewrite of the reshape package, then the following will work.

The main change is the use of the dcast function when you want to cast with a data.frame as output. This replaces the cast function of reshape

library(reshape2)

x = read.table(text = "unique_id seq   response  detailed.name treatment
                           a      N1    123.23         dN1        T1
                           a      N2    231.12         dN2        T1
                           a      N3    231.23         dN3        T1
                           b      N1    343.23         dN1        T2
                           b      N2    281.13         dN2        T2
                           b      N3    901.23         dN3        T2", 
sep = "", header = TRUE)

x

y <- dcast(x, seq + detailed.name ~ treatment, value.var = "response")
y
#   seq detailed.name     T1     T2
# 1  N1           dN1 123.23 343.23
# 2  N2           dN2 231.12 281.13
# 3  N3           dN3 231.23 901.23

# EDIT to show how to return to the original data set:

melt(y, id.vars=c('seq', 'detailed.name'), variable.name='T', value.name='response')

#   seq detailed.name  T response
# 1  N1           dN1 T1   123.23
# 2  N2           dN2 T1   231.12
# 3  N3           dN3 T1   231.23
# 4  N1           dN1 T2   343.23
# 5  N2           dN2 T2   281.13
# 6  N3           dN3 T2   901.23

回复收藏 0 原文

~没有更多了~