Extracting (or reshaping) a data frame from an existing data frame in R

Posted on 2024-12-09 05:13:04

I have a large data frame that I'm working with; the first few lines are as follows:

      Assay   Genotype   Sample    Result
1     001        G         1         0
2     001        A         2         1
3     001        G         3         0 
4     001        NA        4         NA
5     002        T         1         0
6     002        G         2         1
7     002        T         3         0 
8     002        T         4         0
9     003        NA        1         N
10    003        G         2         1
11    003        G         3         1 
12    003        T         4         0

In total I'll be working with 2000 samples and 168 assays per sample. For each sample, I'd like to extract the values in 'Result' to create either a list or a data frame that looks something like this:

Sample  Data
   1    00N
   2    111
   3    001
   4    N00

The resulting data frame (or similar preferred data structure) would thus have 2000 rows and 2 columns. Each 'Data' entry would contain 168 characters, one for each 'Assay'.

Can somebody help me with this problem?
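
The answers below refer to a data frame named dat. For anyone who wants to run them, here is a minimal sketch that rebuilds the twelve rows shown above. The column types are an assumption (Assay and Result stored as character, with the blank entries as real NA values); the actual data may differ.

```r
# Rebuild the twelve sample rows from the question.
# Assumption: Assay and Result are character columns and the "NA" entries
# in the printout are real missing values, not the text "NA".
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

str(dat)
```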

情栀口红 2024-12-16 05:13:05

One approach with package plyr and base function paste:

library(plyr)
ddply(dat, "Sample", summarize, Data = paste(Result, collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

EDIT to address question

Probably the easiest way I can think of to change your NA to N is to use gsub on the result of ddply. Note I'm liberally borrowing the very good point provided by @Brian re: ordering. Do that, it's a good tip!

out <- ddply(dat, "Sample", summarize, Data = paste(Result[order(Assay)], collapse = ""))

Then use gsub

out$Data <- gsub("NA", "N", out$Data)

Et voilà:

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4  N00
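
The gsub step is needed because paste writes a missing value as the text "NA". An alternative is to replace the missing values before pasting. The sketch below does that in a small helper; collapse_results is a hypothetical name introduced here for illustration, not part of plyr or base R.

```r
# Hypothetical helper: collapse one sample's results into a single string,
# ordered by assay, writing missing values as "N".
collapse_results <- function(result, assay) {
  result <- as.character(result[order(assay)])  # put results in assay order
  result[is.na(result)] <- "N"                  # replace NA before pasting
  paste(result, collapse = "")
}

# paste() on its own turns NA into the two-character text "NA":
paste(c(NA, "0", "0"), collapse = "")

# The helper reorders by assay and substitutes "N" instead:
collapse_results(c("0", NA, "0"), assay = c("002", "001", "003"))
```

It should be possible to call the helper inside ddply as Data = collapse_results(Result, Assay), although that combination was not part of the original answer.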

燕归巢 2024-12-16 05:13:05

Base R solution using split and sapply:

sapply(split(dat$Result, dat$Sample), paste, collapse="")

     1      2      3      4 
 "00N"  "111"  "001" "NA00"
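
The result above is a named character vector rather than the two-column data frame asked for. A minimal sketch of one way to get there in base R follows; it also sorts by Sample and Assay first so the characters come out in assay order. It assumes the sample IDs are integers and rebuilds the example data so that it runs on its own.

```r
# Example data as in the question (column types are an assumption).
dat <- data.frame(
  Assay    = rep(c("001", "002", "003"), each = 4),
  Genotype = c("G", "A", "G", NA, "T", "G", "T", "T", NA, "G", "G", "T"),
  Sample   = rep(1:4, times = 3),
  Result   = c("0", "1", "0", NA, "0", "1", "0", "0", "N", "1", "1", "0"),
  stringsAsFactors = FALSE
)

# Sort so that each sample's results are in assay order.
sorted <- dat[order(dat$Sample, dat$Assay), ]

# One string per sample; the vector names are the sample IDs.
strings <- sapply(split(sorted$Result, sorted$Sample), paste, collapse = "")

# Convert the named vector into a two-column data frame.
out <- data.frame(
  Sample = as.integer(names(strings)),  # assumes integer sample IDs
  Data   = unname(strings),
  stringsAsFactors = FALSE
)
out
```

As in the output above, the missing value for sample 4 still appears as "NA" here.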

疾风者 2024-12-16 05:13:05

Note that @Chase and @Andrie both assume that the data is already sorted by assay (which your example is, so not an unreasonable assumption). If it is not, you can still get the string in the proper order.

Adapting @Chase's solution

library(plyr)
ddply(dat, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

gives

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00

If we use data which is not sorted:

dat.scramble <- dat[sample(nrow(dat)),]

> dat.scramble
   Assay Genotype Sample Result
6    002        G      2      1
1    001        G      1      0
3    001        G      3      0
7    002        T      3      0
10   003        G      2      1
8    002        T      4      0
12   003        T      4      0
5    002        T      1      0
2    001        A      2      1
4    001       NA      4     NA
9    003       NA      1      N
11   003        G      3      1

we still get the same result

ddply(dat.scramble, "Sample", summarize, 
  Data = paste(Result[order(Assay)], collapse = ""))

  Sample Data
1      1  00N
2      2  111
3      3  001
4      4 NA00
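
Since every string is supposed to have exactly one character per assay (168 in the full data set), a quick length check can flag samples where something went wrong, such as an NA that was pasted as two characters or a missing assay row. The sketch below applies that check to the output shown above; n_assays is a name introduced here for illustration.

```r
# Output as printed above, typed in by hand so the check runs on its own.
out <- data.frame(
  Sample = 1:4,
  Data   = c("00N", "111", "001", "NA00"),
  stringsAsFactors = FALSE
)

n_assays <- 3  # would be 168 for the full data set described in the question

# Rows whose string length does not match the number of assays.
bad <- out[nchar(out$Data) != n_assays, ]
bad
```

Here the check picks out sample 4, whose missing value was written as "NA".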
