Group data frame rows by a factor and a function, returning the complete original rows

Posted on 2024-12-08 20:36:55

This is my first post and I'm very new to R, so this may be an easy one. I have searched all over for a solution, though, so I'm finally posting for help. Let me know if I need to clarify or provide more information.

I have a large dataframe that looks like the following:

numReads length    name2
0        7384      Ssxb2
7904     93237     St5
3438     12969     Taf9b
0        996       Tas2r138
0        882       Tas2r143
0        960       Tas2r144
0        6761      Tbx10
8125     43804     Tdrd1
8124     43738     Tdrd1
8102     39301     Tdrd1
1227     9286      Thnsl1

How can I group the data by the third column (name2), find the max() value for numReads, and maintain the associated length value?

My ideal output would be the above data without the two lines associated with "Tdrd1" that DO NOT contain the max value for that factor level (the lines with the 8124 and 8102 values).

I have tried tapply(), by(), and aggregate(). None of them can provide me with the proper output.

Thanks in advance.

Edit after comments that came FAR faster than expected. Thank you!

Ideal example results would look like the following

numReads  length  name2
0        7384      Ssxb2
7904     93237     St5
3438     12969     Taf9b
0        996       Tas2r138
0        882       Tas2r143
0        960       Tas2r144
0        6761      Tbx10
8125     43804     Tdrd1
1227     9286      Thnsl1

So it does seem like I have two questions here. The first is to group the data based on a factor. The second is how to calculate a function on the group, but output the entire row after calculating the chosen function.

I like the idea of an aggregate() followed by a merge(). But how will the merge() function know WHICH of the original rows to take the 'length' value from, given only a common factor level?

The data is a snapshot of gene expression data based on transcript annotations. I am trying to select the most highly expressed transcript (in terms of numReads) for each associated 'name2'. I need the length data for downstream normalization.

EDIT after trying to use the very helpful suggestion by ROLO. Thanks again!

Thank you also to Chase and daroczig for their help.

So I am trying to use the ddply() approach: split my data frame by 'name2', sort each group by the number of reads in decreasing order, and select the top row. This effectively gives me the row with the max numReads for each 'name2' group and maintains all my original information, especially the length.

Unfortunately, I'm trying to do this on a dataframe with >34,000 rows. It works fine for ~1000 rows, and even ~5000 rows, but crashes when I give it my whole dataset.

I've tried to use the .parallel option, but it fails with the following error:

Loading required package: foreach
Error: foreach package required for parallel plyr operation

I've also tried to monitor the operation with the .progressbar option. The progress bar makes it to 100%, but the operation never finishes.

Any ideas on how to apply this operation to my complete dataset?

场罚期间 2024-12-15 20:36:56

I might not get exactly what you are after, but I think you want the rows of the data frame that have the highest value of numReads for each level of name2. This can be done easily, e.g. with aggregate followed by merge.

Your demo dataset:

df  <- structure(list(numReads = c(0L, 7904L, 3438L, 0L, 0L, 0L, 0L, 
8125L, 8124L, 8102L, 1227L), length = c(7384L, 93237L, 12969L, 
996L, 882L, 960L, 6761L, 43804L, 43738L, 39301L, 9286L), name2 = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L, 9L), .Label = c("Ssxb2", 
"St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144", "Tbx10", 
"Tdrd1", "Thnsl1"), class = "factor")), .Names = c("numReads", 
"length", "name2"), class = "data.frame", row.names = c(NA, -11L
))

Let us aggregate the data frame by name2 with the max function:

> df.a <- aggregate(numReads ~ name2, df, max)
> df.a
     name2 numReads
1    Ssxb2        0
2      St5     7904
3    Taf9b     3438
4 Tas2r138        0
5 Tas2r143        0
6 Tas2r144        0
7    Tbx10        0
8    Tdrd1     8125
9   Thnsl1     1227

And merge the original values of length to the data frame (df.a):

> merge(df.a, df)
     name2 numReads length
1    Ssxb2        0   7384
2      St5     7904  93237
3    Taf9b     3438  12969
4 Tas2r138        0    996
5 Tas2r143        0    882
6 Tas2r144        0    960
7    Tbx10        0   6761
8    Tdrd1     8125  43804
9   Thnsl1     1227   9286
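
On the question of how merge() knows which original row to use: as far as I understand it, merge() joins on all columns the two data frames share when no `by` is given, which here means both name2 and numReads. So a row of df is kept only if its (name2, numReads) pair also appears in df.a. A minimal sketch of that, where the extra tied Tdrd1 row (`df.tie`) is a hypothetical addition of mine to show what happens with ties:

```r
# Same demo data as above, rebuilt with data.frame() for brevity.
df <- data.frame(
  numReads = c(0L, 7904L, 3438L, 0L, 0L, 0L, 0L, 8125L, 8124L, 8102L, 1227L),
  length   = c(7384L, 93237L, 12969L, 996L, 882L, 960L, 6761L,
               43804L, 43738L, 39301L, 9286L),
  name2    = c("Ssxb2", "St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144",
               "Tbx10", "Tdrd1", "Tdrd1", "Tdrd1", "Thnsl1")
)

df.a <- aggregate(numReads ~ name2, df, max)

# Spelling out the join keys; these are the columns common to df.a and df.
res <- merge(df.a, df, by = c("name2", "numReads"))

# Hypothetical tie: a second Tdrd1 transcript that also has 8125 reads.
df.tie  <- rbind(df, data.frame(numReads = 8125L, length = 40000L, name2 = "Tdrd1"))
res.tie <- merge(aggregate(numReads ~ name2, df.tie, max), df.tie)

# Number of Tdrd1 rows in the tied result: both tied rows survive the merge.
sum(res.tie$name2 == "Tdrd1")
```

So if several transcripts of one gene share the maximum (for example several with 0 reads), this approach returns all of them, and you would need an extra step to pick one.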

I hope I did not misunderstand your question!

感情废物 2024-12-15 20:36:56

There are seemingly two different questions here. The first can be solved with the plyr package:

library(plyr)
txt <- "numReads length    name2

0   7384    Ssxb2
7904  93237      St5
3438  12969    Taf9b
0    996 Tas2r138
0    882 Tas2r143
0    960 Tas2r144
0   6761    Tbx10
8125  43804    Tdrd1
8124  43738    Tdrd1
8102  39301    Tdrd1
1227   9286   Thnsl1
"

dat <- read.table(textConnection(txt), header = TRUE)

ddply(dat, "name2", summarize, max = max(numReads))

Gives you:

     name2  max
1    Ssxb2    0
2      St5 7904
3    Taf9b 3438
4 Tas2r138    0
5 Tas2r143    0
6 Tas2r144    0
7    Tbx10    0
8    Tdrd1 8125
9   Thnsl1 1227

The second question can seemingly be answered with:

dat[dat$name2 == "Tdrd1" & dat$numReads != max(dat$numReads[dat$name2 == "Tdrd1"]),]

   numReads length name2
9      8124  43738 Tdrd1
10     8102  39301 Tdrd1
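
The filter above is hard-coded for "Tdrd1". One possible way to generalize it to every group, sketched here with base R's ave() rather than plyr (the names `group_max`, `top` and `others` are just mine for illustration):

```r
dat <- read.table(text = "numReads length name2
0 7384 Ssxb2
7904 93237 St5
3438 12969 Taf9b
0 996 Tas2r138
0 882 Tas2r143
0 960 Tas2r144
0 6761 Tbx10
8125 43804 Tdrd1
8124 43738 Tdrd1
8102 39301 Tdrd1
1227 9286 Thnsl1", header = TRUE)

# ave() returns, for every row, the max of numReads within that row's name2 group.
group_max <- ave(dat$numReads, dat$name2, FUN = max)

top    <- dat[dat$numReads == group_max, ]  # rows holding their group's maximum
others <- dat[dat$numReads != group_max, ]  # the remaining rows (here the two Tdrd1 rows)
```

Like any comparison against the group maximum, this keeps every tied row when a group has more than one row at the maximum.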

Provide some more context on what you're trying to do and I'll elaborate further.

雨巷深深 2024-12-15 20:36:55

Use plyr to split on name2, then sort each piece by numReads in decreasing order and select the first row:

require(plyr)
ddply(df, "name2", function(dat) {
    dat[order(dat$numReads, decreasing=TRUE), ][1,]
})

  numReads length    name2
1        0   7384    Ssxb2
2     7904  93237      St5
3     3438  12969    Taf9b
4        0    996 Tas2r138
5        0    882 Tas2r143
6        0    960 Tas2r144
7        0   6761    Tbx10
8     8125  43804    Tdrd1
9     1227   9286   Thnsl1
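
If ddply() struggles on the full data set, a vectorized base R alternative may be worth trying. This is only a sketch of a different technique (sort once, then drop repeated keys with duplicated()), not the plyr approach above, and I have not timed it on data of your size. It should avoid calling a function once per group:

```r
df <- data.frame(
  numReads = c(0L, 7904L, 3438L, 0L, 0L, 0L, 0L, 8125L, 8124L, 8102L, 1227L),
  length   = c(7384L, 93237L, 12969L, 996L, 882L, 960L, 6761L,
               43804L, 43738L, 39301L, 9286L),
  name2    = c("Ssxb2", "St5", "Taf9b", "Tas2r138", "Tas2r143", "Tas2r144",
               "Tbx10", "Tdrd1", "Tdrd1", "Tdrd1", "Thnsl1")
)

# Sort by name2, then by numReads in decreasing order within each name2.
sorted <- df[order(df$name2, -df$numReads), ]

# duplicated() flags every occurrence of a name2 after the first one,
# so negating it keeps only the top row of each group.
top <- sorted[!duplicated(sorted$name2), ]
```

Unlike the merge() approach, this returns exactly one row per name2 even when several rows tie for the maximum; which of the tied rows is kept depends on the sort order.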
