split function does not return any observations with a large dataset

Posted on 2025-01-13 18:16:08

I have a data frame like this:

seqnames       pos     strand    nucleotide     count
    id1         12        +          A            13
    id1         13        +          C            25
    id2         24        +          G            10
    id2         25        +          T            25
    id2         26        +          A            10
    id3         10        +          C            5

But it has more than 100,000 rows in total, and seqnames has 3138 levels. I would like to split it into a list of data frames according to seqnames, so I used the split function:

data_list <- split(data,data$seqnames)

But it only returns something like this:

List of 3138
 $ id1:'data.frame':    0 obs. of  6 variables:
  ..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
  ..$ pos       : int(0) 
  ..$ strand    : Factor w/ 3 levels "+","-","*": 
  ..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
  ..$ count     : int(0) 
  ..$ sample_id : chr(0) 
 $ id2:'data.frame':    0 obs. of  6 variables:
  ..$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 
  ..$ pos       : int(0) 
  ..$ strand    : Factor w/ 3 levels "+","-","*": 
  ..$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 
  ..$ count     : int(0) 
  ..$ sample_id : chr(0)

I can't figure out why this happens, because I have used split on a made-up data frame containing only numbers (with far fewer rows than this one, of course) and it worked.
How can I solve this problem?

生来就爱笑 2025-01-20 18:16:08

The column 'seqnames' is a factor with many unused levels. By default, split creates one list element per factor level, so levels with no matching rows come back as data.frames with 0 rows. split has a drop argument (FALSE by default); with drop = TRUE, those empty list elements are removed. If we instead want those elements replaced by NULL, find the elements whose number of rows (nrow) is 0 and assign NULL to them. With the example data given at the end, the default split gives:

> data_list <- split(data, data$seqnames)
> str(data_list)
List of 5
 $ id1:'data.frame':    2 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
  ..$ pos       : int [1:2] 12 13
  ..$ strand    : chr [1:2] "+" "+"
  ..$ nucleotide: chr [1:2] "A" "C"
  ..$ count     : int [1:2] 13 25
 $ id2:'data.frame':    3 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
  ..$ pos       : int [1:3] 24 25 26
  ..$ strand    : chr [1:3] "+" "+" "+"
  ..$ nucleotide: chr [1:3] "G" "T" "A"
  ..$ count     : int [1:3] 10 25 10
 $ id3:'data.frame':    1 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
  ..$ pos       : int 10
  ..$ strand    : chr "+"
  ..$ nucleotide: chr "C"
  ..$ count     : int 5
 $ id4:'data.frame':    0 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
  ..$ pos       : int(0) 
  ..$ strand    : chr(0) 
  ..$ nucleotide: chr(0) 
  ..$ count     : int(0) 
 $ id5:'data.frame':    0 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 
  ..$ pos       : int(0) 
  ..$ strand    : chr(0) 
  ..$ nucleotide: chr(0) 
  ..$ count     : int(0)

Assigning NULL to the empty elements:

data_list[sapply(data_list, nrow) == 0] <- list(NULL)

Checking again:

> str(data_list)
List of 5
 $ id1:'data.frame':    2 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 1 1
  ..$ pos       : int [1:2] 12 13
  ..$ strand    : chr [1:2] "+" "+"
  ..$ nucleotide: chr [1:2] "A" "C"
  ..$ count     : int [1:2] 13 25
 $ id2:'data.frame':    3 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 2 2 2
  ..$ pos       : int [1:3] 24 25 26
  ..$ strand    : chr [1:3] "+" "+" "+"
  ..$ nucleotide: chr [1:3] "G" "T" "A"
  ..$ count     : int [1:3] 10 25 10
 $ id3:'data.frame':    1 obs. of  5 variables:
  ..$ seqnames  : Factor w/ 5 levels "id1","id2","id3",..: 3
  ..$ pos       : int 10
  ..$ strand    : chr "+"
  ..$ nucleotide: chr "C"
  ..$ count     : int 5
 $ id4: NULL
 $ id5: NULL
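
If the empty elements should be removed from the list entirely rather than kept as NULL placeholders, filtering on the row count is one possibility. The following is only a sketch on a small hypothetical list (the names `non_empty` and the toy `data_list` are made up for illustration and are not part of the original answer):

```r
# Hypothetical toy list mimicking the result of split():
# two data frames with rows and one empty data frame
data_list <- list(
  id1 = data.frame(pos = c(12L, 13L)),
  id2 = data.frame(pos = c(24L, 25L, 26L)),
  id4 = data.frame(pos = integer(0))
)

# Count the rows per element to see which groups are empty
row_counts <- vapply(data_list, nrow, integer(1))

# Keep only the elements that have at least one row
non_empty <- Filter(function(d) nrow(d) > 0, data_list)
names(non_empty)  # "id1" "id2"
```

Unlike the NULL assignment, this shortens the list, so the names of the empty groups are no longer present in the result.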

Data

data <- structure(list(seqnames = structure(c(1L, 1L, 2L, 2L, 2L, 
3L), .Label = c("id1", 
"id2", "id3", "id4", "id5"), class = "factor"), pos = c(12L, 
13L, 24L, 25L, 26L, 10L), strand = c("+", "+", "+", "+", "+", 
"+"), nucleotide = c("A", "C", "G", "T", "A", "C"), count = c(13L, 
25L, 10L, 25L, 10L, 5L)), row.names = c(NA, -6L), class = "data.frame")
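
The drop = TRUE option mentioned above is not shown in the answer, so here is a minimal sketch of how it might look. It rebuilds the same toy data with data.frame() so that it runs on its own; the droplevels() variant and the name `data2` are additions for illustration, not something the original answer uses.

```r
# Same toy data as above: 'seqnames' has 5 levels but only 3 occur
data <- data.frame(
  seqnames   = factor(c("id1", "id1", "id2", "id2", "id2", "id3"),
                      levels = c("id1", "id2", "id3", "id4", "id5")),
  pos        = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand     = "+",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count      = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# Default (drop = FALSE): one list element per level, including empty ones
length(split(data, data$seqnames))  # 5

# drop = TRUE: levels that do not occur are left out of the result
data_list <- split(data, data$seqnames, drop = TRUE)
names(data_list)  # "id1" "id2" "id3"

# Alternative: remove unused levels from the factor before splitting
data2 <- droplevels(data)
nlevels(data2$seqnames)  # 3
```

With drop = TRUE the factor inside each piece still carries all 5 levels; droplevels() changes the factor itself, which may or may not be what is wanted downstream.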
