Extracting matching variables in summarize()

Posted on 2025-01-16 01:48:45

I have an example data set:

gene_name motif_id matched_sequence
A         y1       CCC
A         y2       CCAAA
A         y3       AAG
A         y3       AT
B         y1       AAAA
B         y4       AAT
C         y5       AAGG

and I am trying to get a data set like this in R:

gene_name Node1 Node2 sequence        occurence
A         y1    y2    CCC, CCAAA      2
A         y1    y3    CCC, AAG, AT    3
A         y2    y3    CCAAA, AAG, AT  3
B         y1    y4    AAAA, AAT       2

The motif_id column always holds a target. For each gene_name I am looking for every combination of two different motif_id values, without repeating a pair, together with the list of sequences matched by those motifs.

I have tried:

data%>% 
  group_by(gene_name, motif_id) %>% 
  summarize(matched_sequence = paste0(matched_sequence, collapse = ",")) %>% 
  mutate(count = n()) %>% filter(count>=2) %>%
  summarize(motif_id = combn(motif_id, 2, function(x) list(setNames(x, c('Node1', 'Node2')))), matched_sequence = toString(matched_sequence),
            .groups = 'keep') %>%
  tidyr::unnest_wider(motif_id) 

However, this failed to produce the sequence and occurence columns. Can anyone give me some advice?

Comments (1)

八巷 2025-01-23 01:48:45

We group by 'gene_name' and keep only the groups where the number of distinct (n_distinct) elements in 'motif_id' is greater than 1. We then get the pairwise combinations (combn) of the 'unique' elements, create 'sequence' by extracting the 'matched_sequence' values whose 'motif_id' is in the pair, store the lengths of that list in 'occurence', use unnest_wider to create columns from the list column, and convert the 'sequence' list to a character column by pasting the elements of each list entry together (toString).

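library(dplyr)
library(purrr)
library(tidyr)
library(stringr)

data %>%
  dplyr::group_by(gene_name) %>%
  # keep genes that have at least two distinct motifs to pair up
  dplyr::filter(n() > 1, n_distinct(motif_id) > 1) %>%
  dplyr::summarise(
    # every unordered pair of distinct motif ids within the gene
    Node = combn(unique(motif_id), 2, simplify = FALSE),
    # sequences whose motif id belongs to the pair
    sequence = purrr::map(Node, ~ matched_sequence[motif_id %in% .x]),
    # number of sequences per pair
    occurence = lengths(sequence),
    .groups = 'drop'
  ) %>%
  # spread each pair into two columns (auto-named ...1, ...2)
  tidyr::unnest_wider(Node) %>%
  # collapse each sequence vector into one comma-separated string
  dplyr::mutate(sequence = purrr::map_chr(sequence, toString)) %>%
  # rename ...1, ...2 to Node1, Node2
  dplyr::rename_with(~ stringr::str_c("Node", seq_along(.x)), starts_with("..."))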

-output

# A tibble: 4 × 5
  gene_name Node1 Node2 sequence       occurence
  <chr>     <chr> <chr> <chr>              <int>
1 A         y1    y2    CCC, CCAAA             2
2 A         y1    y3    CCC, AAG, AT           3
3 A         y2    y3    CCAAA, AAG, AT         3
4 B         y1    y4    AAAA, AAT              2
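
To see what the pairing step does inside a single group, here is a minimal base R sketch for gene A only. The names `pairs`, `sequences`, and `gene_a` are illustrative and are not part of the pipeline above. The column spelling `occurence` follows the question.

```r
# Values for gene_name == "A", copied from the example data
motif_id <- c("y1", "y2", "y3", "y3")
matched_sequence <- c("CCC", "CCAAA", "AAG", "AT")

# All unordered pairs of distinct motif ids, as a list of length-2 vectors
pairs <- combn(unique(motif_id), 2, simplify = FALSE)

# For each pair, keep the sequences whose motif id is one of the two
sequences <- lapply(pairs, function(p) matched_sequence[motif_id %in% p])

# Assemble one row per pair; lengths() counts the sequences in each pair
gene_a <- data.frame(
  Node1 = vapply(pairs, `[`, character(1), 1),
  Node2 = vapply(pairs, `[`, character(1), 2),
  sequence = vapply(sequences, toString, character(1)),
  occurence = lengths(sequences),
  stringsAsFactors = FALSE
)
gene_a
```

The three rows of `gene_a` correspond to the three gene A rows in the output above. The dplyr pipeline does the same thing once per gene_name.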

data

data <- structure(list(gene_name = c("A", "A", "A", "A", "B", "B", "C"
), motif_id = c("y1", "y2", "y3", "y3", "y1", "y4", "y5"), 
matched_sequence = c("CCC", 
"CCAAA", "AAG", "AT", "AAAA", "AAT", "AAGG")), 
class = "data.frame", row.names = c(NA, 
-7L))
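
On recent package versions the pipeline above may need small changes. To my knowledge, dplyr 1.1.0 and later warn when summarise() returns more than one row per group and point to reframe() instead, and tidyr 1.3.0 and later require names_sep when unnest_wider() is applied to unnamed vectors. The following is a sketch under those assumptions, not something checked against every version. `result` is an illustrative name, and the data frame is rebuilt with data.frame() only to keep the block self-contained.

```r
library(dplyr)
library(purrr)
library(tidyr)

data <- data.frame(
  gene_name = c("A", "A", "A", "A", "B", "B", "C"),
  motif_id = c("y1", "y2", "y3", "y3", "y1", "y4", "y5"),
  matched_sequence = c("CCC", "CCAAA", "AAG", "AT", "AAAA", "AAT", "AAGG"),
  stringsAsFactors = FALSE
)

result <- data %>%
  group_by(gene_name) %>%
  # a pair needs at least two distinct motif ids in the gene
  filter(n_distinct(motif_id) > 1) %>%
  # reframe() (assumed dplyr >= 1.1.0) allows several rows per group
  # and always returns an ungrouped data frame
  reframe(
    Node = combn(unique(motif_id), 2, simplify = FALSE),
    sequence = map_chr(Node, ~ toString(matched_sequence[motif_id %in% .x])),
    occurence = map_int(Node, ~ sum(motif_id %in% .x))
  ) %>%
  # names_sep = "" pastes "Node" and the position, giving Node1 and Node2
  unnest_wider(Node, names_sep = "")

result
```

Because names_sep generates the column names directly, the rename_with() step and the stringr dependency are not needed in this variant.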