Edit: Combination of one column and a common variable without overlaps

Posted on 2025-01-12 07:18:21

Data updated!

I have an example data set:

Target Start sequence
A      y1    ccc
A      y2    cct
A      y3    aag
A      y3    act
B      y1    aaa
B      y4    aat

and I am trying to get a data set like this in R:

Target Start Start sequence
A      y1    y2    ccc,cct
A      y1    y3    ccc,aag,act
A      y2    y3    cct,aag,act
B      y1    y4    aaa,aat

Every Start value belongs to a Target. For each Target, I am looking for every combination of two different Start values, with no pair repeated, together with the list of their sequences.
I tried to do this with mutate() and combn(), following this link: link, but did not get the result I wanted.

Can anyone help me and give me a chance to learn further?
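
For anyone who wants to reproduce this, the sample data above can be built as a data frame. This is only a sketch: the name df is an assumption (it is the name the answers use), and the columns are assumed to be plain character vectors.

```r
# Sample data from the question.
# Assumption: the object is called `df`, matching the name used in the answers.
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE  # keep text columns as character on older R versions
)

print(df)
```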

Comments (2)

提笔书几行 2025-01-19 07:18:21

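You can achieve this by applying combn() within each group.

library(dplyr)
library(tidyr)

df %>%
  group_by(Target) %>%
  # For each Target, enumerate every pair of rows and name the pair start/end
  summarise(Start = combn(Start, 2, function(x)
                           list(setNames(x, c('start', 'end')))),
            # Join the two sequences of each pair into one string
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  # Spread each named pair into separate start and end columns
  unnest_wider(Start)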

# Target start end   Sequence
#  <chr>  <chr> <chr> <chr>   
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
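
The output above has a single sequence per Start value, whereas the updated data in the question has two rows for Target A and Start y3. One possible way to adapt the combn() idea to that case is to collapse duplicate Target/Start rows first and only then build the pairs. The sketch below is not the answer's own code. It assumes dplyr 1.1.0 or later (for reframe()), assumes every Target has at least two distinct Start values (combn() fails otherwise), and rebuilds df so that it is self-contained.

```r
library(dplyr)

# Sample data from the question (the name `df` is an assumption)
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Step 1: one row per Target/Start, with that pair's sequences joined by commas
collapsed <- df %>%
  group_by(Target, Start) %>%
  summarise(sequence = paste(sequence, collapse = ","), .groups = "drop")

# Step 2: within each Target, enumerate every pair of rows with combn()
pairs <- collapsed %>%
  group_by(Target) %>%
  reframe(
    start    = combn(Start, 2)[1, ],                      # first member of each pair
    end      = combn(Start, 2)[2, ],                      # second member of each pair
    sequence = combn(sequence, 2, paste, collapse = ",")  # both sequence strings joined
  )

print(pairs)
```

On the sample data this should give the four rows requested in the question, with y3 contributing aag,act to both pairs that contain it.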

落花浅忆 2025-01-19 07:18:21

Here is another tidyverse approach that does not use combn().

  1. group_by(Target, Start), so that all sequences sharing the same Target and Start are collapsed into a single row
  2. Drop Start from the grouping (.groups = "drop_last"), leaving the data grouped by Target only
  3. Extract the numeric part of Start into Start_num, so that the Start values can be compared directly
  4. Create a Start2 column holding, for each row, the Start values greater than its own, and store the corresponding sequence strings in sequence2
  5. Expand the data frame on Start2 and sequence2 (after the previous step a row can hold several values)
  6. group_by(Target, Start, Start2), so that sequence can be pasted together with sequence2

library(tidyverse)

df %>% 
  group_by(Target, Start) %>% 
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>% 
  # Compare against x itself: x is a Start_num value, not a row index
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         Start2 = lapply(Start_num, function(x) Start[Start_num > x]),
         sequence2 = lapply(Start_num, function(x) sequence[Start_num > x])) %>% 
  unnest(cols = c(Start2, sequence2)) %>% 
  group_by(Target, Start, Start2) %>% 
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")

# A tibble: 4 × 4
#   Target Start Start2 sequence
#   <chr>  <chr> <chr>  <chr>
# 1 A      y1    y2     ccc,cct
# 2 A      y1    y3     ccc,aag,act
# 3 A      y2    y3     cct,aag,act
# 4 B      y1    y4     aaa,aat
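
To make step 4 more concrete, here is a minimal base R sketch of the "values greater than itself" lookup, run on the two Start values of Target B from the question. The variable names are only illustrative. Each numeric value is compared directly against all the others, so it does not matter that the numbers (1 and 4) are not consecutive.

```r
# Start values for Target B in the question's sample data
Start     <- c("y1", "y4")
Start_num <- c(1, 4)

# For each numeric value x, keep the Start labels whose number is larger than x
start2 <- lapply(Start_num, function(x) Start[Start_num > x])

str(start2)
```

The first element holds "y4" and the second is an empty character vector, which is why the row for the largest Start disappears when the list column is unnested.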