Edit: Combinations of one column without overlaps, by a common variable

Posted on 2025-01-12 07:18:21

Data updated!

I have an example data set

Target	Start	sequence
A	y1	ccc
A	y2	cct
A	y3	aag
A	y3	act
B	y1	aaa
B	y4	aat

and I am trying to get a data set like this in R:

Target	Start	Start	sequence
A	y1	y2	ccc,cct
A	y1	y3	ccc,aag,act
A	y2	y3	cct,aag,act
B	y1	y4	aaa,aat

Every Start value always belongs to a Target. Within each Target, I am looking for every combination of two different Start values, without any overlaps, together with the list of sequences that belong to them.
I have tried to manipulate the data with mutate() and comb() with the help of the following link: link, but did not get the result I wanted.

Can anyone help me and give me a chance to learn further?
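
For anyone who wants to reproduce this, the example data above could be built as follows. This is only a minimal sketch: the object name `df` is an assumption (chosen because the answers below refer to `df`), and all columns are assumed to be plain character vectors.

```r
# Example data from the question, typed in by hand.
# `df` is an assumed name; the column names follow the table above.
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

print(df)
```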

提笔书几行 2025-01-19 07:18:21

You may achieve this by using combn for each group.

library(dplyr)
library(tidyr)

df %>%
  group_by(Target) %>%
  summarise(Start = combn(Start, 2, function(x) 
                           list(setNames(x, c('start', 'end')))), 
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  unnest_wider(Start)

# Target start end   Sequence
#  <chr>  <chr> <chr> <chr>   
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
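
Note that the printed output above has one sequence per Start. In the question's data, y3 occurs twice within Target A, so calling combn() on the raw rows would pair rows rather than distinct Start values. One possible way around that, sketched here in base R only (this is a swapped-in variation, not the answer's own code), is to collapse the sequences per Target and Start first and then pair the collapsed rows. The names `collapsed`, `pair_rows`, and `result` are made up for illustration.

```r
# Example data from the question (`df` is an assumed name).
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Step 1: one row per (Target, Start), with its sequences joined by commas.
collapsed <- aggregate(sequence ~ Target + Start, data = df,
                       FUN = paste, collapse = ",")

# Step 2: within one Target, pair every two distinct rows.
pair_rows <- function(g) {
  if (nrow(g) < 2) return(NULL)   # a single Start value has no pair
  idx <- combn(nrow(g), 2)        # each column holds one pair of row indices
  data.frame(
    Target   = g$Target[1],
    start    = g$Start[idx[1, ]],
    end      = g$Start[idx[2, ]],
    Sequence = paste(g$sequence[idx[1, ]], g$sequence[idx[2, ]], sep = ","),
    stringsAsFactors = FALSE
  )
}

# Step 3: apply per Target and stack the pieces.
result <- do.call(rbind, lapply(split(collapsed, collapsed$Target), pair_rows))
rownames(result) <- NULL

print(result)
```

The pairs follow the row order produced by aggregate(), which sorts Start as text. That is fine for y1 to y4, but labels such as y10 would sort before y2, so a numeric sort key would be needed in that case.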

落花浅忆 2025-01-19 07:18:21

Here is another tidyverse approach without the use of combn().

group_by(Target, Start) so that all sequences with the same Target and Start can be collapsed into a single row
.groups = "drop_last" in summarize() drops Start from the grouping, so the data stays grouped by Target only
Extract the numeric part of Start into Start_num, so that the Start values can be compared directly
Create a Start2 column holding, for each row, the Start values greater than its own, and store the corresponding sequence strings in a sequence2 column
Expand the data frame on Start2 and sequence2 with unnest() (since each row can hold several values)
group_by(Target, Start, Start2) so that sequence can be pasted together with sequence2

library(tidyverse)

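df %>% 
  # Collapse sequences that share the same Target and Start into one row
  group_by(Target, Start) %>% 
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>% 
  # Still grouped by Target here, so the comparisons below stay within a Target
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         # x is a Start_num value, not a position, so compare against x itself;
         # lapply() always returns a list, whatever the lengths of the results
         Start2 = lapply(Start_num, function(x) Start[Start_num > x]),
         sequence2 = lapply(Start_num, function(x) sequence[Start_num > x])) %>% 
  # One row per (Start, Start2) pair; rows with no greater Start are dropped
  unnest(cols = c(Start2, sequence2)) %>% 
  group_by(Target, Start, Start2) %>% 
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")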

# A tibble: 4 × 4
  Target Start Start2 sequence   
  <chr>  <chr> <chr>  <chr>      
1 A      y1    y2     ccc,cct    
2 A      y1    y3     ccc,aag,act
3 A      y2    y3     cct,aag,act
4 B      y1    y4     aaa,aat
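
To illustrate why the steps above compare the numeric part of Start instead of the labels themselves, here is a small base R sketch. The vector `starts` is made-up data, not from the question, and sub() is used in place of str_extract() only to keep the example free of packages.

```r
# Made-up labels, including a two-digit one.
starts <- c("y2", "y10", "y1")

# Strip the non-digit prefix and convert what is left to a number.
start_num <- as.numeric(sub("\\D+", "", starts))

# Ordering by the numeric part puts y10 last, as intended.
by_number <- starts[order(start_num)]
print(by_number)

# Sorting the labels as text compares character by character instead,
# so "y10" is placed before "y2".
print(sort(starts))
```

With labels that only go from y1 to y9 both orderings agree, which is why the issue does not show up in the example data.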
