Common column between combinations of one column

Posted on 2025-01-10 19:08:39

I have a dataset from my analysis. To interpret the results, I am trying to build a data frame.

The result should look like this:

gene_name | Motif_id_1 | Motif_id_2 | Occurrence | Matched_sequence

Here, some motif_id values may share a gene_name, and the result should contain the pairwise combinations of motif_id (overlap allowed).

I have tried the following code, but the result does not give the combinations within motif_id.

merge_practice <- reshape2::dcast(group_geneid_CT,
                                  motif_id + motif_id ~ gene_name,
                                  value.var = "matched_sequence",
                                  drop = TRUE, fill = 0,
                                  fun.aggregate = length)

If possible, I would like the solution to be memory- and time-efficient and to depend on fewer packages. Can anyone offer another perspective?

凌乱心跳 2025-01-17 19:08:39

library(tidyverse)

data <- tribble(
  ~gene_name, ~motif_id, ~matched_sequence,
  "A", "y1", "ccc",
  "A", "y2", "ccc",
  "A", "y1", "aaa",
  "A", "y2", "aaa",
  "A", "y2", "aat",
)

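# Build every unordered pair of distinct motif IDs, then count rows per pair
data %>%
  pull(motif_id) %>%
  unique() %>%
  combn(2) %>% # 2 x n matrix, one pair per column
  t() %>% # transpose so that each row holds one pair
  as_tibble() %>% # unnamed columns become V1 and V2 (hence the warning below)
  rename(from = V1, to = V2) %>%
  mutate(
    co_occurrence = list(from, to) %>% pmap(~ {
      # Stack the rows of both motifs, dropping the motif label
      bind_rows(
        data %>% filter(motif_id == .x) %>% select(-motif_id),
        data %>% filter(motif_id == .y) %>% select(-motif_id)
      ) %>%
        # Count how often each gene/sequence combination appears across the pair
        count(gene_name, matched_sequence, name = "co_occurrent")
    })
  ) %>%
  unnest(co_occurrence)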
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> # A tibble: 3 × 5
#>   from  to    gene_name matched_sequence co_occurrent
#>   <chr> <chr> <chr>     <chr>                   <int>
#> 1 y1    y2    A         aaa                         2
#> 2 y1    y2    A         aat                         1
#> 3 y1    y2    A         ccc                         2

Created on 2022-03-01 by the reprex package (v2.0.0)

co_occurrent should be 2 when a gene_name/matched_sequence combination was found in both motifs, and 1 when it was found in only one of them.
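Since the question asks for fewer package dependencies, here is a minimal base R sketch of one possible reading of the desired output. It assumes that "occurrence" means the number of sequences two motifs share within the same gene, which the question does not state explicitly. The names shared, counts, and result are illustrative only, and the toy data simply mirrors the example above.

```r
# Toy data mirroring the example above (illustrative, not the asker's real data)
data <- data.frame(
  gene_name = c("A", "A", "A", "A", "A"),
  motif_id = c("y1", "y2", "y1", "y2", "y2"),
  matched_sequence = c("ccc", "ccc", "aaa", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows so that each hit is counted once
data <- unique(data)

# Self-join: pair up rows that share both gene_name and matched_sequence
shared <- merge(
  data, data,
  by = c("gene_name", "matched_sequence"),
  suffixes = c("_1", "_2")
)

# Keep each unordered motif pair once and drop self-pairs
shared <- shared[shared$motif_id_1 < shared$motif_id_2, ]

# Assumption: occurrence = number of shared sequences per gene and motif pair
counts <- aggregate(
  matched_sequence ~ gene_name + motif_id_1 + motif_id_2,
  data = shared,
  FUN = length
)
names(counts)[names(counts) == "matched_sequence"] <- "occurrence"

# Attach the count to every shared sequence
result <- merge(shared, counts, by = c("gene_name", "motif_id_1", "motif_id_2"))
print(result)
```

Unlike the tidyverse version, this keeps only sequences found under both motifs, so "aat" does not appear. If "overlap allowed" is meant to include a motif paired with itself, changing `<` to `<=` in the filter would keep those rows. On large inputs the self-join can grow quickly, so this is a starting point rather than a tuned solution.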
