Common columns between combinations of one column

Posted on 2025-01-10 19:08:39 · 735 characters · 1 view · 0 comments

I have a dataset from my analysis. To interpret the results, I am trying to build a data frame.

example of dataframe

The result should look like this:

gene_name | Motif_id_1 | Motif_id_2 | Occurrence | Matched_sequence

wanted dataframe

Some motif_id values may share a gene_name, and the result should list every combination of two motif_id values (overlap allowed).

I have tried the following code, but the result does not give the combinations of motif_id values.

merge_practice <- reshape2::dcast(group_geneid_CT,
                                  motif_id + motif_id ~ gene_name,
                                  value.var = "matched_sequence",
                                  drop = TRUE, fill = 0,
                                  fun.aggregate = length)

If possible, I would like the solution to be memory- and time-efficient and to depend on fewer packages. Can anyone suggest another approach?

Comments (1)

凌乱心跳 2025-01-17 19:08:39
library(tidyverse)

data <- tribble(
  ~gene_name, ~motif_id, ~matched_sequence,
  "A", "y1", "ccc",
  "A", "y2", "ccc",
  "A", "y1", "aaa",
  "A", "y2", "aaa",
  "A", "y2", "aat",
)

data %>%
  pull(motif_id) %>%
  unique() %>%
  combn(2) %>%
  t() %>%
  as_tibble() %>%
  rename(from = V1, to = V2) %>%
  mutate(
    co_occurrence = list(from, to) %>% pmap(~ {
      bind_rows(
        data %>% filter(motif_id == .x) %>% select(-motif_id),
        data %>% filter(motif_id == .y) %>% select(-motif_id)
      ) %>%
        count(gene_name, matched_sequence, name = "co_occurrent")
    })
  ) %>%
  unnest(co_occurrence)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> # A tibble: 3 × 5
#>   from  to    gene_name matched_sequence co_occurrent
#>   <chr> <chr> <chr>     <chr>                   <int>
#> 1 y1    y2    A         aaa                         2
#> 2 y1    y2    A         aat                         1
#> 3 y1    y2    A         ccc                         2

Created on 2022-03-01 by the reprex package (v2.0.0)

co_occurrent is 2 if the matched sequence was found in both motifs, or 1 if it was found in only one of them.
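Since the question also asks for fewer package dependencies, here is a minimal base R sketch of a different technique from the tidyverse answer above: a self-join with merge() followed by aggregate(). It rests on assumptions that are not confirmed by the question: that "occurrence" means the number of matched sequences two distinct motifs share within the same gene, and that the wanted Matched_sequence column can be a comma-separated string. The names pairs, occurrence, sequences and result are made up for illustration.

```r
# Same toy data as in the answer above, built with base R only
data <- data.frame(
  gene_name        = c("A", "A", "A", "A", "A"),
  motif_id         = c("y1", "y2", "y1", "y2", "y2"),
  matched_sequence = c("ccc", "ccc", "aaa", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Self-join on gene and sequence: each row pairs two motifs that hit
# the same sequence in the same gene
pairs <- merge(data, data,
               by = c("gene_name", "matched_sequence"),
               suffixes = c("_1", "_2"))

# Keep each unordered pair of distinct motifs once (assumption: self-pairs
# such as y1-y1 are not wanted; use <= instead of < to keep them)
pairs <- pairs[pairs$motif_id_1 < pairs$motif_id_2, ]

# Number of shared sequences per gene and motif pair
occurrence <- aggregate(matched_sequence ~ gene_name + motif_id_1 + motif_id_2,
                        data = pairs, FUN = length)
names(occurrence)[4] <- "occurrence"

# The shared sequences themselves, collapsed into one string per pair
sequences <- aggregate(matched_sequence ~ gene_name + motif_id_1 + motif_id_2,
                       data = pairs,
                       FUN = function(s) paste(sort(s), collapse = ","))

result <- merge(occurrence, sequences,
                by = c("gene_name", "motif_id_1", "motif_id_2"))
print(result)
```

If the input can contain repeated rows for the same gene, motif and sequence, running unique(data) before the join keeps the counts from being inflated. The self-join grows with the square of the number of motifs per gene and sequence, so on large inputs it may be worth checking memory use before relying on it.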
