Finding rows common to several, but not all, available data frames, for all possible combinations of those data frames

Posted on 2025-01-17 20:17:54

I have multiple data frames with the following format:

        Gene Entrez.Id                                    Dataset Correlation
1     MTHFD2     10797 CRISPR (DepMap 22Q1 Public+Score, Chronos)   0.3328479
2   SLC25A32     81034 CRISPR (DepMap 22Q1 Public+Score, Chronos)   0.3111028
3    MTHFD1L     25902 CRISPR (DepMap 22Q1 Public+Score, Chronos)   0.2710356
4       DTX3    196403 CRISPR (DepMap 22Q1 Public+Score, Chronos)   0.2672314

My aim was to find elements in the Gene column that were common to all data frames, for which I used the following command:

df.join <- join_all(list(df1,df2,df3,df4,df5), by = "Gene", type = "inner")

However, no Gene value is common to all of the data frames, so df.join is empty.
Now I want to know whether there are elements in the Gene column that are common to most of the data frames but not all, say 4 out of 5. Is there a way to do this without manually writing code for every possible combination of data frames?

任谁 2025-01-24 20:17:54

One option involving dplyr and purrr could be:

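library(dplyr)
library(purrr)

# Collect df1, df2, ... by name; "^df[0-9]+$" avoids also picking up df.join
dfs <- mget(ls(pattern = "^df[0-9]+$"))

ids_to_join <- dfs %>%
    map_dfr(~ select(., "Gene"), .id = "dataset") %>%
    group_by(Gene) %>%
    summarise(n = n_distinct(dataset)) %>%
    ungroup() %>%
    filter(n >= 4) %>% # minimum number of datasets a gene must appear in
    pull(Gene)

dfs %>%
    map(~ filter(., Gene %in% ids_to_join)) %>%
    reduce(full_join, 
           by = "Gene")

In this approach, the first step identifies the genes that are present in at least the required number of datasets (here 4 out of 5). The second step keeps only those genes in each data frame and joins the results together. A full_join is used because a gene found in only 4 of the 5 data frames would be dropped by an inner_join across all five; the columns coming from the data frame that lacks the gene are filled with NA.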
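
To make the counting step concrete, here is a minimal base R sketch on toy data. The data frames, the gene lists in them, and the helper `make_df` are made up for illustration; only the `Gene` column name comes from the question. It just counts, for each gene, how many data frames contain it and keeps those that reach the threshold.

```r
# Hypothetical helper and toy data standing in for df1..df5
make_df <- function(genes) data.frame(Gene = genes, stringsAsFactors = FALSE)

dfs <- list(
  df1 = make_df(c("MTHFD2", "SLC25A32", "DTX3")),
  df2 = make_df(c("MTHFD2", "SLC25A32")),
  df3 = make_df(c("MTHFD2", "MTHFD1L")),
  df4 = make_df(c("MTHFD2", "SLC25A32", "MTHFD1L")),
  df5 = make_df(c("SLC25A32", "DTX3"))
)

# unique() guards against a gene listed twice within one data frame,
# so each data frame contributes at most 1 to a gene's count
gene_counts <- table(unlist(lapply(dfs, function(d) unique(d$Gene))))

min_datasets <- 4  # required number of data frames
common_genes <- names(gene_counts)[gene_counts >= min_datasets]
print(common_genes)
```

With this toy input, MTHFD2 and SLC25A32 each occur in four of the five data frames, so they are the only genes retained.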

If information on which datasets contain each gene is also needed:

ids_to_join <- mget(ls(pattern = "^df[0-9]+$")) %>%
    map_dfr(~ select(., "Gene"), .id = "dataset") %>%
    group_by(Gene) %>%
    summarise(n = n_distinct(dataset),
              dataset = paste(unique(dataset), collapse = ", ")) %>%
    ungroup() %>%
    filter(n >= 4) %>% # minimum number of datasets a gene must appear in
    select(-n)

# full_join keeps genes that are missing from one of the data frames
mget(ls(pattern = "^df[0-9]+$")) %>%
    map(~ filter(., Gene %in% ids_to_join[["Gene"]])) %>%
    reduce(full_join, 
           by = "Gene") %>%
    left_join(ids_to_join,
              by = "Gene")
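
The same idea of tracking which data frames contain each gene can also be sketched in base R with a presence matrix. As before, the toy data and the helper `make_df` are hypothetical; this is an illustration under those assumptions, not a replacement for the dplyr code above.

```r
# Hypothetical helper and toy data standing in for df1..df5
make_df <- function(genes) data.frame(Gene = genes, stringsAsFactors = FALSE)

dfs <- list(
  df1 = make_df(c("MTHFD2", "SLC25A32", "DTX3")),
  df2 = make_df(c("MTHFD2", "SLC25A32")),
  df3 = make_df(c("MTHFD2", "MTHFD1L")),
  df4 = make_df(c("MTHFD2", "SLC25A32", "MTHFD1L")),
  df5 = make_df(c("SLC25A32", "DTX3"))
)

all_genes <- sort(unique(unlist(lapply(dfs, function(d) d$Gene))))

# Logical matrix: one row per gene, one column per data frame
presence <- sapply(dfs, function(d) all_genes %in% d$Gene)
rownames(presence) <- all_genes

min_datasets <- 4
keep <- rowSums(presence) >= min_datasets

# For each retained gene, list the data frames that contain it
found_in <- apply(presence[keep, , drop = FALSE], 1,
                  function(row) paste(names(dfs)[row], collapse = ", "))
print(found_in)
```

The matrix form makes it easy to change the threshold or to inspect exactly which data frame is missing a given gene.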
