How do I calculate the percent overlap of distributions in R?

Posted 2025-01-24 01:12:43 · 1302 characters · 2 views · 0 comments

I have a dummy dataframe below where I'd like to calculate the pairwise percent overlap between site distributions. Basically, what percent of site1 and site2 are overlapping, site2 vs site3 and site1 vs site3?

df <- structure(list(site = c("site1", "site1", "site1", "site1", "site1", 
"site1", "site1", "site1", "site1", "site1", "site2", "site2", 
"site2", "site2", "site2", "site2", "site2", "site2", "site2", 
"site2", "site3", "site3", "site3", "site3", "site3", "site3", 
"site3", "site3", "site3", "site3"), total = c(0.4191, 0.2844, 
0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671, 
0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446, 
0.4255, 0.3208, 0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539, 
0.357, 0.2865, 0.3624, 0.3026)), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
    site = c("site1", "site2", "site3"), .rows = structure(list(
        1:10, 11:20, 21:30), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

library(ggplot2)

ggplot(df, aes(x = total, group = site, fill = site)) +
  geom_density(adjust = 1.5, alpha = 0.3)


Comments (1)

半夏半凉 2025-01-31 01:12:43

Your density plot is perhaps a little misleading, since a density plot extends beyond the actual range of the data on the x axis, and so tends to suggest much more overlap than actually exists in your data. A better visualization might be:

library(dplyr)
library(ggplot2)

df %>%
  group_by(site) %>%
  mutate(site = factor(site)) %>%
  summarize(xmin = min(total), xmax = max(total), 
            ymin = as.numeric(site), ymax = as.numeric(site)) %>%
  ggplot() +
  geom_segment(aes(x = xmin, xend = xmax, y = ymin, yend = ymax, color = site),
               size = 2) +
  scale_y_continuous(breaks = 1:3, expand = c(1, 1)) +
  theme_bw()
#> `summarise()` has grouped output by 'site'. You can override using the
#> `.groups` argument.

One approach to creating pairwise comparisons is to use expand.grid to get every ordered pair of sites (this also includes each site paired with itself, which we remove later):

comp_df <- expand.grid(A = sort(unique(df$site)), 
                       B = sort(unique(df$site)))
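
As a quick check of what expand.grid produces here, the sketch below builds the same grid from a hard-coded vector of the three site names (the names `sites` and `pairs` are just for illustration and are not used elsewhere in this answer). It shows that the grid holds all 3 x 3 ordered pairs, three of which pair a site with itself:

```r
# Illustrative only: same idea as comp_df, built from a literal vector
sites <- c("site1", "site2", "site3")
pairs <- expand.grid(A = sites, B = sites, stringsAsFactors = FALSE)

nrow(pairs)              # number of ordered pairs, including self-pairs
sum(pairs$A == pairs$B)  # number of self-pairs
```

Both (site1, site2) and (site2, site1) are kept deliberately, since the overlap measure used below is not symmetric.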

Then we need a function that takes the names of two sites and calculates how much their ranges overlap, as a proportion of the first site's range. I'm doing this here in a rather pedestrian way using simple arithmetic:

comp_func <- function(a, b) {
  # Observed range (min, max) of each site
  range_a <- range(df$total[df$site == a])
  range_b <- range(df$total[df$site == b])
  # Intersection of the two ranges
  lower <- max(range_a[1], range_b[1])
  upper <- min(range_a[2], range_b[2])
  # Proportion of a's range covered by b's range (0 if the ranges are disjoint)
  max(0, upper - lower) / (range_a[2] - range_a[1])
}
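
To make the arithmetic concrete, here is a minimal self-contained sketch of the same range calculation applied directly to two numeric vectors. The helper name `range_overlap` is made up for this illustration, and unlike the function above it clamps the result at 0 when the ranges do not intersect. The site1 and site2 values are copied from the question's data:

```r
# Hypothetical helper: proportion of x's range that is covered by y's range
range_overlap <- function(x, y) {
  lower <- max(min(x), min(y))   # start of the shared interval
  upper <- min(max(x), max(y))   # end of the shared interval
  max(0, upper - lower) / (max(x) - min(x))
}

site1 <- c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938,
           0.3287, 0.2992, 0.4062, 0.2946, 0.2671)
site2 <- c(0.3832, 0.3875, 0.3118, 0.4506, 0.4215,
           0.4266, 0.3518, 0.4446, 0.4255, 0.3208)

range_overlap(site1, site2)      # (0.4191 - 0.3118) / (0.4191 - 0.2611)
range_overlap(site2, site1)      # (0.4191 - 0.3118) / (0.4506 - 0.3118)
range_overlap(c(0, 1), c(2, 3))  # disjoint ranges
```

The measure is asymmetric because the denominator is always the width of the first argument's range.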

Now we can Map this function over the rows of our comparison data frame, so that we get an overlap estimate for each ordered pair of sites.

comp_df$overlap <- unlist(Map(comp_func, a = comp_df$A, b = comp_df$B))

Finally, we want to remove the rows where a site is compared with itself, since that overlap will always be 100%:

comp_df <- comp_df[comp_df$A != comp_df$B,]

The final result can be sense-checked against our plot (the overlap column is the proportion of the range of the site in column A that is overlapped by the site in column B):

comp_df
#>       A     B   overlap
#> 2 site2 site1 0.7730548
#> 3 site3 site1 0.7308048
#> 4 site1 site2 0.6791139
#> 6 site3 site2 0.6419981
#> 7 site1 site3 1.0000000
#> 8 site2 site3 1.0000000

So, for example, we can see that site 1 and site 2 are each 100% overlapped by site 3, as we can confirm in our plot, whereas about 68% of site 1's range is overlapped by site 2.

Created on 2022-04-25 by the reprex package (v2.0.1)
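
If what you want is the overlap of the smoothed densities from your plot rather than of the observed ranges, one rough option is to estimate both densities on a shared grid and integrate their pointwise minimum. The sketch below is only an illustration under stated assumptions: `density_overlap` is a made-up helper name, it uses base R's default Gaussian kernel and bandwidth (scaled by `adjust`), and the integral is a simple Riemann sum, so the result is approximate and depends on the bandwidth. As noted at the top, it will generally report more overlap than the range-based figures.

```r
# Hypothetical helper: approximate shared area under two kernel density
# estimates (0 = no overlap, close to 1 = near-identical densities)
density_overlap <- function(x, y, adjust = 1, n = 1024) {
  # Shared grid, padded by 3 bandwidths beyond the pooled data range
  bw <- adjust * max(bw.nrd0(x), bw.nrd0(y))
  lo <- min(x, y) - 3 * bw
  hi <- max(x, y) + 3 * bw
  dx <- density(x, adjust = adjust, from = lo, to = hi, n = n)
  dy <- density(y, adjust = adjust, from = lo, to = hi, n = n)
  # Riemann sum of the pointwise minimum of the two curves
  step <- dx$x[2] - dx$x[1]
  sum(pmin(dx$y, dy$y)) * step
}

# site1 and site2 values copied from the question's data
site1 <- c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938,
           0.3287, 0.2992, 0.4062, 0.2946, 0.2671)
site2 <- c(0.3832, 0.3875, 0.3118, 0.4506, 0.4215,
           0.4266, 0.3518, 0.4446, 0.4255, 0.3208)

density_overlap(site1, site2, adjust = 1.5)  # same adjust as the question's plot
```

Unlike the range-based measure, this one is symmetric in its two arguments.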
