How to calculate the percent overlap of distributions in R?
I have a dummy dataframe below where I'd like to calculate the pairwise percent overlap between site distributions. Basically, what percent of site1 and site2 are overlapping, site2 vs site3 and site1 vs site3?
library(dplyr)    # the data is a grouped tibble (grouped_df)
library(ggplot2)

df <- structure(list(site = c("site1", "site1", "site1", "site1", "site1",
"site1", "site1", "site1", "site1", "site1", "site2", "site2",
"site2", "site2", "site2", "site2", "site2", "site2", "site2",
"site2", "site3", "site3", "site3", "site3", "site3", "site3",
"site3", "site3", "site3", "site3"), total = c(0.4191, 0.2844,
0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671,
0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446,
0.4255, 0.3208, 0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539,
0.357, 0.2865, 0.3624, 0.3026)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
site = c("site1", "site2", "site3"), .rows = structure(list(
1:10, 11:20, 21:30), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))

# Overlaid density curves, one per site
ggplot(df, aes(x = total, group = site, fill = site)) +
  geom_density(adjust = 1.5, alpha = 0.3)
Comments (1)
Your density plot is perhaps a little misleading, since a density plot will extend outside the actual range of the data on the x axis, and tend to give a much higher estimate for the overlap than actually exists in your data. A better visualization might be:
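One way to draw such a plot, offered as a minimal sketch rather than the answer's original figure, is to show each site's points together with a segment spanning its observed range. The plain `data.frame` copy of the question's data, the `ranges` helper table, and the choice of geoms are all assumptions made for illustration.

```r
library(ggplot2)

# Ungrouped copy of the question's data (assumption: grouping is not needed here).
df <- data.frame(
  site = rep(c("site1", "site2", "site3"), each = 10),
  total = c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671,
            0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446, 0.4255, 0.3208,
            0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539, 0.357, 0.2865, 0.3624, 0.3026)
)

# Hypothetical helper table: observed minimum and maximum per site.
sites <- unique(df$site)
ranges <- data.frame(
  site = sites,
  lo = sapply(sites, function(s) min(df$total[df$site == s])),
  hi = sapply(sites, function(s) max(df$total[df$site == s]))
)

# One horizontal segment per site covering its range, with the raw points on top.
p <- ggplot() +
  geom_segment(data = ranges,
               aes(x = lo, xend = hi, y = site, yend = site, colour = site),
               alpha = 0.4) +
  geom_point(data = df, aes(x = total, y = site, colour = site))
print(p)
```

Unlike a smoothed density, the segments stop exactly at the smallest and largest observed values, so the visible overlap matches what a range-based calculation reports.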
One approach to creating pairwise comparisons is to use expand.grid to get all unique pairs of sites.
Then we need a function that will take the names of two sites and calculate the percentage overlap between their ranges. I'm doing this here in a rather pedestrian way using simple arithmetic.
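These two steps could look roughly like the sketch below. It is not the answer's original code: the function name `range_overlap`, the column names `A` and `B`, and the plain `data.frame` copy of the data are assumptions, and overlap is taken to mean the share of site A's min-to-max range that also lies inside site B's range.

```r
# Ungrouped copy of the question's data (assumption for self-containment).
df <- data.frame(
  site = rep(c("site1", "site2", "site3"), each = 10),
  total = c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671,
            0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446, 0.4255, 0.3208,
            0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539, 0.357, 0.2865, 0.3624, 0.3026)
)

# expand.grid() returns every ordered pair of sites, self-pairs included.
sites <- unique(df$site)
comparisons <- expand.grid(A = sites, B = sites, stringsAsFactors = FALSE)

# Hypothetical helper: percentage of site a's range that is covered by site b's range.
range_overlap <- function(a, b, data = df) {
  ra <- range(data$total[data$site == a])
  rb <- range(data$total[data$site == b])
  # Length of the intersection of the two intervals; negative means they are disjoint.
  shared <- min(ra[2], rb[2]) - max(ra[1], rb[1])
  # diff(ra) is zero when a site has a single distinct value, which this sketch does not handle.
  100 * max(shared, 0) / diff(ra)
}

range_overlap("site1", "site2")
range_overlap("site1", "site3")
```

Note that the measure is not symmetric: the denominator is the width of the first site's range, so swapping the arguments generally changes the result.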
Now we can Map this function to the rows of our comparison data frame so that we get a pairwise estimate for each unique pair of sites.
Finally, we want to remove the entries where a site is tested for overlap with itself, since this will always be 100%.
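As a self-contained sketch of the mapping and filtering steps (again with the assumed names `range_overlap`, `A`, `B`, and `overlap`, and a plain `data.frame` copy of the data), `Map()` can walk the two site columns in parallel:

```r
# Ungrouped copy of the question's data (assumption for self-containment).
df <- data.frame(
  site = rep(c("site1", "site2", "site3"), each = 10),
  total = c(0.4191, 0.2844, 0.2611, 0.2743, 0.2938, 0.3287, 0.2992, 0.4062, 0.2946, 0.2671,
            0.3832, 0.3875, 0.3118, 0.4506, 0.4215, 0.4266, 0.3518, 0.4446, 0.4255, 0.3208,
            0.2377, 0.2818, 0.2526, 0.2425, 0.2973, 0.4539, 0.357, 0.2865, 0.3624, 0.3026)
)

# Hypothetical helper: percentage of site a's range covered by site b's range.
range_overlap <- function(a, b) {
  ra <- range(df$total[df$site == a])
  rb <- range(df$total[df$site == b])
  100 * max(min(ra[2], rb[2]) - max(ra[1], rb[1]), 0) / diff(ra)
}

sites <- unique(df$site)
comparisons <- expand.grid(A = sites, B = sites, stringsAsFactors = FALSE)

# Map() calls range_overlap once per row, pairing the A and B columns element by element.
comparisons$overlap <- unlist(Map(range_overlap, comparisons$A, comparisons$B),
                              use.names = FALSE)

# Drop self-comparisons, which are always 100%.
comparisons <- comparisons[comparisons$A != comparisons$B, ]
print(comparisons)
```

Six rows remain, one for each ordered pair of distinct sites.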
The final result can be sense-checked against our plot, and can be seen to make sense (the overlap column is the proportion of the site in column A that is overlapped by the site in column B).
So, for example, we can see that site 1 and site 2 are 100% overlapped by site 3, as we can confirm in our plot, whereas site 1 is about 68% overlapped by site 2.
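The 68% figure can be reproduced by hand, assuming overlap is measured on each site's observed minimum-to-maximum range. From the data in the question, site 1 spans [0.2611, 0.4191], site 2 spans [0.3118, 0.4506], and site 3 spans [0.2377, 0.4539]. The notation below is introduced here for illustration only.

```latex
\[
\text{overlap}(A \mid B)
  = \frac{\max\bigl(0,\ \min(a_{\max}, b_{\max}) - \max(a_{\min}, b_{\min})\bigr)}
         {a_{\max} - a_{\min}}
\]
\[
\text{overlap}(\text{site1} \mid \text{site2})
  = \frac{0.4191 - 0.3118}{0.4191 - 0.2611}
  = \frac{0.1073}{0.1580}
  \approx 0.679
\]
```

Site 3's range contains both of the other ranges, so the numerator equals the denominator in those two comparisons and the overlap is 100%.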
Created on 2022-04-25 by the reprex package (v2.0.1)