如何添加错误条并自定义堆叠条形图

发布于 2025-02-09 05:54:54 字数 1547 浏览 1 评论 0原文

我有一个数据集和堆叠的条形图，类似于下面所示。该图是按照治疗类型（TRT = 1或2）随时间流逝的特定疾病的患病率（否，是1）。

我想知道是否有人可以帮助我：

在每个访问号码下为样本量添加n。例如，访问= 0，n = 7，在访问= 1，n = 7等。
用置信区间的CVD患病率和误差线的流行率进行注释。我希望患病率估计具有95％的置信区间，以更好地理解治疗1和2之间是否存在差异。
获取单独的表格，其中具有置信区间的实际数字。
检查/显示的最佳方法是什么，是否

a）随着时间的变化是否具有统计学意义 b）TRT 1和2之间是否存在统计学上的显着差异？

任何帮助/指导将不胜感激。

谢谢！

这是一个模拟数据集，类似于我正在从事的数据集，也是我用来生成堆叠图的一些代码。

ID = c(001, 001, 001, 001, 001, 002, 002, 002, 003, 003, 004, 004, 004, 004, 005, 005, 006, 007, 007, 007, 008, 008, 008, 008, 008, 009, 009, 009, 009, 009)
Visit = c(00, 01, 02, 03, 04, 00, 01, 02,  01, 02, 00, 01, 02, 03, 00, 02, 00, 01, 02, 04, 00, 01, 02, 03, 04, 00, 01, 02, 03, 04)
CVD = c(0, 0, 1, 1, 1, 1, 1, 0, 1, 1,  1, 0, 0, 1, 1,  0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0)
TRT= c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
BIO=c(12.00, 11.9, 15.24, 13.10, 30.01, 45.90, 20.09, 23.45, 14.78, 18.05,
      24.23, 12.34, 80.01, 13.98, 12.50, 36.95, 29.00, 39.87, 19.03, 11.48,
      14.14, 28.06, 12.22, 72.08, 15.00, 11.33, 58.00, 17.71, 52.08, 15.25)
df<-data.frame(ID,Visit, CVD, TRT, BIO)
percentData <- df %>% group_by(Visit, TRT) %>% count(CVD) %>%
 mutate(ratio=scales::percent(n/sum(n))) %>% 
  ungroup()
ggplot(df,aes(x=Visit,fill=factor(CVD)))+
  geom_bar(position="fill")+ facet_wrap(~TRT)+
  geom_text(data=percentData, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5))+
  ylab('Percentage')

原文

I have a dataset and stacked bar chart similar to what’s shown below. The chart is for prevalence of a specific disease (CVD=0 for no, 1 for yes) over time, by treatment type (trt=1 or 2).

I am wondering if someone could please help me with:

Adding N for sample size under each visit number. For example, under visit=0, N=7, under visit=1, N=7, and so on.
Annotate the graph with point estimate for prevalence of CVD and error bars for the confidence intervals? I would like the prevalence estimates to have 95% confidence intervals to better appreciate whether there was difference between treatment 1 and 2.
get a separate table with the actual numbers for the confidence intervals.
what is the best way to check/show if

a) change over time was statistically significant
b) are there statistically significant differences between trt 1 and 2?

Any help/guidance would be appreciated.

Thanks!

Here’s a mock dataset similar to what I am working on and some code I’ve used to generate the stacked plot.

ID = c(001, 001, 001, 001, 001, 002, 002, 002, 003, 003, 004, 004, 004, 004, 005, 005, 006, 007, 007, 007, 008, 008, 008, 008, 008, 009, 009, 009, 009, 009)
Visit = c(00, 01, 02, 03, 04, 00, 01, 02,  01, 02, 00, 01, 02, 03, 00, 02, 00, 01, 02, 04, 00, 01, 02, 03, 04, 00, 01, 02, 03, 04)
CVD = c(0, 0, 1, 1, 1, 1, 1, 0, 1, 1,  1, 0, 0, 1, 1,  0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0)
TRT= c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
BIO=c(12.00, 11.9, 15.24, 13.10, 30.01, 45.90, 20.09, 23.45, 14.78, 18.05,
      24.23, 12.34, 80.01, 13.98, 12.50, 36.95, 29.00, 39.87, 19.03, 11.48,
      14.14, 28.06, 12.22, 72.08, 15.00, 11.33, 58.00, 17.71, 52.08, 15.25)
df<-data.frame(ID,Visit, CVD, TRT, BIO)
percentData <- df %>% group_by(Visit, TRT) %>% count(CVD) %>%
 mutate(ratio=scales::percent(n/sum(n))) %>% 
  ungroup()
ggplot(df,aes(x=Visit,fill=factor(CVD)))+
  geom_bar(position="fill")+ facet_wrap(~TRT)+
  geom_text(data=percentData, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5))+
  ylab('Percentage')

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

扎心 2025-02-16 05:54:54

这是一个问题要问的很多。我解决的方法是首先生成一个数据框架，其中所有绘制 and 表格的数字结果：

library(tidyverse)

df2 <- df %>% 
  group_by(Visit, TRT) %>% 
  summarize(no_CVD = table(factor(CVD, 0:1))[1],
            CVD = table(factor(CVD, 0:1))[2],
            n = no_CVD + CVD,
            prop = CVD / n,
            lower = suppressWarnings(prop.test(CVD, n)$conf.int[1]),
            upper = suppressWarnings(prop.test(CVD, n)$conf.int[2])) %>%
  mutate(Percent = scales::percent(prop, 0.1),
         CI = paste0("(", scales::percent(lower, 0.1), " - ",
                     scales::percent(upper, 0.1), ")"),
         TRT = paste("Treatment", TRT))

您可以看到这给我们带来了点估计和置信区间，因为这两者都是RAW比例和作为文本标签，应该足以满足问题3：

df2
#> # A tibble: 10 x 10
#> # Groups:   Visit [5]
#>    Visit TRT         no_CVD   CVD     n  prop  lower upper Percent CI           
#>    <dbl> <chr>        <int> <int> <int> <dbl>  <dbl> <dbl> <chr>   <chr>        
#>  1     0 Treatment 1      3     1     4 0.25  0.0132 0.781 25.0%   (1.3% - 78.1~
#>  2     0 Treatment 2      0     3     3 1     0.310  1     100.0%  (31.0% - 100~
#>  3     1 Treatment 1      2     2     4 0.5   0.150  0.850 50.0%   (15.0% - 85.~
#>  4     1 Treatment 2      2     1     3 0.333 0.0177 0.875 33.3%   (1.8% - 87.5~
#>  5     2 Treatment 1      1     3     4 0.75  0.219  0.987 75.0%   (21.9% - 98.~
#>  6     2 Treatment 2      3     1     4 0.25  0.0132 0.781 25.0%   (1.3% - 78.1~
#>  7     3 Treatment 1      0     2     2 1     0.198  1     100.0%  (19.8% - 100~
#>  8     3 Treatment 2      1     1     2 0.5   0.0945 0.905 50.0%   (9.5% - 90.5~
#>  9     4 Treatment 1      1     2     3 0.667 0.125  0.982 66.7%   (12.5% - 98.~
#> 10     4 Treatment 2      1     0     1 0     0      0.945 0.0%    (0.0% - 94.5~

要将N放在每个访问下，我们可以简单地paste带有包含该表计数的字符串的访问号1）。

问题2有点奇怪。 CVD的样本患病率已经存在于您的情节上（这就是蓝条所示）。在堆叠的条形图中添加错误条将看起来很混乱，并且在这种情况下，栏实际上并没有提供任何简单的信息。因此，我更改了图表以显示错误条。

ggplot(df2, aes(paste0(Visit, "\n(n = ", n, ")"), prop)) +
  geom_errorbar(aes(ymin = lower, ymax = upper, color = factor(TRT)), 
                size = 1, width = 0.25) +
  geom_point(size = 2) +
  geom_text(aes(label = paste(Percent, CI, sep = "\n"), y = 1.1), size = 3) +
  facet_grid(.~TRT, scales = "free_x") +
  scale_y_continuous(labels = scales::percent, name = "CVD Prevalance") +
  xlab("Visit") +
  theme_light(base_size = 16) +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Set1")

您的最后两个点是统计的，而不是编程问题，因此在这里较低。您可能需要一个二项式通用线性混合模型来处理重复的措施，但是您应该在

^由

This is quite a lot to ask in a single question. The way I would tackle this is to first of all generate a data frame with all the numbers you will need for plotting and tabling results:

library(tidyverse)

df2 <- df %>% 
  group_by(Visit, TRT) %>% 
  summarize(no_CVD = table(factor(CVD, 0:1))[1],
            CVD = table(factor(CVD, 0:1))[2],
            n = no_CVD + CVD,
            prop = CVD / n,
            lower = suppressWarnings(prop.test(CVD, n)$conf.int[1]),
            upper = suppressWarnings(prop.test(CVD, n)$conf.int[2])) %>%
  mutate(Percent = scales::percent(prop, 0.1),
         CI = paste0("(", scales::percent(lower, 0.1), " - ",
                     scales::percent(upper, 0.1), ")"),
         TRT = paste("Treatment", TRT))

You can see this gives us the point estimates and confidence intervals as both raw proportions and as text labels, which should be adequate to fulfill question 3:

df2
#> # A tibble: 10 x 10
#> # Groups:   Visit [5]
#>    Visit TRT         no_CVD   CVD     n  prop  lower upper Percent CI           
#>    <dbl> <chr>        <int> <int> <int> <dbl>  <dbl> <dbl> <chr>   <chr>        
#>  1     0 Treatment 1      3     1     4 0.25  0.0132 0.781 25.0%   (1.3% - 78.1~
#>  2     0 Treatment 2      0     3     3 1     0.310  1     100.0%  (31.0% - 100~
#>  3     1 Treatment 1      2     2     4 0.5   0.150  0.850 50.0%   (15.0% - 85.~
#>  4     1 Treatment 2      2     1     3 0.333 0.0177 0.875 33.3%   (1.8% - 87.5~
#>  5     2 Treatment 1      1     3     4 0.75  0.219  0.987 75.0%   (21.9% - 98.~
#>  6     2 Treatment 2      3     1     4 0.25  0.0132 0.781 25.0%   (1.3% - 78.1~
#>  7     3 Treatment 1      0     2     2 1     0.198  1     100.0%  (19.8% - 100~
#>  8     3 Treatment 2      1     1     2 0.5   0.0945 0.905 50.0%   (9.5% - 90.5~
#>  9     4 Treatment 1      1     2     3 0.667 0.125  0.982 66.7%   (12.5% - 98.~
#> 10     4 Treatment 2      1     0     1 0     0      0.945 0.0%    (0.0% - 94.5~

To put the N under each visit, we can simply paste the visit number with a string containing the counts from this table (this handles question 1).

Question 2 is a little odd. The sample prevalence of CVD was already present on your plot (that's what the blue bars showed). Adding error bars to a stacked bar graph is going to look messy, and the bars don't actually give any information that a simple point does in this scenario. I have therefore changed the plot to show points with error bars.

ggplot(df2, aes(paste0(Visit, "\n(n = ", n, ")"), prop)) +
  geom_errorbar(aes(ymin = lower, ymax = upper, color = factor(TRT)), 
                size = 1, width = 0.25) +
  geom_point(size = 2) +
  geom_text(aes(label = paste(Percent, CI, sep = "\n"), y = 1.1), size = 3) +
  facet_grid(.~TRT, scales = "free_x") +
  scale_y_continuous(labels = scales::percent, name = "CVD Prevalance") +
  xlab("Visit") +
  theme_light(base_size = 16) +
  theme(legend.position = "none") +
  scale_color_brewer(palette = "Set1")

Your last two points are statistical rather than programming questions, so are off-topic here. It is likely that you need a binomial generalized linear mixed model to handle the repeated measures, but you should post that question on CrossValidated, our sister site for stats questions if you do not have access to a statistician.

^{Created on 2022-06-21 by the reprex package (v2.0.1)}

回复收藏 0 原文

~没有更多了~