How to group and summarize a column by two other columns

Posted on 2025-02-04 14:08:07

I have a question like this: "For 5 countries of your choice, use a grouped bar chart to compare 'deaths per 100 confirmed cases' in each year since the beginning of the pandemic."
I wrote some code like this:

COVID_data%>%
 filter(countriesAndTerritories%in%selected_countries)%>%
  drop_na(deaths)%>%
  filter(deaths>0, cases>0)%>%
  mutate(d =(deaths*100)/cases)%>%
  ggplot(aes(x=countriesAndTerritories, y=d, fill=as.factor(year)))+
  geom_bar(position = "dodge", stat = "identity")+
  labs(x="Countries",y="Deaths Per 100 Cases", fill="year")+
  ggtitle("Number of Deaths per 100 confirmed cases in each year")

It gives me this output:
Created output

but my teacher's output looks like this:

Expected output

Our outputs for France and Italy are different. I examined my data and calculated the number of deaths per 100 cases, and my data looks correct, but I couldn't find my mistake. Could you help me?
My data is from this link:
https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country


Comments (1)

何以笙箫默 2025-02-11 14:08:07

The most obvious problem in the code submitted in the question is that it does not correctly aggregate cases and deaths by country and year. Less obvious errors are related to choices made to subset the data, handling of missing values, and the use of stat = "identity" as an argument in geom_bar().

Here's a completely reproducible example that reproduces the instructor's chart.

First, we load the data from the European CDC.

library(ggplot2)
library(dplyr)
library(tidyr)

data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", 
                 na.strings = "", fileEncoding = "UTF-8-BOM")

Next, we create a list of countries to subset that match those in the instructor's chart.

# select some countries
countryList <- c("France","Italy","Germany","Poland","Romania")

Here we group by country and year, and then aggregate cases & deaths, then we calculate the death rate (deaths per 100 confirmed cases), and save to an output data frame.

data %>%
     filter(countriesAndTerritories %in% countryList) %>%
     group_by(countriesAndTerritories,year) %>% 
     summarise(cases = sum(cases,na.rm=TRUE),
               deaths = sum(deaths,na.rm=TRUE)) %>%
     mutate(deathRate = deaths / (cases / 100)) -> summedData

We use the na.rm = TRUE argument on sum() to include as much of the data as possible, since the codebook tells us that these are daily reports of cases and deaths.
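As a small illustration of why na.rm = TRUE matters here, consider a made-up vector of daily death counts with one missing report. The values are invented for the sketch and are not taken from the ECDC file.

```r
# Hypothetical daily death counts; the second day has no report (NA)
daily_deaths <- c(3, NA, 5, 2)

# Without na.rm, a single NA makes the whole total NA
total_default <- sum(daily_deaths)

# With na.rm = TRUE, the missing day is skipped and the rest are summed
total_na_rm <- sum(daily_deaths, na.rm = TRUE)

print(total_default)
print(total_na_rm)
```

Skipping the NA inside sum() keeps the rest of that country's year intact, which is the behaviour the pipeline above relies on.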

If we view the data frame with View(summedData), we see that the death rates are between 1 and 4, as expected.


Having inspected the data, we plot it with ggplot().


Diagnosing the Errors

Now that we know we can reproduce the instructor's chart with the data provided by the European CDC, we'll walk step by step through the code of the original post to find the errors.

After reading the data as above, we subset countries and execute the first part of the dplyr pipeline, and save to a data frame.

selected_countries <- c("France","Italy","Norway","Sweden","Finland")

data%>%
     filter(countriesAndTerritories%in%selected_countries) -> step1

In the RStudio object viewer we see that the resulting data frame has 4,240 observations.

If we summarize the data to look at the average daily cases and average daily deaths, we see that the cases average 12260.9 while deaths average 81.03.

> mean(step1$cases,na.rm=TRUE)   
[1] 12260.9
> mean(step1$deaths,na.rm=TRUE) 
[1] 81.03202

So far, so good, because this means that the average death rate across all the data is less than 1.0 per 100 cases, which makes sense given worldwide reports about COVID mortality rates since March 2020.

> mean(step1$deaths,na.rm=TRUE) / (mean(step1$cases,na.rm=TRUE) /100)
[1] 0.6608977

Next, we execute the tidyr::drop_na() function and see what happens.

library(tidyr)
step1 %>%     drop_na(deaths) -> step2 
nrow(step2)

Looks like we've lost 24 observations.

> nrow(step2)
[1] 4216

When we sort by deaths and inspect the output from step1 in the RStudio data viewer, we find that the disappearing observations all belong to Norway. There are 24 days where cases were recorded but the deaths value is missing.
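A minimal sketch of what drop_na() does to such rows, using an invented three-row data frame (the column names mirror the ECDC file, the numbers do not). Because the whole row is removed, the cases recorded on that day disappear from the denominator as well.

```r
library(tidyr)

# Hypothetical data: the second day has cases but a missing deaths value
toy <- data.frame(
  country = c("Norway", "Norway", "Norway"),
  cases   = c(10, 20, 30),
  deaths  = c(1, NA, 2)
)

# drop_na(deaths) removes the entire row where deaths is NA
dropped <- drop_na(toy, deaths)

# Total cases after dropping the row versus keeping it
cases_after_drop <- sum(dropped$cases)
cases_kept       <- sum(toy$cases, na.rm = TRUE)

# Deaths per 100 cases under each approach
rate_after_drop <- 100 * sum(dropped$deaths) / cases_after_drop
rate_kept       <- 100 * sum(toy$deaths, na.rm = TRUE) / cases_kept

print(c(rate_after_drop = rate_after_drop, rate_kept = rate_kept))
```

In this toy case the rate is higher after drop_na() because 20 valid cases were thrown away along with the missing deaths value.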


Still, there's nothing to suggest we'd generate a graph where 400 people die for every 100 confirmed COVID cases.

Next, we apply the next operation in the original poster's dplyr pipeline and count the rows.

step2 %>%    filter(deaths>0, cases>0) -> step3 
nrow(step3)

Hmm... we've lost another 985 rows of data now.

> nrow(step3)
[1] 3231

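At this point the code is throwing valid data away, which is going to skew the results. Why? First, the filter drops every day with zero reported deaths, even when cases were reported on that day. Second, COVID case and death counts are notorious for data corrections over time, so sometimes governments will report negative cases or deaths to correct past over-reporting errors, and the filter drops those corrections too.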
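To make the skew concrete, here is a sketch on invented numbers (one hypothetical country "A", four days) comparing the death rate computed from all rows with the rate computed after the filter(deaths>0, cases>0) step.

```r
library(dplyr)

# Hypothetical daily reports: one zero-death day and one negative correction
toy <- data.frame(
  country = "A",
  year    = 2021,
  cases   = c(1000, 1200, -500, 800),
  deaths  = c(10, 0, 5, 8)
)

# Deaths per 100 cases from totals (helper name is illustrative only)
rate_per_100 <- function(df) 100 * sum(df$deaths) / sum(df$cases)

# Rate using every reported row, corrections included
rate_all <- rate_per_100(toy)

# Rate after the filter used in the original post
kept          <- toy %>% filter(deaths > 0, cases > 0)
rate_filtered <- rate_per_100(kept)

print(c(rate_all = rate_all, rate_filtered = rate_filtered))
```

Only two of the four rows survive the filter, and the totals behind the rate change with them, so the filtered rate no longer describes the same set of reports.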

Sure enough, our data includes rows with negative values.

> summary(step1$cases)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-348846.0     275.5    1114.0   12260.9    7570.5  501635.0         1 
> summary(step1$deaths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-217.00    1.00   11.00   81.03   76.00 2004.00      24 
>

Wow! One country reported -348,846 cases in one day. That's a major correction.

Inspecting the data again, we see that France is the culprit here. If we were conducting a more serious analysis with this data, the researcher would be obligated to assess the validity of this observation, for example by reviewing news reports about corrections to COVID case reporting in France during 2021.


Now, when the original code uses mutate() to calculate death rates, it does not aggregate by countriesAndTerritories or year.

step3 %>%  mutate(d =(deaths*100)/cases) -> step4 

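Therefore, when the code uses ggplot() with geom_bar(stat = "identity"), the y axis uses the values of the individual daily observations (3,231 rows at this point) rather than one value per country and year, and produces unexpected results. With position = "dodge", the bars for the same country and year are drawn on top of one another, so the visible height is in effect the largest single-day ratio.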
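The difference between a per-row ratio and a ratio of grouped totals can be sketched on a small invented data frame (countries "A" and "B" and all numbers are hypothetical). One day with very few cases and a batch of delayed death reports is enough to produce a daily value of 400.

```r
library(dplyr)

# Hypothetical daily reports for two countries in one year
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  year    = c(2021, 2021, 2021, 2021, 2021),
  cases   = c(100, 2, 300, 50, 150),
  deaths  = c(1, 8, 3, 1, 1)
)

# What the original post does: one ratio per daily row
per_row <- toy %>%
  mutate(d = deaths * 100 / cases)

# What the question asks for: one ratio per country and year
per_group <- toy %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases), deaths = sum(deaths), .groups = "drop") %>%
  mutate(deathRate = deaths * 100 / cases)

print(per_row)
print(per_group)
```

The per-row version still has five rows and a maximum of 400, while the grouped version has one row per country and year with rates in a plausible range.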

Corrected Version of Original Poster's Chart

Here is the code that correctly analyzes the data, using the five countries selected by the original poster.

countryList <- c("France","Italy","Norway","Sweden","Finland")
data %>%
     filter(countriesAndTerritories %in% countryList) %>%
     group_by(countriesAndTerritories,year) %>% 
     summarise(cases = sum(cases,na.rm=TRUE),
               deaths = sum(deaths,na.rm=TRUE)) %>%
     mutate(deathRate = deaths / (cases / 100)) -> summedData

# plot the results

ggplot(data = summedData,
       aes(x=countriesAndTerritories, y=deathRate, fill=as.factor(year)))+
     geom_bar(position = "dodge", stat = "identity")+
     labs(x="Countries",y="Deaths Per 100 Cases", fill="year")+
     ggtitle("Number of Deaths per 100 confirmed cases in each year")

...and the output, which has death rates between 1 and 3.5%.


Note that the data frame used to generate the chart has only one observation per country and year, that is, one observation per bar on the chart. This is why stat = "identity" can be used with geom_bar(): ggplot() uses the value of deathRate to plot the height of each bar along the y axis.
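One way to check this condition before plotting is to count rows per bar. The sketch below uses an invented summary table (not the ECDC results) and also uses geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# Hypothetical summarised data: one row per country and year
summed <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2020, 2021, 2020, 2021),
  deathRate = c(2.5, 1.2, 3.1, 0.9)
)

# Count rows for each bar; every count should be exactly 1
rows_per_bar    <- summed %>% count(country, year)
one_row_per_bar <- all(rows_per_bar$n == 1)
print(one_row_per_bar)

# Build the grouped bar chart; bar height is taken directly from deathRate
p <- ggplot(summed, aes(x = country, y = deathRate, fill = as.factor(year))) +
  geom_col(position = "dodge") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year")
```

If the check returns FALSE, the data still contains several rows per bar and needs to be aggregated before it is plotted this way.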

Conclusions

First, it's important to understand the details of the data we're analyzing, especially when there are plenty of outside references such as worldwide COVID death rates.

Second, it's important to understand what counts as a valid observation in a data set, such as whether a data correction like the one France made in May 2021 is reasonable.

Finally, it's important to conduct a face validity analysis of the results. Is it realistic to expect 400 people to die for every 100 confirmed COVID cases? For a disease with worldwide reported deaths between 1 and 4% of confirmed cases, probably not.
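A face validity check can also be written down as a small guard. The sketch below is only an illustration: flag_implausible() is a hypothetical helper, the bounds of 0 and 100 are an assumption (deaths per 100 cases cannot sensibly fall outside them), and the rates are invented.

```r
# Hypothetical helper: TRUE for rates that are missing or outside [lower, upper]
flag_implausible <- function(rate, lower = 0, upper = 100) {
  is.na(rate) | rate < lower | rate > upper
}

# Invented deaths-per-100-cases values; the last one mimics the suspicious chart
rates <- c(France = 1.8, Italy = 2.9, Broken = 400)

flagged <- flag_implausible(rates)

# Names of the entries that deserve a second look
print(names(rates)[flagged])
```

A tighter upper bound could be used when outside references, such as published COVID mortality rates, give a narrower expected range.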
