I have an assignment question like this: "For 5 countries of your choice, use a group bar chart to compare 'deaths per 100 confirmed cases' in each year since the beginning of the pandemic."
I wrote some code like this:

    COVID_data %>%
      filter(countriesAndTerritories %in% selected_countries) %>%
      drop_na(deaths) %>%
      filter(deaths > 0, cases > 0) %>%
      mutate(d = (deaths * 100) / cases) %>%
      ggplot(aes(x = countriesAndTerritories, y = d, fill = as.factor(year))) +
      geom_bar(position = "dodge", stat = "identity") +
      labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year") +
      ggtitle("Number of Deaths per 100 confirmed cases in each year")

It gives me this output:
[Image: Created output]
But my teacher's output looks like this:
[Image: Expected output]
Our outputs for France and Italy are different. I examined my data and calculated the number of deaths per 100 cases, and my data looks correct, so I couldn't find my mistake. Could you help me?
My data is from this link:
https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country
Comments (1)
The most obvious problem in the code submitted in the question is that it does not correctly aggregate cases and deaths by country and year. Less obvious errors are related to choices made to subset the data, handling of missing values, and the use of `stat = "identity"` as an argument in `geom_bar()`.

Here's a completely reproducible example that reproduces the instructor's chart.
First, we load the data from the European CDC.
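As a rough sketch of the loading step, assuming the data has been downloaded as a CSV from the ECDC page linked in the question: `read_covid_csv()` is a hypothetical helper, and the temporary file below is only a stand-in so the example runs offline. The column names are the ones used in the question's code.

```r
# Hypothetical helper: read a CSV export of the ECDC daily data.
# Blank fields are treated as missing values.
read_covid_csv <- function(path) {
  read.csv(path, stringsAsFactors = FALSE, na.strings = c("", "NA"))
}

# Stand-in file with invented values; replace `tmp` with the path of the
# CSV downloaded from the ECDC page.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "countriesAndTerritories,year,cases,deaths",
  "France,2021,100,2",
  "Norway,2021,50,"
), tmp)

COVID_data <- read_covid_csv(tmp)
str(COVID_data)
```

Note that the Norway row has no death count, so it is read as `NA` rather than zero.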
Next, we create a list of countries to subset that match those in the instructor's chart.
Here we group by country and year, and then aggregate cases & deaths, then we calculate the death rate (deaths per 100 confirmed cases), and save to an output data frame.
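The grouping and aggregation step might look like the following minimal sketch. The data frame is a toy stand-in with invented values, and `selected_countries` holds placeholder countries rather than the instructor's actual list; `summedData` and `deathRate` are the names used in this answer.

```r
library(dplyr)

# Toy stand-in for the ECDC daily data (invented values), including a
# missing death count and a negative case correction.
COVID_data <- data.frame(
  countriesAndTerritories = c("France", "France", "France", "France", "Italy"),
  year   = c(2020, 2020, 2021, 2021, 2020),
  cases  = c(1000, 2000, 4000, -1000, 500),
  deaths = c(30, NA, 40, 20, 20)
)
selected_countries <- c("France", "Italy")  # placeholder selection

summedData <- COVID_data %>%
  filter(countriesAndTerritories %in% selected_countries) %>%
  group_by(countriesAndTerritories, year) %>%
  # Sum the daily reports first; na.rm = TRUE skips missing days
  summarise(
    cases  = sum(cases, na.rm = TRUE),
    deaths = sum(deaths, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Only then compute deaths per 100 confirmed cases
  mutate(deathRate = 100 * deaths / cases)

summedData
```

The result has one row per country and year, which is the shape a grouped bar chart needs.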
We use the `na.rm = TRUE` argument on `sum()` to include as much of the data as possible, since the codebook tells us that these are daily reports of cases and deaths.

If we view the data frame with `View(summedData)`, we see that the death rates are between 1 and 4, as expected.

Having inspected the data, we plot it with `ggplot()`.

Diagnosing the Errors

We'll walk step-by-step through the code of the original post to find the errors, now that we know we are able to reproduce the professor's chart with the data provided from the European CDC.
After reading the data as above, we subset countries and execute the first part of the `dplyr` pipeline, and save to a data frame.

In the RStudio object viewer we see that the resulting data frame has 4,240 observations.
If we summarize the data to look at the average daily cases and average daily deaths, we see that the cases average 12260.9 while deaths average 81.03.
So far, so good, because this means that the average death rate across all the data is less than 1.0 deaths per 100 cases, which makes sense given worldwide reports about COVID mortality rates since March 2020.
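The arithmetic behind that check can be written out directly from the two averages quoted above. This is only a back-of-the-envelope figure: it ignores the fact that missing values may make the two means cover slightly different sets of days.

```r
# Averages reported above for the subset of selected countries
mean_daily_cases  <- 12260.9
mean_daily_deaths <- 81.03

# Approximate overall deaths per 100 confirmed cases
overall_rate <- 100 * mean_daily_deaths / mean_daily_cases
round(overall_rate, 2)  # 0.66
```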
Next, we execute the `tidyr::drop_na()` function and see what happens.

Looks like we've lost 24 observations.
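A small sketch of how this row count can be checked, using invented values. `step0` is a hypothetical name for the data frame before `drop_na()`; `step1` is the result.

```r
library(dplyr)
library(tidyr)

# Toy data (invented): two Norway rows have cases but a missing death count
step0 <- data.frame(
  countriesAndTerritories = c("Norway", "Norway", "Norway", "France"),
  cases  = c(120, 95, 80, 3000),
  deaths = c(NA, 1, NA, 45)
)

step1 <- step0 %>% drop_na(deaths)

# Number of rows removed by drop_na()
rows_dropped <- nrow(step0) - nrow(step1)
rows_dropped

# The rows that disappeared
step0 %>% filter(is.na(deaths))
```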
When we sort by deaths and inspect the output from `step1` in the RStudio data viewer, we find the disappearing observations in Norway. There are 24 days where there were cases recorded but no deaths.

Still, there's nothing to suggest we'd generate a graph where 400 people die for every 100 confirmed COVID cases.
Next, we apply the next operation in the original poster's `dplyr` pipeline and count the rows.

Hmm... we've lost over 980 rows of data now.
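To illustrate how this filter discards rows and shifts the totals, here is a sketch on invented values. `step2`, `rate_before`, and `rate_after` are hypothetical names used only for this example.

```r
library(dplyr)

# Toy data (invented): zero-death days and negative corrections are
# legitimate daily reports
step1 <- data.frame(
  cases  = c(500, 0, 250, -40, 300),
  deaths = c(10, 0, 0, 2, -3)
)

# The filter from the original post
step2 <- step1 %>% filter(deaths > 0, cases > 0)

rows_lost <- nrow(step1) - nrow(step2)
rows_lost

# Deaths per 100 cases from the totals, before and after filtering
rate_before <- 100 * sum(step1$deaths) / sum(step1$cases)
rate_after  <- 100 * sum(step2$deaths) / sum(step2$cases)
c(before = rate_before, after = rate_after)
```

In this toy case, dropping the days with zero deaths removes their cases from the denominator, so the rate computed after filtering is higher than the rate from the full data.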
At this point the code is throwing valid data away, which is going to skew the results. Why? COVID case and death counts are notorious for data corrections over time, so sometimes governments will report negative cases or deaths to correct past over-reporting errors.
Sure enough, our data includes rows with negative values.
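One way to surface those rows is sketched below on a toy data frame. Only the France figure of -348,846 comes from the actual data discussed here; every other value is invented for illustration.

```r
library(dplyr)

step1 <- data.frame(
  countriesAndTerritories = c("France", "France", "Italy", "Norway"),
  year   = c(2021, 2021, 2021, 2020),
  cases  = c(-348846, 21000, 9000, 150),   # first value is the real correction
  deaths = c(17, 120, -5, 0)               # invented
)

# Rows where either count is negative, most negative cases first
negatives <- step1 %>%
  filter(cases < 0 | deaths < 0) %>%
  arrange(cases)
negatives

# Largest single-day downward correction in cases
min(step1$cases)
```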
Wow! One country reported -348,846 cases in one day. That's a major correction.
Inspecting the data again, we see that France is the culprit here. If we were conducting a more serious analysis with this data, the researcher would be obligated to assess the validity of this observation by doing things such as reviewing news reports about COVID case reporting corrections in France during 2021.
Now, when the original code uses `mutate()` to calculate death rates, it does not aggregate by `countriesAndTerritories` or `year`. Therefore, when the code uses `ggplot()` with `geom_bar(stat = "identity")`, the y axis uses the values of the individual observations, 3,231 at this point, and produces unexpected results.

Corrected Version of Original Poster's Chart

Here is the code that correctly analyzes the data, using the five countries selected by the original poster.
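A minimal sketch of such a corrected pipeline follows. It runs on a toy data frame with invented values and two placeholder countries, not on the original poster's five; the column names, labels, and title are taken from the question.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the ECDC data (invented values)
COVID_data <- data.frame(
  countriesAndTerritories = rep(c("France", "Italy"), each = 4),
  year   = rep(c(2020, 2020, 2021, 2021), times = 2),
  cases  = c(1000, 3000, 5000, -1000, 2000, 2000, 4000, 1000),
  deaths = c(40, 80, 60, NA, 70, 50, 45, 5)
)
selected_countries <- c("France", "Italy")  # placeholder selection

# Aggregate by country and year BEFORE computing the rate;
# no drop_na() and no filter on positive values
summedData <- COVID_data %>%
  filter(countriesAndTerritories %in% selected_countries) %>%
  group_by(countriesAndTerritories, year) %>%
  summarise(
    cases  = sum(cases, na.rm = TRUE),
    deaths = sum(deaths, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(deathRate = 100 * deaths / cases)

# One row per bar, so stat = "identity" plots deathRate directly
p <- ggplot(summedData,
            aes(x = countriesAndTerritories, y = deathRate,
                fill = as.factor(year))) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "Year") +
  ggtitle("Number of Deaths per 100 confirmed cases in each year")
print(p)
```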
...and the output, which has death rates between 1 and 3.5%.
Note that the data frame used to generate the chart has only 12 observations, or one observation per number on the chart. This is why `stat = "identity"` can be used with `geom_bar()`. `ggplot()` uses the value of `deathRate` to plot the height of each bar along the y axis.

Conclusions

First, it's important to understand the details of the data we're analyzing, especially when there are plenty of outside references such as worldwide COVID death rates.
Second, it's important to understand what counts as a valid observation in a data set, such as whether a data correction like the one France made in May 2021 is reasonable.
Finally, it's important to conduct a face validity analysis of the results. Is it realistic to expect 400 people to die for every 100 confirmed COVID cases? For a disease with worldwide deaths reported at between 1 and 4% of confirmed cases, probably not.