Why do I get different results when I compute this mean in slightly different ways?
In the first block of code, I am grouping by VAR1 and I get a mean of VAR2 for each of the two categories of VAR1. However, when I select only the category of interest from VAR1, in order to obtain the mean of VAR2 for the people who fall into category 1 of VAR1, I get a different result.

What am I doing wrong?
DF %>%
  na.omit() %>%
  group_by(VAR1) %>%
  summarise(mean(VAR2))
# A tibble: 2 × 2
   VAR1 `mean(VAR2)`
  <dbl>        <dbl>
1     0         12.1
2     1         11.6
> mean(DF[DF$VAR1 == 1, 'VAR2',],na.rm=TRUE)
[1] 11.95238
Comments (1)
The discrepancy in the mean values of VAR2 in those cases probably comes from the presence of NAs in another column, as @DanY explained briefly. I will try to elaborate on this through the following example.

Suppose we have a DF that consists of three columns: two of them, mpg and cyl, come from mtcars, and the third is VAR1. Suppose there are 15 rows with VAR1 equal to 0.

Now, let's change three values of cyl, in rows where VAR1 is 0, into NA. DF still has 15 values of mpg with VAR1 equal to 0, because neither mpg nor VAR1 contains any NA.

Now we compare the mean of mpg in the two cases you compared, and the two means differ. This is because na.omit() in DF %>% na.omit() operates on the whole of DF, so an NA in any column of a row causes that entire row to be dropped. That is why the three rows in which cyl is NA are omitted, leaving the mpg column with only 12 values, of which the mean is then computed.

On the other hand, DF[DF$VAR1 == 0, 'mpg',] operates only on the mpg column and has nothing to do with cyl. That is why it has 15 values of mpg with no NA, so na.rm does not omit anything.
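The code from this example did not survive on this page, so here is a minimal sketch of how the scenario could be reproduced. It is a reconstruction under assumptions, not the answerer's original code: the layout of VAR1 (15 zeros followed by 17 ones) and the three rows chosen to receive NA are invented for illustration. Base R's tapply() stands in for the group_by() / summarise() pipeline so that the block runs without dplyr.

```r
# Hypothetical reconstruction of the example: mpg and cyl from mtcars,
# plus an invented grouping column VAR1 (15 zeros, then 17 ones).
DF <- mtcars[, c("mpg", "cyl")]
DF$VAR1 <- rep(c(0, 1), times = c(15, 17))

# Put NA into cyl for three rows that have VAR1 == 0 (row choice is arbitrary).
DF$cyl[c(2, 5, 9)] <- NA

# Approach 1: drop every row with an NA in any column, then average by group.
# Base-R counterpart of DF %>% na.omit() %>% group_by(VAR1) %>% summarise(mean(mpg)).
complete <- na.omit(DF)
n_complete <- sum(complete$VAR1 == 0)
mean_complete <- unname(tapply(complete$mpg, complete$VAR1, mean)["0"])

# Approach 2: subset the mpg column directly; NAs in cyl are never looked at.
mpg_subset <- DF[DF$VAR1 == 0, "mpg"]
n_subset <- length(mpg_subset)
mean_subset <- mean(mpg_subset, na.rm = TRUE)

# Possible fix: apply na.omit() only to the columns used in the summary.
used <- na.omit(DF[, c("VAR1", "mpg")])
mean_fixed <- unname(tapply(used$mpg, used$VAR1, mean)["0"])

print(c(n_complete = n_complete, n_subset = n_subset))
print(c(complete = mean_complete, subset = mean_subset, fixed = mean_fixed))
```

With this setup the first approach averages 12 values of mpg and the second averages 15, which is why the two means disagree. Restricting na.omit() to the columns that actually enter the summary, as in the last step, should bring the grouped mean back in line with the direct subset.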