Why do I get different results when I run this mean in slightly different ways?

Posted on 2025-01-11 17:19:10

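In the first block of code, I group by VAR1 and get a mean of VAR2 for each of the two categories of VAR1. However, when I select only the category of interest from VAR1, in order to get the mean of VAR2 for people who fall into category 1 of VAR1, I get a different result.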

What am I doing wrong?

 DF %>%
   na.omit() %>%
   group_by(VAR1) %>% 
   summarise(mean(VAR2))

# A tibble: 2 × 2
   VAR1 `mean(VAR2)`
  <dbl>        <dbl>
1     0         12.1
2     1         11.6

> mean(DF[DF$VAR1 == 1, 'VAR2',],na.rm=TRUE)

[1] 11.95238

Comments (1)

故人爱我别走 2025-01-18 17:19:10

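The discrepancy between the mean values of VAR2 in those cases probably comes from the presence of NAs in another column, as @DanY explained briefly. I will try to elaborate on this with the following example. Suppose we have a DF with three columns: two taken from mtcars (mpg and cyl) plus VAR1. Suppose 15 rows have VAR1 equal to 0 and the remaining 17 rows have VAR1 equal to 1.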

library(dplyr) 
DF <- mtcars %>% select(mpg, cyl)
DF$VAR1 <- c(rep(0, 15),rep(1,17))
DF
#                     mpg cyl VAR1
#Mazda RX4           21.0   6    0
#Mazda RX4 Wag       21.0   6    0
#Datsun 710          22.8   4    0
#Hornet 4 Drive      21.4   6    0
#Hornet Sportabout   18.7   8    0
#Valiant             18.1   6    0
#Duster 360          14.3   8    0
#Merc 240D           24.4   4    0
#Merc 230            22.8   4    0
#Merc 280            19.2   6    0
#Merc 280C           17.8   6    0
#Merc 450SE          16.4   8    0
#Merc 450SL          17.3   8    0
#Merc 450SLC         15.2   8    0
#Cadillac Fleetwood  10.4   8    0
#Lincoln Continental 10.4   8    1
#Chrysler Imperial   14.7   8    1
#Fiat 128            32.4   4    1
#Honda Civic         30.4   4    1
#Toyota Corolla      33.9   4    1
#Toyota Corona       21.5   4    1
#Dodge Challenger    15.5   8    1
#AMC Javelin         15.2   8    1
#Camaro Z28          13.3   8    1
#Pontiac Firebird    19.2   8    1
#Fiat X1-9           27.3   4    1
#Porsche 914-2       26.0   4    1
#Lotus Europa        30.4   4    1
#Ford Pantera L      15.8   8    1
#Ferrari Dino        19.7   6    1
#Maserati Bora       15.0   8    1
#Volvo 142E          21.4   4    1

Now, let's change three values of cyl to NA.

DF$cyl[3:5] <- NA
head(DF)
#                   mpg cyl VAR1
#Mazda RX4         21.0   6    0
#Mazda RX4 Wag     21.0   6    0
#Datsun 710        22.8  NA    0
#Hornet 4 Drive    21.4  NA    0
#Hornet Sportabout 18.7  NA    0
#Valiant           18.1   6    0

DF still has 15 values of mpg with VAR1 equal to 0, because neither mpg nor VAR1 contains an NA. Now we compare the mean of mpg in the two cases you compared:

DF %>% na.omit() %>% group_by(VAR1) %>% summarise(mean(mpg))
# A tibble: 2 x 2
#   VAR1 `mean(mpg)`
#  <dbl>       <dbl>
#1     0        18.2
#2     1        21.3

mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72
mean(DF[DF$VAR1 == 1, 'mpg',], na.rm = TRUE)
#[1] 21.3

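Clearly the mean values differ for VAR1 equal to 0. This is because na.omit() in DF %>% na.omit() operates on the whole of DF, so an NA in any column of a row causes that entire row to be dropped. That is why the three rows with NA in cyl are omitted, leaving only 12 values of mpg for VAR1 equal to 0, and the mean is computed over those 12 values.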

DF %>% na.omit()
#                     mpg cyl VAR1
#Mazda RX4           21.0   6    0
#Mazda RX4 Wag       21.0   6    0
#Valiant             18.1   6    0
#Duster 360          14.3   8    0
#Merc 240D           24.4   4    0
#Merc 230            22.8   4    0
#Merc 280            19.2   6    0
#Merc 280C           17.8   6    0
#Merc 450SE          16.4   8    0
#Merc 450SL          17.3   8    0
#Merc 450SLC         15.2   8    0
#Cadillac Fleetwood  10.4   8    0
#Lincoln Continental 10.4   8    1
#Chrysler Imperial   14.7   8    1
#Fiat 128            32.4   4    1
#Honda Civic         30.4   4    1
#Toyota Corolla      33.9   4    1
#Toyota Corona       21.5   4    1
#Dodge Challenger    15.5   8    1
#AMC Javelin         15.2   8    1
#Camaro Z28          13.3   8    1
#Pontiac Firebird    19.2   8    1
#Fiat X1-9           27.3   4    1
#Porsche 914-2       26.0   4    1
#Lotus Europa        30.4   4    1
#Ford Pantera L      15.8   8    1
#Ferrari Dino        19.7   6    1
#Maserati Bora       15.0   8    1
#Volvo 142E          21.4   4    1

DF %>% na.omit() %>% filter(VAR1 == 0) %>% pull(mpg) %>% mean
# [1] 18.15833
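
To see where rows are being lost, it can help to count the NAs per column and compare group sizes before and after na.omit(). The following is only a minimal sketch that rebuilds the example data from this answer; the names na_per_column, rows_before and rows_after are made up for illustration.

```r
library(dplyr)

# Rebuild the example data used in this answer
DF <- mtcars %>% select(mpg, cyl)
DF$VAR1 <- c(rep(0, 15), rep(1, 17))
DF$cyl[3:5] <- NA

# Number of NA values in each column
na_per_column <- colSums(is.na(DF))
na_per_column

# Number of rows per VAR1 group before and after na.omit()
rows_before <- table(DF$VAR1)
rows_after  <- table(na.omit(DF)$VAR1)
rows_before
rows_after
```

If a column that does not appear in the summary has a non-zero NA count, na.omit() on the full data frame will shrink the groups even though the summarised column itself is complete.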

On the other hand, DF[DF$VAR1 == 0, 'mpg',] operates only on the mpg column and has nothing to do with cyl. That is why it returns 15 values of mpg with no NA, so na.rm does not drop anything.

DF[DF$VAR1 == 0, 'mpg',]
#[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4
#[13] 17.3 15.2 10.4

mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72
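
If the goal is for the grouped summary to agree with the base R subset, one possible approach is to remove NAs only in the columns the summary actually uses, or to pass na.rm = TRUE to mean() inside summarise(). This is a sketch under the assumptions of the example above, not necessarily what fits your real data; the names by_group, by_group_narm, mean_mpg and base_mean_0 are illustrative.

```r
library(dplyr)

# Rebuild the example data used in this answer
DF <- mtcars %>% select(mpg, cyl)
DF$VAR1 <- c(rep(0, 15), rep(1, 17))
DF$cyl[3:5] <- NA

# Option 1: drop rows with NA only in the columns the summary uses
by_group <- DF %>%
  filter(!is.na(VAR1), !is.na(mpg)) %>%
  group_by(VAR1) %>%
  summarise(mean_mpg = mean(mpg))

# Option 2: keep every row and let mean() skip missing mpg values
# (rows with NA in VAR1, if any, would form their own NA group here)
by_group_narm <- DF %>%
  group_by(VAR1) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# Base R subsetting, for comparison
base_mean_0 <- mean(DF[DF$VAR1 == 0, "mpg"], na.rm = TRUE)

by_group
by_group_narm
base_mean_0
```

In this example both options ignore the NAs in cyl, so the group with VAR1 equal to 0 keeps all 15 rows and its mean matches the base R result.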
