Replace a value with the group value if n = 1

Posted on 2025-01-17 12:46:44

I am trying to take grouped means for a dataset that has a lot of missing data, and in which some groups have only 1 or 0 samples from which to derive a mean. I want a mean for each species within each ocean. However, for species with only one value (or none) in an ocean, I would like to use a "global" mean instead, i.e. the mean for that species across all oceans (rather than taking a "mean" of a single value).

My data looks like this:

library(dplyr)

species <- c("turtle", "turtle", "turtle", "turtle",
             "turtle", "turtle", "turtle", "turtle",
             "shark",  "shark",  "shark",  "shark",
             "shark",  "shark",  "shark",  "shark",
             "bird")
gear <- c("t", "p", "t", "p",
          "t", "p", "t", "p",
          "t", "p", "t", "p",
          "t", "p", "t", "p",
          "t")
# ocean region (named `region` to match the grouping below)
region <- c("north", "south", "east", "west",
            "north", "south", "east", "west",
            "north", "south", "east", "west",
            "north", "south", "east", "west",
            "north")
rate <- c(0.1, 0.2, 0.3, 0.4,
          0.2, 0.2, 0.3, 0.4,
          0.1, 0.2, 0.3, 0.4,
          0.2, 0.2, 0.3, 0.4,
          0.1)

# data.frame() keeps rate numeric; cbind() would coerce every column to character
df <- data.frame(species, gear, region, rate)

# grouped mean, sd, n, and an approximate 95% interval per species/gear/region
db <- df %>%
  group_by(species, gear, region) %>%
  summarize(mean = mean(rate),
            sd = sd(rate),
            n = n()) %>%
  mutate(se = sd / sqrt(n),
         upper_rate = mean + 1.96 * se,
         lower_rate = mean - 1.96 * se)

What I would like to do is populate a dataframe with grouped means for each species, ocean and gear, but for those with only one rate (e.g. birds), I want it to assign the "global" mean to all oceans (e.g. the bird mean in the south, east, and west oceans would be 0.10).

I am looking for the output to look like this:

I am trying to do this in a clean and reproducible way. I think it's really simple but can't seem to figure it out! Any help would be greatly appreciated!!

深府石板幽径 2025-01-24 12:46:45

library(data.table)

# dummy data
species <- c("turtle","turtle","turtle","turtle",
            "turtle","turtle","turtle","turtle",
            "shark",  "shark", "shark","shark",
            "shark",  "shark", "shark","shark",
            "bird")
gear <- c("t", "p", "t", "p",
         "t", "p", "t", "p",
         "t", "p", "t", "p",
         "t", "p", "t", "p",
         "t"  ) 
region <- c("north", "south", "east", "west", 
          "north", "south", "east", "west",
          "north", "south", "east", "west", 
          "north", "south", "east", "west",
          "north")
rate <-c( 0.1 , 0.2, 0.3, 0.4,
         0.2 , 0.2, 0.3, 0.4,
         0.1 , 0.2, 0.3, 0.4,
         0.2 , 0.2, 0.3, 0.4,
         0.1 )
df <- data.table(species, gear, region, rate)
# use setDT(df) if your dataframe isn't a data table already. Use class(df) to check

# grouping columns
grp_cols <- c('species', 'gear', 'region')

# create all combinations of species, gear and region
all_combos <- CJ(species = df$species, gear = df$gear, region = df$region,
                 unique = TRUE)

# calculate mean, sd per species/gear/region group
grp_stats <- df[, .(mean = mean(rate)
                    , sd = sd(rate)
                    )
                , by = grp_cols
                ]

# join back to all combinations (combinations absent from df get NA)
grp_full <- grp_stats[all_combos, on = grp_cols]

# calculate mean, sd at species level
sp_stats <- df[, .(mean_species = mean(rate)
                   , sd_species = sd(rate)
                   )
               , by = .(species)
               ]

# join species-level stats onto all combinations
res <- sp_stats[grp_full, on = .(species)]

# coalesce: fall back to the species-level value wherever the group value is NA
res[, `:=` (mean = fcoalesce(mean, mean_species)
            , sd = fcoalesce(sd, sd_species)
            )
    ][, c('mean_species', 'sd_species') := NULL]

print(res)

# clean environment: drop the intermediate tables but keep the result `res`
rm(all_combos, grp_stats, grp_full, sp_stats)

Output:

    species gear region   mean         sd
 1:    bird    p   east 0.1000         NA
 2:    bird    p  north 0.1000         NA
 3:    bird    p  south 0.1000         NA
 4:    bird    p   west 0.1000         NA
 5:    bird    t   east 0.1000         NA
 6:    bird    t  north 0.1000         NA
 7:    bird    t  south 0.1000         NA
 8:    bird    t   west 0.1000         NA
 9:   shark    p   east 0.2625 0.10606602
10:   shark    p  north 0.2625 0.10606602
11:   shark    p  south 0.2000 0.00000000
12:   shark    p   west 0.4000 0.00000000
13:   shark    t   east 0.3000 0.00000000
14:   shark    t  north 0.1500 0.07071068
15:   shark    t  south 0.2625 0.10606602
16:   shark    t   west 0.2625 0.10606602
17:  turtle    p   east 0.2625 0.10606602
18:  turtle    p  north 0.2625 0.10606602
19:  turtle    p  south 0.2000 0.00000000
20:  turtle    p   west 0.4000 0.00000000
21:  turtle    t   east 0.3000 0.00000000
22:  turtle    t  north 0.1500 0.07071068
23:  turtle    t  south 0.2625 0.10606602
24:  turtle    t   west 0.2625 0.10606602

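This is based on my understanding that you would like to end up with a table with ALL combinations of the columns species, gear and region.
For combinations that did not exist in the original data, both mean and sd fall back to the values grouped by species. For combinations that had only one row in the original data, sd is NA and therefore also falls back to the species-level sd; the mean of a one-row group is just that single value, so the code above keeps it as is.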
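If the mean of a one-row group should also be replaced by the species-level mean (which is how I read the question), one option is to count the rows per group and apply an explicit threshold. The following is only a minimal base R sketch under that assumption, not the code used above: the toy data, the `min_n` threshold and the helper names are made up for illustration, and `gear` is left out to keep it short.

```r
# Toy data (hypothetical values, smaller than the data set in the question)
df <- data.frame(
  species = c("turtle", "turtle", "turtle", "turtle", "bird"),
  region  = c("north", "north", "south", "east", "north"),
  rate    = c(0.1, 0.2, 0.4, 0.3, 0.1),
  stringsAsFactors = FALSE
)
min_n <- 2  # assumed threshold: groups with fewer rows use the species mean

# Per-group mean and row count
grp_mean <- aggregate(rate ~ species + region, data = df, FUN = mean)
names(grp_mean)[3] <- "mean"
grp_n <- aggregate(rate ~ species + region, data = df, FUN = length)
names(grp_n)[3] <- "n"
grp <- merge(grp_mean, grp_n, by = c("species", "region"))

# Species-level ("global") mean
sp <- aggregate(rate ~ species, data = df, FUN = mean)
names(sp)[2] <- "species_mean"

# Every species/region combination, including ones absent from the data
grid <- expand.grid(species = unique(df$species), region = unique(df$region),
                    stringsAsFactors = FALSE)
out <- merge(grid, grp, by = c("species", "region"), all.x = TRUE)
out <- merge(out, sp, by = "species")

# Combinations that never occurred have n = NA after the join; treat them as 0 rows
out$n[is.na(out$n)] <- 0

# Replace the mean wherever the group is too small (or missing entirely)
use_global <- out$n < min_n
out$mean[use_global] <- out$species_mean[use_global]
out$species_mean <- NULL
out <- out[order(out$species, out$region), ]
print(out)
```

With these toy values, turtle/north keeps its own mean of 0.15 (two rows), while turtle/south and turtle/east (one row each) and every bird combination fall back to the species mean (0.25 for turtle, 0.1 for bird).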

As you can see in the output above, we still have NA for bird's sd. This is because we need at least 2 data points to calculate sd, but bird only had one row (data point) in the original data, so even the species-level sd is NA.
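A quick way to see this for yourself, as a small base R check (the variable names are just for illustration): the sample standard deviation divides by n - 1, so a single value gives NA, and that NA carries through to the standard error and the interval bounds computed in the question.

```r
# sd() of a single value is NA because the denominator n - 1 is zero
sd_one <- sd(0.1)

# with two values it is defined (this is the shark/turtle "t north" group)
sd_two <- sd(c(0.1, 0.2))

# NA propagates into the standard error and the upper bound from the question
se_one <- sd_one / sqrt(1)
upper_one <- 0.1 + 1.96 * se_one

print(c(sd_one = sd_one, sd_two = sd_two, upper_one = upper_one))
```

So for species like bird there is no way to get an sd, se or interval out of this data alone; the NA has to be either accepted or filled from some other source.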

For future examples it may be better to use a more simplified data set. Anywho, hope this helps.
