Summarizing within conditions that repeat over time

Posted on 2025-01-29 03:02:04

I am trying to summarize data within time intervals using a data set with conditions repeated over time at varying intervals. I would like to get means and standard deviations within time intervals for each of the conditions.

However, in my real data I don't know how many intervals of each condition there will be. I thought perhaps I could indicate the end of an interval by a change in Condition from one row to the next row. But I don't know how to code that.

library(tidyverse)

df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
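) %>%
        rowwise() %>%
        # Note: rnorm(1.4, 0.3) passes n = 1.4 and mean = 0.3 (sd defaults to 1);
        # rnorm(1, 1.4, 0.3) was probably intended. Kept as posted so that it
        # matches the outputs shown in this thread.
        mutate(X = rnorm(1.4, 0.3))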

I'm trying to calculate mean(X) and sd(X) for each interval of Condition (made-up numbers):

Condition   interval        mean(X)   sd(X)
A            [160,190]       1.4      0.32
B            [190.05,230]    1.46     0.36
C            [230.05,260]    1.32     0.26
A            [260.05,293]    1.5      0.40
B            [293.05,321]    1.25     0.34
C            [321.05,352]    1.43     0.41

I've tried this, but it doesn't do what I need:

df %>%  
        group_by(Condition) %>%
        mutate(interval = cut(Time,
                              breaks = c(floor(min(Time)), ceiling(max(Time))),
                              include.lowest = F, 
                              right = F)) %>%
        group_by(Condition, interval) %>% 
        summarise( mean.X = mean(X),
                   sd.X = sd(X))

This doesn't give me the second intervals for each Condition:

  Condition interval  mean.X   sd.X
  <chr>     <fct>      <dbl>  <dbl>
1 A         [160,293)  0.231  0.991
2 A         NA         1.61  NA    
3 B         [190,321)  0.421  0.893
4 B         NA         0.249 NA    
5 C         [230,352)  0.193  0.898
6 C         NA         0.427 NA

Any suggestions?

捎一片雪花 2025-02-05 03:02:04

We can use rle to define "groups" of your Condition.

library(dplyr)

df %>% 
  ungroup() %>% 
  mutate(group = rep(1:length(rle(Condition)$lengths), rle(Condition)$lengths)) %>% 
  group_by(group) %>% 
  summarize(Condition = unique(Condition),
            interval = paste0("[", range(Time)[1], ",", range(Time)[2], "]"), 
            mean_X = mean(X), 
            sd_X = sd(X))

# A tibble: 6 × 5
  group Condition interval     mean_X  sd_X
  <int> <chr>     <chr>         <dbl> <dbl>
1     1 A         [160,190]    0.160  0.926
2     2 B         [190.05,230] 0.0258 0.990
3     3 C         [230.05,260] 0.296  1.03 
4     4 A         [260.05,293] 0.472  1.08 
5     5 B         [293.05,321] 0.0363 1.08 
6     6 C         [321.05,352] 0.361  1.10
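
To see what the run-length encoding is doing, here is a minimal sketch on a toy vector (the names `cond`, `runs` and `run_id` are made up for illustration and are not part of the data above). rle() returns the length and value of each run of consecutive identical values; repeating a run counter by those lengths gives every row the id of its run, so a condition that comes back later gets a new id.

```r
# Toy vector: condition "A" appears in two separate runs
cond <- c("A", "A", "B", "B", "B", "A", "C", "C")

runs <- rle(cond)
runs$lengths  # 2 3 1 2
runs$values   # "A" "B" "A" "C"

# One id per run, repeated to the length of that run
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id        # 1 1 2 2 2 3 4 4
```

Grouping by an id like this, rather than by Condition alone, is what keeps the first and second "A" intervals apart.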

酒浓于脸红 2025-02-05 03:02:04

I definitely think there should be a less messy way to do it, but kmeans() gives the following possible solution:

library(tidyverse)

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

df %>% 
  group_by(Condition) %>% 
  mutate(Block = kmeans(Time, 2)$cluster) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = as.character(cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, 
                        right = T))) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X),
            sd.X = sd(X)) %>% 
  arrange(Condition, interval)
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <chr>      <dbl> <dbl>
#> 1 A         [160,190]  0.382 0.819
#> 2 A         [260,293]  0.277 0.940
#> 3 B         [190,230]  0.229 1.14 
#> 4 B         [293,321]  0.303 1.08 
#> 5 C         [230,260]  0.265 0.755
#> 6 C         [321,352]  0.301 0.900

It's up to you how the NAs are dealt with.

Edit 1:

Added @Sinh Nguyen's cut improvements.

Edit 2: In response to updated question:

We can use the rleid() function from data.table:

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

Block <- rleid(df$Condition)
df %>% 
  add_column(Block) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = paste0("[", min(Time), ",", max(Time), "]")) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X), sd.X = sd(X))
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval     mean.X  sd.X
#>   <chr>     <chr>         <dbl> <dbl>
#> 1 A         [160,190]     0.382 0.819
#> 2 A         [260.05,293]  0.277 0.940
#> 3 B         [190.05,230]  0.229 1.14 
#> 4 B         [293.05,321]  0.303 1.08 
#> 5 C         [230.05,260]  0.265 0.755
#> 6 C         [321.05,352]  0.301 0.900
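
If you would rather not load data.table (it masks a few dplyr and purrr functions, as the messages above show), the same block id can probably be built by counting changes in Condition from one row to the next. The following is only a sketch on a small made-up data frame: `toy`, its values and the result name `res` are assumptions for illustration, not the data above. Recent dplyr versions (1.1.0 and later) also provide consecutive_id() for the same purpose.

```r
library(dplyr)

# Made-up data: condition "A" occurs in two separate blocks
toy <- data.frame(
  Condition = c("A", "A", "B", "B", "A", "A"),
  Time      = c(1, 2, 3, 4, 5, 6),
  X         = c(10, 12, 20, 22, 30, 34)
)

res <- toy %>%
  # Block goes up by 1 whenever Condition differs from the previous row
  mutate(Block = cumsum(Condition != lag(Condition, default = first(Condition))) + 1) %>%
  group_by(Block, Condition) %>%
  summarise(interval = paste0("[", min(Time), ",", max(Time), "]"),
            mean.X = mean(X),
            sd.X = sd(X),
            .groups = "drop")

res
```

Keeping Block in the grouping, instead of regrouping by the interval label, means two blocks of the same condition stay separate even if their labels happened to be identical.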

我恋#小黄人 2025-02-05 03:02:04

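The second group with NA values for each Condition comes from the arguments you pass to cut(): with right = F (and include.lowest = F) the interval is closed on the left and open on the right, so the record with Time == max(Time), which here coincides with the upper break, is excluded from the interval and gets NA.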

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = F, right = F)) %>%
  filter(is.na(interval))
#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition  Time      X interval
#>   <chr>     <dbl>  <dbl> <fct>   
#> 1 A           293 -1.52  <NA>    
#> 2 B           321  1.35  <NA>    
#> 3 C           352  0.758 <NA>

As you can see above, one record in each group has an NA interval.
If you change the cut parameters to right = T and include.lowest = T, all of them are included.

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, right = T)) %>%
  group_by(Condition, interval) %>% 
  summarise( mean.X = mean(X),
             sd.X = sd(X))

#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <fct>      <dbl> <dbl>
#> 1 A         [160,293]  0.230 0.963
#> 2 B         [190,321]  0.124 0.961
#> 3 C         [230,352]  0.146 0.961
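
The effect of the two arguments can be checked in isolation. This is a minimal sketch with three made-up times (`time`, `brks`, `open_right` and `closed` are illustrative names), where the last value sits exactly on the upper break:

```r
# Three made-up times; the last one equals the upper break
time <- c(160, 200, 293)
brks <- c(160, 293)

# Left-closed, right-open: 293 falls outside [160,293) and becomes NA
open_right <- cut(time, breaks = brks, include.lowest = FALSE, right = FALSE)
as.character(open_right)   # "[160,293)" "[160,293)" NA

# Closed on both ends: every value is assigned to [160,293]
closed <- cut(time, breaks = brks, include.lowest = TRUE, right = TRUE)
as.character(closed)       # "[160,293]" "[160,293]" "[160,293]"
```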

If this is not what you expected, please clarify how you would like the intervals to be defined.

Created on 2022-05-16 by the reprex package (v2.0.1)
