Lagged cumulative sum of a time series based on multiple conditions

Posted on 2025-01-27 21:48:48

I'd like to get the cumulative sum of the corresponding records in the smaller column for each name under Species_a and Species_b as two new columns, and have them in the same row without including the value for that row. The smaller column lists which species column has the smaller width.

  Species_a Species_b Sepal.Width_a Sepal.Width_b  Date        smaller
1 versicolor virginica           2.5           3.0 2022-05-05       a
2 versicolor virginica           2.6           2.8 2022-04-04       a
3 versicolor    setosa           2.2           4.4 2021-03-03       a
4     setosa virginica           4.2           2.5 2021-02-02       b
5  virginica    setosa           3.0           3.4 2020-01-01       a

Ideally the data would stay in the same format as it is now, and the summation would be based on the smaller, Date, Species_a, and Species_b columns alone. I tried to create a count column, but I get stuck on accumulating properly based on Date being earlier than the current row's value.

My desired output would be as follows:

   Species_a Species_b Sepal.Width_a Sepal.Width_b       Date smaller smaller_sum_a smaller_sum_b
1 versicolor virginica           2.5           3.0 2022-05-05       a             2             2
2 versicolor virginica           2.6           2.8 2022-04-04       a             1             2
3 versicolor    setosa           2.2           4.4 2021-03-03       a             0             0
4     setosa virginica           4.2           2.5 2021-02-02       b             0             1
5  virginica    setosa           3.0           3.4 2020-01-01       a             0             0

Code:

library(tidyverse)
set.seed(12)

data_a <- iris[sample(1:nrow(iris)), ] %>% 
  head()
colnames(data_a) <- paste0(colnames(data_a), "_a")

data_b <- iris[sample(1:nrow(iris)), ] %>% 
  tail()
colnames(data_b) <- paste0(colnames(data_b), "_b")

data <- bind_cols(data_a, data_b) %>% 
  filter(Species_a != Species_b) %>% 
  select(Species_a,
         Species_b, 
         Sepal.Width_a,
         Sepal.Width_b) %>% 
  mutate(Date = c('2022-05-05', '2022-04-04', '2021-03-03', '2021-02-02', '2020-01-01'),
         smaller = ifelse(Sepal.Width_a > Sepal.Width_b, 'b', 
                          ifelse(Sepal.Width_a < Sepal.Width_b, 'a', NA)))

Comments (2)

私藏温柔 2025-02-03 21:48:48

I don't know if this is a solution, but it might be a start.

How exactly are the new columns calculated? It looks like smaller_sum_a is the number of consecutive rows where species a has the smaller value, minus one. But I don't think the same works for smaller_sum_b? Or is it just the cumulative number of days where each species has the smaller value, minus 1, but with zero if the species isn't smaller in that row (again, this doesn't check out for smaller_sum_b though...).

As for determining if Date is less than the current value, firstly you'll want to tell R that your Date column is actually a date, rather than just a character.

The easiest way to see what format it is in is to make your data (not a good name for your data btw, preferably make it something that R or the computer wouldn't use, like my_data) a tibble rather than a data.frame. Tibbles tell you what format each column is in, which is handy.

data %>% 
  tibble 

# # A tibble: 5 x 6
#   Species_a  Species_b Sepal.Width_a Sepal.Width_b Date       smaller
#   <fct>      <fct>             <dbl>         <dbl> <chr>      <chr>  
# 1 versicolor virginica           2.5           3   2022-05-05 a      
# 2 versicolor virginica           2.6           2.8 2022-04-04 a      
# 3 versicolor setosa              2.2           4.4 2021-03-03 a      
# 4 setosa     virginica           4.2           2.5 2021-02-02 b      
# 5 virginica  setosa              3             3.4 2020-01-01 a  

The bits inside the < > under the column names tell you the formats: <fct> is factor, <dbl> is numeric (double) and <chr> is character.

So, we want to make Date into a date format, which we can do with the ymd() (year-month-day) function from lubridate. Also, I rearranged the data so the rows are in chronological order (earliest at the top), because that's how things are normally arranged, and makes more sense to me, especially if you're interested in cumulative sums.

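library(lubridate)  # provides ymd(); only attached by library(tidyverse) from tidyverse 2.0.0 on

my_data <- data %>% 
  as_tibble() %>% 
  mutate(Date = ymd(Date)) %>%  # character -> Date
  arrange(Date)                 # earliest row first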

my_data

# # A tibble: 5 x 6
#   Species_a  Species_b Sepal.Width_a Sepal.Width_b Date       smaller
#   <fct>      <fct>             <dbl>         <dbl> <date>     <chr>  
# 1 virginica  setosa              3             3.4 2020-01-01 a      
# 2 setosa     virginica           4.2           2.5 2021-02-02 b      
# 3 versicolor setosa              2.2           4.4 2021-03-03 a      
# 4 versicolor virginica           2.6           2.8 2022-04-04 a      
# 5 versicolor virginica           2.5           3   2022-05-05 a 

We can see that R now recognises that the Date column is a date, and is now in the R-recognised <date> format.

Now this is where I'm not 100% sure on exactly how you want to calculate your new columns, but for example you can use ifelse() to determine if species a is smaller, and then calculate the cumulative sum of the days where it was smaller.

my_data %>%
  mutate(
    s_a = ifelse(smaller == 'a', 1, 0),
    smaller_sum_a = cumsum(s_a),
  )
  
# # A tibble: 5 x 8
#   Species_a  Species_b Sepal.Width_a Sepal.Width_b Date       smaller   s_a smaller_sum_a
#   <fct>      <fct>             <dbl>         <dbl> <date>     <chr>   <dbl>         <dbl>
# 1 virginica  setosa              3             3.4 2020-01-01 a           1             1
# 2 setosa     virginica           4.2           2.5 2021-02-02 b           0             1
# 3 versicolor setosa              2.2           4.4 2021-03-03 a           1             2
# 4 versicolor virginica           2.6           2.8 2022-04-04 a           1             3
# 5 versicolor virginica           2.5           3   2022-05-05 a           1             4
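
Since the question asks for totals that leave out the current row, the running count above could be shifted by one position. This is only a minimal sketch: it assumes the rows are already in chronological order, it reuses the s_a indicator values from the table above, and the names inclusive and exclusive are made up for illustration.

```r
library(dplyr)

# 1 where species a was smaller, in chronological order (values from the table above)
s_a <- c(1, 0, 1, 1, 1)

# running count that includes the current row, as in smaller_sum_a above
inclusive <- cumsum(s_a)

# running count of earlier rows only: shift the cumulative sum down by one row
exclusive <- dplyr::lag(cumsum(s_a), default = 0)

# subtracting the current row's indicator gives the same lagged count
stopifnot(all(exclusive == cumsum(s_a) - s_a))
```

Note that this still counts over all rows regardless of species, so it is a building block rather than the full answer.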

As long as either a) the Date column is in an R-recognised <date> format, or b) it is arranged chronologically, you can use the less-than or greater-than operators (< and >) to determine whether dates fall before or after the date in a given row.
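
As a rough illustration of that comparison, here is a small sketch using the five dates from the question. It uses base R's as.Date(), which should give the same Date values as ymd() for this year-month-day format; the names date_chr, dates and n_earlier are only for illustration.

```r
# the five dates from the question, still stored as character strings
date_chr <- c("2022-05-05", "2022-04-04", "2021-03-03", "2021-02-02", "2020-01-01")

# convert to Date so that < and > compare calendar dates
dates <- as.Date(date_chr)

# is 2021-03-03 before 2022-05-05?
third_before_first <- dates[3] < dates[1]

# for each row, how many of the dates are strictly earlier than that row's date
n_earlier <- sapply(seq_along(dates), function(i) sum(dates < dates[i]))
```

The same dates < dates[i] condition can be combined with other conditions (for example, a matching species) inside sum() to count only the relevant earlier rows.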

This is a good resource for understanding how R treats dates and times, and is well worth a read https://r4ds.had.co.nz/dates-and-times.html

把时间冻结 2025-02-03 21:48:48

Here is my current solution. I'd like to avoid plyr if I can help it, since I've heard it breaks some of dplyr's functions. I feel like there is definitely a more efficient and modern way of solving this issue, but I can't seem to find it.

library(plyr)
library(lubridate)

# creating counts for smaller sums for a

data$Date <- lubridate::parse_date_time(x = data$Date, # standardizing date (outside of the reproducible example there are two date types)
                                              orders = c("%m/%d/%Y", "%Y-%m-%d"))

A_rn <- mutate(filter(select(data, 
                             Species_a, 
                             Date, 
                             smaller), 
                      smaller == 'a'), 
                      smaller_ct_a  = 1)

# creating counts for smaller sums for b
BtoA_rn <- mutate(filter(select(data, 
                                Species_b, 
                                Date, 
                                smaller), 
                         smaller == 'b'), # calling Species_b Species_a for easier joining
                  Species_a = Species_b,
                  smaller_ct_a  = 1) %>% 
  select(Species_a, Date, smaller, smaller_ct_a)

# cumsum for both a and b
A <- ddply(bind_rows(A_rn, BtoA_rn) %>% 
             arrange(Date),
           .(Species_a), transform,
           smaller_sum_a = lag(cumsum(replace_na(smaller_ct_a, 0)))) %>% 
  select(-smaller_ct_a)

# naming adjustment
B <- A %>% filter(smaller == "b") %>% 
  select(-smaller)

names(B) <- gsub(x = names(B), pattern = "_a", replacement = "_b")  
A <- A %>% filter(smaller == "a") %>% 
  select(-smaller)

data <- left_join(data, A, by = c("Species_a", "Date")) %>% 
  left_join(B, by = c("Species_b", "Date"))

data[is.na(data)] <- 0
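
For comparison, the same result could probably be reached without plyr by reshaping to one row per species per record. The following is only a sketch under a few assumptions: the desired output is read as "the number of earlier records in which that species was the smaller one, on either side"; each species has at most one record per date; and the names dat, row_id, side, was_smaller, prior and prior_counts are invented for this example.

```r
library(dplyr)
library(tidyr)

# the example data from the question, typed in directly so the block is self-contained
dat <- tibble(
  Species_a     = c("versicolor", "versicolor", "versicolor", "setosa", "virginica"),
  Species_b     = c("virginica", "virginica", "setosa", "virginica", "setosa"),
  Sepal.Width_a = c(2.5, 2.6, 2.2, 4.2, 3.0),
  Sepal.Width_b = c(3.0, 2.8, 4.4, 2.5, 3.4),
  Date          = as.Date(c("2022-05-05", "2022-04-04", "2021-03-03",
                            "2021-02-02", "2020-01-01")),
  smaller       = c("a", "a", "a", "b", "a")
) %>%
  mutate(row_id = row_number())  # key for joining the counts back on

prior_counts <- dat %>%
  # one row per (record, side), so a species is tracked whichever column it is in
  pivot_longer(c(Species_a, Species_b), names_to = "side",
               names_prefix = "Species_", values_to = "species") %>%
  # 1 if this side's species was the smaller one in this record, else 0
  mutate(was_smaller = as.integer(!is.na(smaller) & side == smaller)) %>%
  group_by(species) %>%
  arrange(Date, .by_group = TRUE) %>%
  # running count per species, minus the current record
  mutate(prior = cumsum(was_smaller) - was_smaller) %>%
  ungroup() %>%
  select(row_id, side, prior) %>%
  pivot_wider(names_from = side, values_from = prior,
              names_prefix = "smaller_sum_")

result <- dat %>%
  left_join(prior_counts, by = "row_id") %>%
  select(-row_id) %>%
  relocate(smaller_sum_a, .before = smaller_sum_b)
```

Because the counts are joined back by row_id, the original row order and columns are kept. If a species can appear more than once on the same date, the tie handling would need to be decided separately.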
