Lagged cumulative sum of a time series based on multiple conditions
I'd like to get the cumulative sum of the corresponding records in the `smaller` column for each name under `Species_a` and `Species_b` as two new columns, and have them in the same row without including the value for that row. The `smaller` column lists which species column has the smaller width.
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller
1 versicolor virginica 2.5 3.0 2022-05-05 a
2 versicolor virginica 2.6 2.8 2022-04-04 a
3 versicolor setosa 2.2 4.4 2021-03-03 a
4 setosa virginica 4.2 2.5 2021-02-02 b
5 virginica setosa 3.0 3.4 2020-01-01 a
Ideally the data would stay in the same format as it is now, and the summation would be based on the `smaller`, `Date`, `Species_a`, and `Species_b` columns alone. I tried to create a count column, but I get stuck on accumulating properly based on `Date` being less than the current row's value for that column.
My desired output would be as follows:
Species_a Species_b Sepal.Width_a Sepal.Width_b Date smaller smaller_sum_a smaller_sum_b
1 versicolor virginica 2.5 3.0 2022-05-05 a 2 2
2 versicolor virginica 2.6 2.8 2022-04-04 a 1 2
3 versicolor setosa 2.2 4.4 2021-03-03 a 0 0
4 setosa virginica 4.2 2.5 2021-02-02 b 0 1
5 virginica setosa 3.0 3.4 2020-01-01 a 0 0
Code:
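library(tidyverse)

set.seed(12)

# Six shuffled iris rows, with "_a" appended to every column name
data_a <- iris[sample(1:nrow(iris)), ] %>%
  head()
colnames(data_a) <- paste0(colnames(data_a), "_a")

# Six rows from a second shuffle, with "_b" appended to every column name
data_b <- iris[sample(1:nrow(iris)), ] %>%
  tail()
colnames(data_b) <- paste0(colnames(data_b), "_b")

# Pair the rows side by side, keep pairs of different species,
# and record which side has the smaller sepal width (NA on ties)
data <- bind_cols(data_a, data_b) %>%
  filter(Species_a != Species_b) %>%
  select(Species_a,
         Species_b,
         Sepal.Width_a,
         Sepal.Width_b) %>%
  # Dates are stored as character strings here
  mutate(Date = c('2022-05-05', '2022-04-04', '2021-03-03', '2021-02-02', '2020-01-01'),
         smaller = ifelse(Sepal.Width_a > Sepal.Width_b, 'b',
                          ifelse(Sepal.Width_a < Sepal.Width_b, 'a', NA)))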
Comments (2)
I don't know if this is a solution, but it might be a start.

How exactly are the new columns calculated? It looks like `smaller_sum_a` is the number of consecutive rows where species a has the smaller value, minus one. But I don't think the same works for `smaller_sum_b`. Or is it just the cumulative number of days where each species has the smaller value, minus 1, but with zero if the species isn't smaller in that row? (Again, this doesn't check out for `smaller_sum_b` though...)

As for determining if `Date` is less than the current value, firstly you'll want to tell R that your `Date` column is actually a date, rather than just a character. The easiest way to see what format it is in is to make your `data` (not a good name for your data btw, preferably make it something that R or the computer wouldn't use, like `my_data`) a `tibble` rather than a `data.frame`. `tibble`s tell you what format each column is in, which is handy.

The bits inside the `< >` under the column names tell you the formats: `<fct>` is `factor`, `<dbl>` is `numeric`, and `<chr>` is `character`.

So, we want to make `Date` into a `date` format, which we can do with the `ymd()` (year-month-day) function from `lubridate`. Also, I rearranged the data so the rows are in chronological order (earliest at the top), because that's how things are normally arranged, and it makes more sense to me, especially if you're interested in cumulative sums.

We can see that R now recognises that the `Date` column is a date, and is now in the R-recognised `<date>` format.

Now this is where I'm not 100% sure on exactly how you want to calculate your new columns, but for example you can use `ifelse()` to determine if species a is smaller, and then calculate the cumulative sum of the days where it was smaller.

As long as either a) the `Date` column is in an R-recognised `<date>` format, or b) it is arranged chronologically, you can use the less than or greater than operators `<` and `>` to determine if dates are before/after a given row.

This is a good resource for understanding how R treats dates and times, and is well worth a read: https://r4ds.had.co.nz/dates-and-times.html
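The answer's own code is not shown above, so the following is only a rough sketch of the steps it describes, not the answerer's original. It assumes the data frame is rebuilt by hand from the question's table under the suggested name `my_data`, and the column name `a_smaller_so_far` is made up for illustration.

```r
library(dplyr)
library(lubridate)

# Hand-typed copy of the question's table (an assumption, to avoid
# depending on the random sampling in the question's code)
my_data <- tibble(
  Species_a     = c("versicolor", "versicolor", "versicolor", "setosa", "virginica"),
  Species_b     = c("virginica", "virginica", "setosa", "virginica", "setosa"),
  Sepal.Width_a = c(2.5, 2.6, 2.2, 4.2, 3.0),
  Sepal.Width_b = c(3.0, 2.8, 4.4, 2.5, 3.4),
  Date          = c("2022-05-05", "2022-04-04", "2021-03-03", "2021-02-02", "2020-01-01"),
  smaller       = c("a", "a", "a", "b", "a")
)

my_data <- my_data %>%
  mutate(Date = ymd(Date)) %>%  # character -> Date, so < and > compare real dates
  arrange(Date) %>%             # chronological order, earliest at the top
  # running count of rows (including the current one) where column a was smaller
  mutate(a_smaller_so_far = cumsum(ifelse(smaller == "a", 1, 0)))

my_data$a_smaller_so_far
# 1 1 2 3 4
```

Printing `my_data` should now show `<date>` under the `Date` column name. Note that this running count includes the current row and ignores which species is in each column, so it does not yet reproduce the desired output in the question.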
Here is my current solution. I'd like to avoid plyr if I can help it, since I've heard it breaks some of dplyr's functions. I feel like there is definitely a more efficient and modern way of solving this issue, but I can't seem to find it.
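The commenter's code is not included here. As a separate illustration, one reading of the desired output is: for each row, count the strictly earlier rows in which that row's species (in either column) was the one with the smaller width. That reading matches the desired table in the question, but it is an interpretation rather than something the question confirms. A minimal dplyr-only sketch under that assumption follows; the helper `count_earlier_smaller()` and the temporary column `smaller_species` are hypothetical names, and the data is typed in by hand from the question.

```r
library(dplyr)

# Hand-typed copy of the question's table, with Date already parsed
my_data <- tibble(
  Species_a     = c("versicolor", "versicolor", "versicolor", "setosa", "virginica"),
  Species_b     = c("virginica", "virginica", "setosa", "virginica", "setosa"),
  Sepal.Width_a = c(2.5, 2.6, 2.2, 4.2, 3.0),
  Sepal.Width_b = c(3.0, 2.8, 4.4, 2.5, 3.4),
  Date          = as.Date(c("2022-05-05", "2022-04-04", "2021-03-03",
                            "2021-02-02", "2020-01-01")),
  smaller       = c("a", "a", "a", "b", "a")
)

# Hypothetical helper: for each row i, count rows dated strictly before
# row i in which species[i] was the species with the smaller width
count_earlier_smaller <- function(species, smaller_species, date) {
  vapply(seq_along(date), function(i) {
    sum(date < date[i] & smaller_species == species[i], na.rm = TRUE)
  }, integer(1))
}

result <- my_data %>%
  mutate(
    # name of the species with the smaller width in this row (NA on ties)
    smaller_species = case_when(
      smaller == "a" ~ as.character(Species_a),
      smaller == "b" ~ as.character(Species_b)
    ),
    smaller_sum_a = count_earlier_smaller(Species_a, smaller_species, Date),
    smaller_sum_b = count_earlier_smaller(Species_b, smaller_species, Date)
  ) %>%
  select(-smaller_species)

result$smaller_sum_a
# 2 1 0 0 0
result$smaller_sum_b
# 2 2 0 1 0
```

The rows keep their original order because the comparison uses the dates directly instead of relying on sorting. The helper compares every row against every other row, which is fine for small data but scales quadratically with the number of rows.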