Cumulative sums over run lengths. Can this loop be vectorized?
I have a data frame on which I calculate a run-length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.
dir.rle <- rle(df$dir)
I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.
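# tmp is not defined in the post; presumably a copy of df with a cumval column
ndx <- 1
for (i in seq_along(dir.rle$lengths)) {
  l <- dir.rle$lengths[i] - 1   # offset from the first row of the run to its last
  s <- ndx                      # first row of the run
  e <- ndx + l                  # last row of the run
  tmp[s:e, ]$cumval <- cumsum(df[s:e, ]$val)
  ndx <- e + 1                  # the next run starts on the following row
}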
The run lengths of dir define the start, s, and end, e, for each run. The above code works, but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.
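For illustration, the start and end of every run can be computed in one go from the run lengths, without tracking ndx by hand. This is only a sketch: the sample vector is made up, and the names s and e simply mirror the loop above.

```r
# Hypothetical sample values for dir (-1, 0, or 1, as in the question)
dir <- c(1, 1, -1, -1, -1, 0, 1, 1)
dir.rle <- rle(dir)

# The end of each run is the running total of the run lengths;
# the start is the end minus the run length, plus one.
e <- cumsum(dir.rle$lengths)
s <- e - dir.rle$lengths + 1

# One row per run: the run's value and the rows it spans
runs <- data.frame(value = dir.rle$values, s = s, e = e)
print(runs)
```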
Comments (3)
This can be broken down into a two-step problem. First, if we create an indexing column based off of the rle, then we can use that to group by and run the cumsum. The group by can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
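A minimal sketch of the two steps, using base R's ave() for the grouped step instead of data.table or plyr so that it runs without extra packages. The sample data frame and the column name grp are assumptions made up for the example.

```r
# Hypothetical sample data; dir takes values -1, 0, or 1 as in the question
df <- data.frame(
  dir = c(1, 1, -1, -1, -1, 0, 1, 1),
  val = c(2, 3, 1, 4, 2, 5, 1, 1)
)

# Step 1: build an index column from the run-length encoding.
# Each run gets its own id, repeated once per row of the run.
dir.rle <- rle(df$dir)
df$grp <- rep(seq_along(dir.rle$lengths), times = dir.rle$lengths)

# Step 2: cumulative sum of val within each run
df$cumval <- ave(df$val, df$grp, FUN = cumsum)

print(df)
```

Once grp exists, the grouped step should carry over to other tools. With data.table it would typically be written as dt[, cumval := cumsum(val), by = grp], and with plyr as ddply(df, .(grp), transform, cumval = cumsum(val)).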
Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:
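A sketch of that idea, assuming a made-up sample data frame and a column named grp (neither comes from the post). The grouped cumulative sum at the end uses ave(), which is just one of several ways to apply it.

```r
# Hypothetical sample data; dir takes values -1, 0, or 1 as in the question
df <- data.frame(
  dir = c(1, 1, -1, -1, -1, 0, 1, 1),
  val = c(2, 3, 1, 4, 2, 5, 1, 1)
)

# diff() is non-zero wherever dir changes from one row to the next.
# Prepending TRUE makes the first row open group 1; cumsum() then
# bumps the group id by one at every change.
df$grp <- cumsum(c(TRUE, diff(df$dir) != 0))

# Cumulative sum of val within each group
df$cumval <- ave(df$val, df$grp, FUN = cumsum)

print(df)
```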
Add a 'group' column to the data frame. Something like:
then use tapply to sum within groups:
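One way this could look, as a sketch only: the sample data is invented, and building the group column with inverse.rle() is an assumption about how to do it rather than something stated in the post. tapply() with sum gives one total per run; passing cumsum instead gives the running sums the question asks for.

```r
# Hypothetical sample data; dir takes values -1, 0, or 1 as in the question
df <- data.frame(
  dir = c(1, 1, -1, -1, -1, 0, 1, 1),
  val = c(2, 3, 1, 4, 2, 5, 1, 1)
)

# Replace each run's value with a run number, then expand back to full length
r <- rle(df$dir)
r$values <- seq_along(r$values)
df$group <- inverse.rle(r)

# Total of val within each group (one number per run)
totals <- tapply(df$val, df$group, sum)
print(totals)

# Running sum within each group. tapply() returns one vector per group;
# unlist() puts them back in row order only because the groups are
# contiguous and numbered in order of appearance.
df$cumval <- unlist(tapply(df$val, df$group, cumsum), use.names = FALSE)
print(df)
```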