Cumulative sums over run lengths. Can this loop be vectorized?

Posted 12-17 01:37 · 525 words · 8 views · 0 comments

I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.

dir.rle <- rle(df$dir)

I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.

ndx <- 1
for(i in 1:length(dir.rle$lengths)) {
    l <- dir.rle$lengths[i] - 1
    s <- ndx
    e <- ndx+l
    tmp[s:e,]$cumval <- cumsum(df[s:e,]$val)
    ndx <- e + 1
}

The run lengths of dir define the start, s, and end, e, for each run. The above code works but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.

一杆小烟枪 2024-12-24 01:37:34

This can be broken down into a two-step problem. First, create an indexing column from the rle result; that column can then be used to group the rows and run cumsum within each group. The grouping itself can be done with any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.

library(data.table)
library(plyr)
#data.table is the same thing as a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))


#What does the indexer column look like?
> head(dat)
     dir      value indexer
[1,]   1  0.5045807       1
[2,]   0  0.2660617       2
[3,]   1  1.0369641       3
[4,]   1 -0.4514342       3
[5,]  -1 -0.3968631       4
[6,]  -1 -2.1517093       4


#data.table approach
dat[, cumsum(value), by = indexer]

#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))
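
If the goal is to write the running totals back into the original data frame, as the loop in the question does, a base R sketch with ave() may be enough. The data below are made up for illustration, the column names val and cumval are borrowed from the question, and the loop is a simplified restatement of the asker's logic used only as a cross-check.

```r
# Toy data (assumed for illustration); column names follow the question
df <- data.frame(dir = c(1, 1, 0, -1, -1, -1, 1),
                 val = c(2, 3, 5, 1, 1, 1, 4))
dir.rle <- rle(df$dir)

# Grouped version: one id per run, then cumsum within each id.
# ave() returns a vector aligned with the original rows.
indexer <- rep(seq_along(dir.rle$lengths), dir.rle$lengths)
df$cumval <- ave(df$val, indexer, FUN = cumsum)

# Loop version, following the question's logic, for comparison
loop_cumval <- numeric(nrow(df))
ndx <- 1
for (i in seq_along(dir.rle$lengths)) {
  e <- ndx + dir.rle$lengths[i] - 1
  loop_cumval[ndx:e] <- cumsum(df$val[ndx:e])
  ndx <- e + 1
}

df$cumval                          # 2 5 5 1 2 3 4
all.equal(df$cumval, loop_cumval)  # TRUE
```

Unlike the two grouped calls above, which return a new table with a V1 column, this keeps the result in the same row order as df, so no join is needed afterwards.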

溺渁∝ 2024-12-24 01:37:34

Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).

I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:

df$group <- c(0, cumsum(!(diff(df$dir)==0)))

# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
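
As a quick sanity check, the diff()-based grouping can be compared with the rle-based index on a small vector. This is only a sketch on made-up data; the variable names group_diff and group_rle are introduced here for illustration. It assumes dir has no missing values: with an NA, diff() returns NA at that position and cumsum() propagates it to every later element.

```r
# Toy vector (assumed for illustration)
dir <- c(1, 1, 0, -1, -1, -1, 1)

# diff() is nonzero exactly where a new run starts
changes <- diff(dir) != 0
group_diff <- c(0, cumsum(changes))

# The rle-based index, shifted to start at 0 for comparison
runs <- rle(dir)
group_rle <- rep(seq_along(runs$lengths), runs$lengths) - 1

group_diff                    # 0 0 1 2 2 2 3
all(group_diff == group_rle)  # TRUE
```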

小…楫夜泊 2024-12-24 01:37:34

Add a 'group' column to the data frame. Something like:

df=data.frame(z=rnorm(100)) # dummy data
df$dir = sign(df$z) # dummy +/- 1
rl = rle(df$dir)
df$group = rep(1:length(rl$lengths),times=rl$lengths)

then use tapply to sum within groups:

tapply(df$z,df$group,sum)
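
Note that tapply(..., sum) returns one total per run, not a running sum within each run. If the cumulative values from the question are what is needed, one possible variation is to apply cumsum through ave() instead. The sketch below uses made-up data, and the column name cumval is borrowed from the question; the last cumulative value in each run should equal the tapply total for that run.

```r
# Toy data (assumed for illustration)
df <- data.frame(z = c(0.5, 1.5, -2, -1, 3))
df$dir <- sign(df$z)
rl <- rle(df$dir)
df$group <- rep(seq_along(rl$lengths), times = rl$lengths)

# One total per run
totals <- tapply(df$z, df$group, sum)

# Running sum within each run, aligned with the rows of df
df$cumval <- ave(df$z, df$group, FUN = cumsum)

# The last running value in each run matches that run's total
last_in_run <- tapply(df$cumval, df$group, function(x) x[length(x)])

as.numeric(totals)  # 2 -3 3
df$cumval           # 0.5 2.0 -2.0 -3.0 3.0
```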

About the author

℡寂寞咖啡
