What is an efficient method for partitioning timestamped rows in a data frame into intervals and aggregating them?

Posted 2024-08-25 19:00:04

From a data frame with timestamped rows (strptime results), what is the best method for aggregating statistics for intervals?

Intervals could be an hour, a day, etc.

There's the aggregate function, but that doesn't help with assigning each row to an interval. I'm planning on adding a column to the data frame that denotes interval and using that with aggregate, but if there's a better solution it'd be great to hear it.

Thanks for any pointers!


Example Data

Five rows with timestamps divided into 15-minute intervals starting at 03:00.

Interval 1

  • "2010-01-13 03:02:38 UTC"
  • "2010-01-13 03:08:14 UTC"
  • "2010-01-13 03:14:52 UTC"

Interval 2

  • "2010-01-13 03:20:42 UTC"
  • "2010-01-13 03:22:19 UTC"

Conclusion

Using a time series package such as xts should be the solution; however, I had no success with them and wound up using cut. As I presently only need to plot histograms, with rows grouped by interval, this was enough.

cut is used like so:

interv <- function(x, start, period, num.intervals) {
  return(cut(x, as.POSIXlt(start)+0:num.intervals*period))
}
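
As a minimal sketch of the plan described in the question (add an interval column, then aggregate on it), the function above can be combined with table and aggregate. The value column and the explicit tz = "UTC" are assumptions added for illustration; they are not part of the original post.

```r
# Hypothetical data frame: `value` is made up for illustration.
df <- data.frame(
  time = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                      "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                      "2010-01-13 03:22:19"), tz = "UTC"),
  value = c(10, 20, 30, 40, 50)
)

# Same idea as interv above, with the time zone fixed to UTC (an assumption).
interv <- function(x, start, period, num.intervals) {
  cut(x, as.POSIXct(start, tz = "UTC") + 0:num.intervals * period)
}

# Two 15-minute intervals starting at 03:00 need three break points (0:2).
df$interval <- interv(df$time, "2010-01-13 03:00:00", 15 * 60, 2)

# Rows per interval (what a histogram would show) and a per-interval sum.
counts <- table(df$interval)
sums <- aggregate(value ~ interval, data = df, FUN = sum)
```

Any timestamp outside the break range becomes NA in the interval column, so num.intervals has to be large enough to cover the data.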

Comments (3)

安稳善良 2024-09-01 19:00:04

Standard functions to split vectors are cut and findInterval:

v <- as.POSIXct(c(
  "2010-01-13 03:02:38 UTC",
  "2010-01-13 03:08:14 UTC",
  "2010-01-13 03:14:52 UTC",
  "2010-01-13 03:20:42 UTC",
  "2010-01-13 03:22:19 UTC"
))

# Your function returns a list:
interv(v, as.POSIXlt("2010-01-13 03:00:00 UTC"), 900)
# [[1]]
# [1] "2010-01-13 03:00:00"
# [[2]]
# [1] "2010-01-13 03:00:00"
# [[3]]
# [1] "2010-01-13 03:00:00"
# [[4]]
# [1] "2010-01-13 03:15:00 CET"
# [[5]]
# [1] "2010-01-13 03:15:00 CET"

# cut returns a factor; you must provide proper breaks:
cut(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 2010-01-13 03:00:00 2010-01-13 03:00:00 2010-01-13 03:00:00
# [4] 2010-01-13 03:15:00 2010-01-13 03:15:00
# Levels: 2010-01-13 03:00:00 2010-01-13 03:15:00

# findInterval returns a vector of interval ids (same breaks as for cut):
findInterval(v, as.POSIXlt("2010-01-13 03:00:00 UTC")+0:2*900)
# [1] 1 1 1 2 2

For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument. The effect is:

cut(v,"15 min")
# [1] 2010-01-13 03:02:00 2010-01-13 03:02:00 2010-01-13 03:02:00
# [4] 2010-01-13 03:17:00 2010-01-13 03:17:00
# Levels: 2010-01-13 03:02:00 2010-01-13 03:17:00

As you can see, the intervals start at 03:02:00. You could adjust the labels of the output factor (convert the labels to times, round them somehow, and convert back to character).
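
One possible way to do that rounding, sketched here rather than taken from the original answer, is to skip the labels entirely and floor the underlying epoch seconds to a multiple of the period. This assumes UTC timestamps and a period that divides the hour evenly; bin.start and bins are hypothetical names.

```r
v <- as.POSIXct(c(
  "2010-01-13 03:02:38",
  "2010-01-13 03:08:14",
  "2010-01-13 03:14:52",
  "2010-01-13 03:20:42",
  "2010-01-13 03:22:19"
), tz = "UTC")

period <- 15 * 60  # interval length in seconds

# Floor each timestamp to the start of its 15-minute clock interval.
bin.start <- as.POSIXct(floor(as.numeric(v) / period) * period,
                        origin = "1970-01-01", tz = "UTC")

# A factor of interval start times, usable as a grouping variable.
bins <- factor(format(bin.start, "%Y-%m-%d %H:%M:%S"))
```

Because the bins are aligned to the epoch, they fall on clock boundaries (03:00, 03:15, ...) instead of starting at the first observation.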

从﹋此江山别 2024-09-01 19:00:04

Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.

The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf

Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
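
A minimal base R sketch of that truncate-then-aggregate idea, using trunc and aggregate instead of xts or zoo. The data frame, the value column, and the hourly interval are assumptions chosen for illustration.

```r
# Hypothetical data spanning two hours; `value` is made up.
df <- data.frame(
  time = as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:20:42",
                      "2010-01-13 04:05:00", "2010-01-13 04:59:59"),
                    tz = "UTC"),
  value = c(1, 2, 3, 4)
)

# Truncate every timestamp to the start of its hour.
# trunc returns POSIXlt, so convert back to POSIXct for use in a data frame.
df$hour <- as.POSIXct(trunc(df$time, units = "hours"))

# Aggregate with the truncated timestamps as the grouping vector.
hourly <- aggregate(value ~ hour, data = df, FUN = mean)
```

trunc only supports whole units such as "mins", "hours" and "days"; intervals like 15 minutes need explicit breaks or arithmetic on the epoch seconds.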

趁微风不噪 2024-09-01 19:00:04

This is an interesting question; with the proliferation of time series packages and methods, there ought to be an approach for binning irregular time series other than the brute-force one the OP suggests. Here is one "high-level" way to get the intervals, which you can then use with aggregate et al., using the version of cut defined for chron objects.

require(chron)
require(timeSeries)

my.times <- "
2010-01-13 03:02:38 UTC
2010-01-13 03:08:14 UTC
2010-01-13 03:14:52 UTC
2010-01-13 03:20:42 UTC
2010-01-13 03:22:19 UTC
"

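# Read the timestamps, one per line, into a single-column data frame (V1).
time.df <- read.delim(textConnection(my.times), header = FALSE, sep = "\n", strip.white = FALSE)
# Build 15-minute break points starting at the top of the hour of the first timestamp.
time.seq <- seq(trunc(timeDate(time.df[1, 1]), units = "hours"), by = 15 * 60, length = nrow(time.df))
# Convert both to chron and use cut to get each row's interval index.
intervals <- as.numeric(cut(as.chron(as.character(time.df$V1)), breaks = as.chron(as.character(time.seq))))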

You get

intervals  
[1] 1 1 1 2 2

which you can now append to the data frame and aggregate.

The coercion acrobatics above (from character to timeDate to character to chron) are a little unfortunate, so if there are cleaner solutions for binning irregular time data using xts or any of the other time series packages, I'd love to hear about them as well!

I am also curious to know what would be the most efficient approach for binning large high-frequency irregular time series, e.g. creating 1-minute volume bars on tick data for a very liquid stock.
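
As one possible starting point for that case, here is a vectorized base R sketch that floors epoch seconds to the minute and sums volume per minute with rowsum. It has not been benchmarked against xts or other packages, and the simulated ticks data frame and its column names are assumptions made up for illustration.

```r
set.seed(1)
n <- 1e5

# Simulated tick data: one hour of irregular timestamps with random volumes.
start <- as.POSIXct("2010-01-13 09:30:00", tz = "UTC")
ticks <- data.frame(
  time = start + sort(runif(n, 0, 60 * 60)),
  volume = sample(1:500, n, replace = TRUE)
)

# Floor each tick to the start of its minute (epoch seconds).
minute <- floor(as.numeric(ticks$time) / 60) * 60

# rowsum computes all group sums in a single vectorized pass.
bars <- rowsum(ticks$volume, group = minute)

# The row names of `bars` hold the minute start times as epoch seconds.
volume.bars <- data.frame(
  minute = as.POSIXct(as.numeric(rownames(bars)),
                      origin = "1970-01-01", tz = "UTC"),
  volume = bars[, 1]
)
```

Minutes with no ticks simply do not appear in volume.bars; if empty bars are needed, they would have to be merged in against a full minute sequence.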
