What is an efficient method for partitioning and aggregating intervals of timestamped rows in a data frame?
From a data frame with timestamped rows (strptime results), what is the best method for aggregating statistics for intervals?
Intervals could be an hour, a day, etc.
There's the aggregate function, but that doesn't help with assigning each row to an interval. I'm planning on adding a column to the data frame that denotes the interval and using that with aggregate, but if there's a better solution it'd be great to hear it.
Thanks for any pointers!
Example Data
Five rows with timestamps divided into 15-minute intervals starting at 03:00.
Interval 1
- "2010-01-13 03:02:38 UTC"
- "2010-01-13 03:08:14 UTC"
- "2010-01-13 03:14:52 UTC"
Interval 2
- "2010-01-13 03:20:42 UTC"
- "2010-01-13 03:22:19 UTC"
Conclusion
Using a time series package such as xts should be the solution; however, I had no success using them and wound up using cut. As I presently only need to plot histograms, with rows grouped by interval, this was enough. cut is used like so:
# Bin timestamps x into num.intervals intervals of width period (in seconds),
# starting at start; returns a factor with one level per interval.
interv <- function(x, start, period, num.intervals) {
  cut(x, as.POSIXlt(start) + 0:num.intervals * period)
}
3 Answers
Standard functions to split vectors are cut and findInterval. For the record: cut has a method for the POSIXt type, but unfortunately there is no way to provide a start argument; as a result the breaks begin at the earliest timestamp rather than at 03:00. As you can see, it starts at 03:02:00. You could mess with the labels of the output factor (convert the labels to times, round them somehow, and convert back to character).
Use a time series package. The xts package has functions designed specifically to do that. Or look at the aggregate and rollapply functions in the zoo package.
The rmetrics ebook has a useful discussion, including a performance comparison of the various packages: https://www.rmetrics.org/files/freepdf/TimeSeriesFAQ.pdf
Edit: Look at my answer to this question. Basically you need to truncate every timestamp into a specific interval and then do the aggregation using those new truncated timestamps as your grouping vector.
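For instance, a minimal base-R sketch of that truncation idea (the 15-minute width and the names are my choices, not taken from the linked answer):
x <- as.POSIXct(c("2010-01-13 03:02:38", "2010-01-13 03:08:14",
                  "2010-01-13 03:14:52", "2010-01-13 03:20:42",
                  "2010-01-13 03:22:19"), tz = "UTC")
# truncate each timestamp down to its 15-minute boundary (seconds since epoch)
bucket <- as.POSIXct(floor(as.numeric(x) / (15 * 60)) * (15 * 60),
                     origin = "1970-01-01", tz = "UTC")
tapply(1:5, bucket, sum)  # aggregate a value column, buckets as grouping vector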
This is an interesting question; with the proliferation of the various time series packages and methods, there ought to be an approach for binning irregular time series other than the brute force the OP suggests. Here is one "high-level" way to get the intervals, which you can then use with aggregate et al., using the version of cut defined for chron objects. You get a factor of interval labels, which you can now append to the data frame and aggregate over.
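A rough reconstruction of the idea (the original snippet was lost in this copy, and it apparently coerced character to timeDate to character to chron; here the timestamps are parsed with chron directly, and it is assumed that chron's cut method accepts an explicit vector of cut points the way cut.default does):
library(chron)
stamps <- chron(dates. = rep("01/13/10", 5),
                times. = c("03:02:38", "03:08:14", "03:14:52",
                           "03:20:42", "03:22:19"))
# chron stores times as fractions of a day, so 15 minutes is 15/(24 * 60);
# three cut points starting at 03:00 give two 15-minute intervals
brk <- chron(dates. = "01/13/10", times. = "03:00:00") + (0:2) * 15 / (24 * 60)
intervals <- cut(stamps, breaks = brk, labels = c("03:00", "03:15"))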
The coercion acrobatics above (from character to timeDate to character to chron) are a little unfortunate, so if there are cleaner solutions for binning irregular time data using xts or any of the other timeSeries packages, I'd love to hear about them as well!
I am also curious to know what would be the most efficient approach for binning large high-frequency irregular time series, e.g. creating 1-minute volume bars on tick data for a very liquid stock.
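For what it's worth, a sketch of that last case with xts, on made-up tick data (not a benchmarked answer):
library(xts)
ticks <- xts(c(100, 250, 75, 300),  # hypothetical trade volumes
             order.by = as.POSIXct(c("2010-01-13 03:00:05", "2010-01-13 03:00:42",
                                     "2010-01-13 03:01:10", "2010-01-13 03:01:55"),
                                   tz = "UTC"))
# endpoints() marks the last tick in each 1-minute window;
# period.apply() then sums the volume within each window
bars <- period.apply(ticks, endpoints(ticks, on = "minutes", k = 1), sum)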