从日期中最佳提取工作日/周/月,以实现内存紧凑?
执行摘要:*提高下面聚合 fn 调用中使用的基于日期的提取函数的内存效率;以免超出 1Gb 内存限制。 *。
我有一个大型数据集 tr 存储在数据框中(3 列,1200 万行;~200Mb)
这些列是customer_id(整数)、visit_date 和visit_spend(数字) 数据集需要注册,因此这是尽可能可重现的:
数据集如下所示(完整文件位于此处,需要注册):
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
#... 12146637 rows in total
日期范围受到限制2010-04-01 ... 2011-06-30 之间(整数为 14700..15155)
这里我问选择的最佳表示是什么visit_date 字段。我做了一些聚合调用(底部的示例代码),这些调用会耗尽内存。我还使用日期实用程序 fns ,类似于此处底部附加的内容(它们需要重新编码以保持紧凑性,但这些是我想做很多的典型操作)。所以我需要一个日期的表示来避免这种情况。
据我所知,我可以在 visit_date 字段中使用三种可能的表示形式,以下是我想要做的事情的优点和缺点。 我的目标是获得不会耗尽内存的格式,并且在这些日期处理操作、聚合等过程中给予最少的痛苦:
- 整数或因子
缺点:
1) 不允许比较或排序操作,因此聚合很痛苦。
2)我需要对所有与日期相关的函数进行硬编码(例如 2010 年 4 月的 14700..14729) - 可行但痛苦。
3)需要手动处理图表。 - 数字
缺点:由于到处都需要 asDate() ,所以会耗尽内存。 - 日期
优点:对于 print()、图表和直方图来说最易读;绘图前无需人工处理。
缺点:如果我应用任何字符串操作(格式)或聚合,则会崩溃(内存不足失败)。 我认为这就是 chron::Date,它就是您设置 class(tr$visit_date)<-'Date') 或使用 read.csv(colClasses=c(.. .,"Date",...)
这些是我想要运行很多的日期实用程序 fns(但目前它们在聚合期间爆炸):
# Utility fns related to date
library(chron)
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
with( month.day.year(ddd), day.of.week(month,day,year) )
#do.call( "day.of.week", month.day.year(x) )
# as.numeric(x-3)%%7
}
weekofyear <- function(dt) {
as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
as.numeric(format(as.Date(dt), "%m"))
}
# Append a dayoftheweek-as-string field
append_dotwa_column <- function(x,from_date_fld) {
x$dotwa <- format(x$from_date_fld,"%a")
}
这里只是一个聚合( ) 调用失败内存不足:(
agg_dly_spend <- aggregate(tr$visit_spend,
list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend,
list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
这些aggregate()调用应该占用多少内存? 如果我错了,请纠正我,但混合类型使得使用 bigmemory
变得困难。所以我可能不得不使用 SQL,但这是一个很大的损失 - 我失去了 R 非常好的按日期子集: tr[tr$visit_date > "2010-09-01",]
)
(平台为 R 2.13.1、Windows Vista 32b,因此总体进程限制为 2Gb,这意味着任何数据帧不应超过 ~600-900Mb)
Executive summary: *Improve the memory efficiency of the date-based extraction functions used in aggregate fn calls below; to not blow out the 1Gb memory limit. *.
I have a large dataset tr stored in a data-frame (3 cols, 12 million rows; ~200Mb)
The columns are customer_id (integer), visit_date and visit_spend(numeric) The dataset requires registration, so this is as reproducible as it can be:
Dataset looks like (full file is here, requires registration):
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
#... 12146637 rows in total
The date range is restricted between 2010-04-01 ... 2011-06-30 (14700..15155 in integer)
Here I'm asking what is the optimal representation to choose for the visit_date field. I do some aggregate
calls (example code at bottom) that blow up the memory. I also use date utility fns something like what is attached at bottom here (they will need recoding for compactness, but these are the typical operations I want to do a lot of). So I need a representation for the date that avoids this.
As I see it there are three possible representations I could use for visit_date field, here are the pros and cons, wrt what I am trying to do.
My aim is to get the format which does not blow up memory, and gives the least grief during these date handling operations, aggregate etc.:
- integer or factor
Cons:
1) doesn't allow comparison or sort operations, hence aggregation is painful.
2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010) - doable but painful.
3) Needs manual handling for graphs. - numeric
Cons: blows up memory due to requiring asDate() everywhere. - Date
Pros: most readable for print(), graphs and histograms; does not need manual handling treatment before graphing.
Cons: blows up (out-of-memory fail) if I apply any string operations (format), or aggregate.
I think that's chron::Date, it's whatever you get when you setclass(tr$visit_date)<-'Date')
or useread.csv(colClasses=c(...,"Date",...)
These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):
# Utility fns related to date
library(chron)
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
with( month.day.year(ddd), day.of.week(month,day,year) )
#do.call( "day.of.week", month.day.year(x) )
# as.numeric(x-3)%%7
}
weekofyear <- function(dt) {
as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
as.numeric(format(as.Date(dt), "%m"))
}
# Append a dayoftheweek-as-string field
append_dotwa_column <- function(x,from_date_fld) {
x$dotwa <- format(x$from_date_fld,"%a")
}
and here's just one aggregate() call that fails out-of-memory:
agg_dly_spend <- aggregate(tr$visit_spend,
list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend,
list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
(How much memory should those aggregate() calls take?
Correct me if I'm wrong but the mixed-types make it hard to use bigmemory
. So I may have to go to SQL, but that's a big loss - I lose R's really nice subsetting-by-date: tr[tr$visit_date > "2010-09-01",]
)
(Platform is R 2.13.1, Windows Vista 32b so there is a 2Gb overall process limit, which means any data-frame should not exceed ~600-900Mb)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
编辑:我复制的代码不是最终的函数,因此其中存在错误。错误现已修复。
我不完全同意结束投票,但您的问题确实需要一些阅读。据我了解,问题在于日期的表示。数字只是一个疯狂的想法,在这种情况下使用整数。正如不同格式及其相对空间的概述(使用 这个问题:)
内部
Date
可以与数字表示很好地竞争。character
会导致所有其他功能出现问题,所以别管这个了。我只使用Date
,这样就可以并保持功能正常。注意 POSIXlt 的大小:所有提取月、周、一年中的某一天等的函数都采用该格式。对于format()
以及base 中的函数
或weekdays()
、months()
... 来说都是如此。chron
包。一些评论:
memory.limit(3000)
:请参阅?memory.limit
。继续你的代码。我使用
Date
格式,它的大小与数字格式大致相同。让我们尝试使用以下包含 1460 万行的数据(您可以提供...)。我运行的是Windows7(64位),总共有4Gb内存。首先是您的 weekofyear 函数。如前所述,
format
函数使用底层 POSIXlt 格式,如图所示,该格式非常占用内存。尽管如此,您仍然可以通过直接访问来削减大约一半的内存(请参阅?POSIXlt
)。它返回整数,大约占用您返回的数字内存的一半:如果您需要更少的内存,则必须自己进行数学计算。但我建议你不要尝试这样做,而且绝对不要基于角色表示。像
strsplit()
和substr()
这样的字符串操作肯定会耗尽你的内存。就像chron
包的month.day.year()
函数一样。 远离大数据chron
。事实上,无论 POSIXlt 对象需要多大的空间,使用 POSIXlt 仍然是提取内存方面的最佳选择。关于
聚合
。这是针对数据帧的,因此聚合调用会再次生成大量数据副本。多手动拨打电话,可以再次节省副本。功能建议:现在,如果我们应用此功能并观察内存使用情况:
,我会得到以下结果:
红色方块是您的聚合呼叫,黄色方块是我的建议。聚合调用的内存使用量的第一个增加是您使用的
weekofyear
函数。我的建议既节省了weekofyear
和aggregate
调用的内存使用量,而且运行速度也快了很多。使用我的建议时,我从未获得超过 2.6Gb 的总内存。希望这有帮助。
EDIT : The code I copied were not the final functions, so there were bugs in it. Bugs now fixed.
I do not completely agree with the votes to close, but your question does need some reading. As I understood it, the problem is the representation of the date. Numeric is just a crazy idea, use integer in that case. Just as an overview of the different formats and their relative space (using the lsos function from this question:)
The internal
Date
can compete pretty well with the numeric representation.character
causes trouble with all the rest of the functionality, so forget about that one. I just useDate
, that'll do and keeps the functionality OK. Pay attention to the size ofPOSIXlt
: All functions for extraction of months, weeks, day of the year etc. go over that format. That's true forformat()
, and for the functionsweekdays()
,months()
, ... in either thebase
or thechron
package.Some remarks :
memory.limit(3000)
: See?memory.limit
.On to your code. I work on with the
Date
format, which is about the same size as the numeric format. Let's try it with following data (which you could have provided...) with 14.6 million rows. I run a Windows7 (64bit) with 4Gb memory in total.First your weekofyear function. As said, The
format
function uses the underlying POSIXlt format, which is, as shown, memory-intensive. Still, you can cut out about half of the memory by just accessing it directly (see?POSIXlt
). It returns integers, which take about half the memory of the numerics you return :If you need even less, you'll have to do the math yourself. But I advise you not to try that out, and definitely not based on character representation. string operations like
strsplit()
andsubstr()
will blow up your memory for sure. As does themonth.day.year()
function of thechron
package. Stay away fromchron
with big data. In fact, regardless of the huge space the POSIXlt objects need, using POSIXlt is still the best option memory-wise for extraction.On to the
aggregate
. This is meant for dataframes, and henceaggregate
call makes again a lot of copies of the data. Doing the call more manually, can save again on the copies. A proposal for a function :Now if we apply this and we watch the memory usage :
, I get the following result :
The red square is your aggregate call, the yellow square is my proposal. The first bump in the memory usage of you aggregate call is the
weekofyear
function you use. My proposal saves both on the memory usage ofweekofyear
and of theaggregate
call, and runs quite a bit faster too. I never got over 2.6Gb total memory using my proposal.Hope this helps.