从日期中最佳提取工作日/周/月，以实现内存紧凑？

发布于 2024-11-28 10:19:24 字数 2548 浏览 2 评论 0原文

执行摘要：*提高下面聚合 fn 调用中使用的基于日期的提取函数的内存效率；以免超出 1Gb 内存限制。 *。

我有一个大型数据集 tr 存储在数据框中（3 列，1200 万行；~200Mb）
这些列是customer_id（整数）、visit_date 和visit_spend（数字） 数据集需要注册，因此这是尽可能可重现的：
数据集如下所示（完整文件位于此处，需要注册）：

customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52
#... 12146637 rows in total

日期范围受到限制2010-04-01 ... 2011-06-30 之间（整数为 14700..15155）

这里我问选择的最佳表示是什么visit_date 字段。我做了一些聚合调用（底部的示例代码），这些调用会耗尽内存。我还使用日期实用程序 fns ，类似于此处底部附加的内容（它们需要重新编码以保持紧凑性，但这些是我想做很多的典型操作）。所以我需要一个日期的表示来避免这种情况。

据我所知，我可以在 visit_date 字段中使用三种可能的表示形式，以下是我想要做的事情的优点和缺点。我的目标是获得不会耗尽内存的格式，并且在这些日期处理操作、聚合等过程中给予最少的痛苦：

整数或因子
缺点：
1) 不允许比较或排序操作，因此聚合很痛苦。
2）我需要对所有与日期相关的函数进行硬编码（例如 2010 年 4 月的 14700..14729） - 可行但痛苦。
3）需要手动处理图表。
数字
缺点：由于到处都需要 asDate() ，所以会耗尽内存。
日期
优点：对于 print()、图表和直方图来说最易读；绘图前无需人工处理。
缺点：如果我应用任何字符串操作（格式）或聚合，则会崩溃（内存不足失败）。我认为这就是 chron::Date，它就是您设置 class(tr$visit_date)<-'Date') 或使用 read.csv(colClasses=c(.. .,"Date",...)

这些是我想要运行很多的日期实用程序 fns（但目前它们在聚合期间爆炸）：

# Utility fns related to date    
library(chron)    
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
    with( month.day.year(ddd), day.of.week(month,day,year) )
    #do.call( "day.of.week", month.day.year(x) )
    # as.numeric(x-3)%%7
}    
weekofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%m"))
}    
# Append a dayoftheweek-as-string field
append_dotwa_column <- function(x,from_date_fld) {
    x$dotwa <- format(x$from_date_fld,"%a")
}

这里只是一个聚合（ ) 调用失败内存不足：（

agg_dly_spend  <- aggregate(tr$visit_spend, 
        list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend, 
        list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")

这些aggregate()调用应该占用多少内存？如果我错了，请纠正我，但混合类型使得使用 bigmemory 变得困难。所以我可能不得不使用 SQL，但这是一个很大的损失 - 我失去了 R 非常好的按日期子集： tr[tr$visit_date > "2010-09-01",])

（平台为 R 2.13.1、Windows Vista 32b，因此总体进程限制为 2Gb，这意味着任何数据帧不应超过 ~600-900Mb）

原文

Executive summary: *Improve the memory efficiency of the date-based extraction functions used in aggregate fn calls below; to not blow out the 1Gb memory limit. *.

I have a large dataset tr stored in a data-frame (3 cols, 12 million rows; ~200Mb)
The columns are customer_id (integer), visit_date and visit_spend(numeric) The dataset requires registration, so this is as reproducible as it can be:
Dataset looks like (full file is here, requires registration):

customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52
#... 12146637 rows in total

The date range is restricted between 2010-04-01 ... 2011-06-30 (14700..15155 in integer)

Here I'm asking what is the optimal representation to choose for the visit_date field. I do some aggregate calls (example code at bottom) that blow up the memory. I also use date utility fns something like what is attached at bottom here (they will need recoding for compactness, but these are the typical operations I want to do a lot of). So I need a representation for the date that avoids this.

As I see it there are three possible representations I could use for visit_date field, here are the pros and cons, wrt what I am trying to do.
My aim is to get the format which does not blow up memory, and gives the least grief during these date handling operations, aggregate etc.:

integer or factor
Cons:
1) doesn't allow comparison or sort operations, hence aggregation is painful.
2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010) - doable but painful.
3) Needs manual handling for graphs.
numeric
Cons: blows up memory due to requiring asDate() everywhere.
Date
Pros: most readable for print(), graphs and histograms; does not need manual handling treatment before graphing.
Cons: blows up (out-of-memory fail) if I apply any string operations (format), or aggregate.
I think that's chron::Date, it's whatever you get when you set class(tr$visit_date)<-'Date') or use read.csv(colClasses=c(...,"Date",...)

These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):

# Utility fns related to date    
library(chron)    
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
    with( month.day.year(ddd), day.of.week(month,day,year) )
    #do.call( "day.of.week", month.day.year(x) )
    # as.numeric(x-3)%%7
}    
weekofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%m"))
}    
# Append a dayoftheweek-as-string field
append_dotwa_column <- function(x,from_date_fld) {
    x$dotwa <- format(x$from_date_fld,"%a")
}

and here's just one aggregate() call that fails out-of-memory:

agg_dly_spend  <- aggregate(tr$visit_spend, 
        list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend, 
        list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")

(How much memory should those aggregate() calls take?
Correct me if I'm wrong but the mixed-types make it hard to use bigmemory. So I may have to go to SQL, but that's a big loss - I lose R's really nice subsetting-by-date: tr[tr$visit_date > "2010-09-01",])

(Platform is R 2.13.1, Windows Vista 32b so there is a 2Gb overall process limit, which means any data-frame should not exceed ~600-900Mb)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

沉默的熊 2024-12-05 10:19:24

编辑：我复制的代码不是最终的函数，因此其中存在错误。错误现已修复。

我不完全同意结束投票，但您的问题确实需要一些阅读。据我了解，问题在于日期的表示。数字只是一个疯狂的想法，在这种情况下使用整数。正如不同格式及其相对空间的概述（使用这个问题:)

Dates <- rep(seq.Date(as.Date("2010-04-01"),
             as.Date("2011-04-01"),by='day'),each=100)

chardates <- as.character(Dates)
POSIXctDates <- as.POSIXct(Dates)
POSIXltDates <- as.POSIXlt(Dates)
integerDates <- as.integer(Dates)
chronDates <- as.chron(Dates)
numericDates <- as.numeric(Dates)

 > lsos()
                  Type    Size PrettySize  Rows Columns
POSIXltDates   POSIXlt 1464976     1.4 Mb 36600      NA
chronDates       dates  293400   286.5 Kb 36600      NA
POSIXctDates   POSIXct  292976   286.1 Kb 36600      NA
Dates             Date  292944   286.1 Kb 36600      NA
numericDates   numeric  292824     286 Kb 36600      NA
charDates    character  161064   157.3 Kb 36600      NA
integerDates   integer  146424     143 Kb 36600      NA

内部Date 可以与数字表示很好地竞争。 character 会导致所有其他功能出现问题，所以别管这个了。我只使用 Date，这样就可以并保持功能正常。注意 POSIXlt 的大小：所有提取月、周、一年中的某一天等的函数都采用该格式。对于 format() 以及 base 中的函数 weekdays()、months()... 来说都是如此。 或 chron 包。

一些评论：

Vista 在具有 4Gb 的机器上理论上有 3Gb 的限制。您是否尝试过 memory.limit(3000) ：请参阅 ?memory.limit。
如果您关心适当的内存管理，请尽快更新到 Windows 7。 Vista 简直糟透了。很开心。如果可能的话，如果您使用该数据，请购买 64 位机器。
偶尔清理一下你的工作空间。我可以在 R 32 位上毫无问题地运行您的代码，并在下面给出的数据帧上使用 2Gb 内存限制（14+ 百万行）。

继续你的代码。我使用 Date 格式，它的大小与数字格式大致相同。让我们尝试使用以下包含 1460 万行的数据（您可以提供...）。我运行的是Windows7（64位），总共有4Gb内存。

n <- 40000
tr <- data.frame(
  customer_id = rep(1:366,n),
  visit_date = rep(seq.Date(as.Date("2010-04-01"),
                     as.Date("2011-04-01"),by='day'),each=n),
  visit_spend =runif(366*n,3,12)
 )

首先是您的 weekofyear 函数。如前所述，format 函数使用底层 POSIXlt 格式，如图所示，该格式非常占用内存。尽管如此，您仍然可以通过直接访问来削减大约一半的内存（请参阅 ?POSIXlt）。它返回整数，大约占用您返回的数字内存的一半：

dayofweek2 <- function(dt) {
  as.POSIXlt(dt)$wday
}

monthofyear2 <- function(dt){
  as.POSIXlt(dt)$mon +1L
}

weekofyear2 <- function(dt) {
  n <- length(dt)
  lt <- as.POSIXlt(dt)
  yearstart <- c(0,which(diff(lt$year)!=0))
  fday <- rep(lt[yearstart+1L]$wday,
              diff(c(yearstart,n)+1L)
          )
  as.integer((lt$yday+fday-1L)%/%7)
}

> lsos()
                   Type    Size PrettySize  Rows Columns
tr           data.frame  733128   715.9 Kb 36600       3
Dates              Date  293048   286.2 Kb 36600      NA
x1              numeric  292840     286 Kb 36600      NA # from your function
x2              integer  146440     143 Kb 36600      NA # from my function

如果您需要更少的内存，则必须自己进行数学计算。但我建议你不要尝试这样做，而且绝对不要基于角色表示。像 strsplit() 和 substr() 这样的字符串操作肯定会耗尽你的内存。就像 chron 包的 month.day.year() 函数一样。 远离大数据chron。事实上，无论 POSIXlt 对象需要多大的空间，使用 POSIXlt 仍然是提取内存方面的最佳选择。

关于聚合。这是针对数据帧的，因此聚合调用会再次生成大量数据副本。多手动拨打电话，可以再次节省副本。功能建议：

my.agg <- function(x,d,AGGFUN=NULL,FUN="sum"){
  FUN <- match.fun(FUN)
  if(!is.null(AGGFUN)) {
    AGGFUN <- match.fun(AGGFUN)
    d <- AGGFUN(d)
  }
  ud <- sort(unique(d))

  xout <- sapply(ud,function(i){
    id <- which(d==i)  # find which values respond to a certain value of d
    FUN(x[id])         # apply the function to only those values
  },USE.NAMES=FALSE)

  data.frame(d=ud,x=xout)    
}

现在，如果我们应用此功能并观察内存使用情况：

gc(TRUE,reset=TRUE)
x1 <-  aggregate(tr$visit_spend, 
           list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
rm(x1)
gc(TRUE,reset=TRUE)
Sys.sleep(5)
x2 <- my.agg(tr$visit_spend,tr$visit_date,weekofyear2,sum)    
gc(TRUE,reset=TRUE)

，我会得到以下结果：

红色方块是您的聚合呼叫，黄色方块是我的建议。聚合调用的内存使用量的第一个增加是您使用的 weekofyear 函数。我的建议既节省了 weekofyear 和 aggregate 调用的内存使用量，而且运行速度也快了很多。使用我的建议时，我从未获得超过 2.6Gb 的总内存。

希望这有帮助。

EDIT : The code I copied were not the final functions, so there were bugs in it. Bugs now fixed.

I do not completely agree with the votes to close, but your question does need some reading. As I understood it, the problem is the representation of the date. Numeric is just a crazy idea, use integer in that case. Just as an overview of the different formats and their relative space (using the lsos function from this question:)

Dates <- rep(seq.Date(as.Date("2010-04-01"),
             as.Date("2011-04-01"),by='day'),each=100)

chardates <- as.character(Dates)
POSIXctDates <- as.POSIXct(Dates)
POSIXltDates <- as.POSIXlt(Dates)
integerDates <- as.integer(Dates)
chronDates <- as.chron(Dates)
numericDates <- as.numeric(Dates)

 > lsos()
                  Type    Size PrettySize  Rows Columns
POSIXltDates   POSIXlt 1464976     1.4 Mb 36600      NA
chronDates       dates  293400   286.5 Kb 36600      NA
POSIXctDates   POSIXct  292976   286.1 Kb 36600      NA
Dates             Date  292944   286.1 Kb 36600      NA
numericDates   numeric  292824     286 Kb 36600      NA
charDates    character  161064   157.3 Kb 36600      NA
integerDates   integer  146424     143 Kb 36600      NA

The internal Date can compete pretty well with the numeric representation. character causes trouble with all the rest of the functionality, so forget about that one. I just use Date, that'll do and keeps the functionality OK. Pay attention to the size of POSIXlt: All functions for extraction of months, weeks, day of the year etc. go over that format. That's true for format(), and for the functions weekdays(), months(), ... in either the base or the chron package.

Some remarks :

Vista has a theoretical 3Gb limit on a machine with 4Gb. did you try memory.limit(3000) : See ?memory.limit.
Please update asap to Windows 7 if you care anything at all about decent memory management. Vista just sucks. Big time. If possible, get a 64bit machine if you work with that data.
Clean up your workspace once in a while. I can run your code without problem on R 32bit with a 2Gb memory limit on the dataframe given below (14+ million rows).

On to your code. I work on with the Date format, which is about the same size as the numeric format. Let's try it with following data (which you could have provided...) with 14.6 million rows. I run a Windows7 (64bit) with 4Gb memory in total.

n <- 40000
tr <- data.frame(
  customer_id = rep(1:366,n),
  visit_date = rep(seq.Date(as.Date("2010-04-01"),
                     as.Date("2011-04-01"),by='day'),each=n),
  visit_spend =runif(366*n,3,12)
 )

First your weekofyear function. As said, The format function uses the underlying POSIXlt format, which is, as shown, memory-intensive. Still, you can cut out about half of the memory by just accessing it directly (see ?POSIXlt). It returns integers, which take about half the memory of the numerics you return :

dayofweek2 <- function(dt) {
  as.POSIXlt(dt)$wday
}

monthofyear2 <- function(dt){
  as.POSIXlt(dt)$mon +1L
}

weekofyear2 <- function(dt) {
  n <- length(dt)
  lt <- as.POSIXlt(dt)
  yearstart <- c(0,which(diff(lt$year)!=0))
  fday <- rep(lt[yearstart+1L]$wday,
              diff(c(yearstart,n)+1L)
          )
  as.integer((lt$yday+fday-1L)%/%7)
}

> lsos()
                   Type    Size PrettySize  Rows Columns
tr           data.frame  733128   715.9 Kb 36600       3
Dates              Date  293048   286.2 Kb 36600      NA
x1              numeric  292840     286 Kb 36600      NA # from your function
x2              integer  146440     143 Kb 36600      NA # from my function

If you need even less, you'll have to do the math yourself. But I advise you not to try that out, and definitely not based on character representation. string operations like strsplit() and substr() will blow up your memory for sure. As does the month.day.year() function of the chron package. Stay away from chron with big data. In fact, regardless of the huge space the POSIXlt objects need, using POSIXlt is still the best option memory-wise for extraction.

On to the aggregate. This is meant for dataframes, and hence aggregate call makes again a lot of copies of the data. Doing the call more manually, can save again on the copies. A proposal for a function :

my.agg <- function(x,d,AGGFUN=NULL,FUN="sum"){
  FUN <- match.fun(FUN)
  if(!is.null(AGGFUN)) {
    AGGFUN <- match.fun(AGGFUN)
    d <- AGGFUN(d)
  }
  ud <- sort(unique(d))

  xout <- sapply(ud,function(i){
    id <- which(d==i)  # find which values respond to a certain value of d
    FUN(x[id])         # apply the function to only those values
  },USE.NAMES=FALSE)

  data.frame(d=ud,x=xout)    
}

Now if we apply this and we watch the memory usage :

gc(TRUE,reset=TRUE)
x1 <-  aggregate(tr$visit_spend, 
           list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
rm(x1)
gc(TRUE,reset=TRUE)
Sys.sleep(5)
x2 <- my.agg(tr$visit_spend,tr$visit_date,weekofyear2,sum)    
gc(TRUE,reset=TRUE)

, I get the following result :

enter image description here

The red square is your aggregate call, the yellow square is my proposal. The first bump in the memory usage of you aggregate call is the weekofyear function you use. My proposal saves both on the memory usage of weekofyear and of the aggregate call, and runs quite a bit faster too. I never got over 2.6Gb total memory using my proposal.

Hope this helps.

回复收藏 0 原文

~没有更多了~