How to vectorize and speed up strtime() log-time conversion on a data frame

Posted 2024-12-22 23:47:46


(EDIT: One of the issues here is scale: what works for one row will blow up or crash R on a 200,000 * 50 data frame. For example, strptime must be applied column-wise, not row-wise, to avoid hanging.
I'm looking for working code that you have actually run on 200,000 * 50, with your measured runtime, not casual "this is easy" remarks. It's easy to get runtimes over 12 hours if you pick the wrong function. I also asked for my zero-time adjustment code to be made faster; the job isn't finished until that's done, and no one has attempted it so far.)

I want to vectorize and accelerate the following multistep log-time conversion, with millisecond accuracy: converting each time string via strtime() to a single numeric, then a subtraction, then log(), on a large data frame (200,000 rows * 300 columns; the other, non-time columns are omitted here).
Code is below.
As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as a list from strtime, a matrix, or a vector). I have already tried apply, sapply, lapply, vapply, plyr::maply(), ..., but incompatible intermediate formats keep tripping me up.

Each row has 50 columns time1..time50 (chr, format "HH:MM:SS.sss") holding times as strings with millisecond resolution. I need millisecond accuracy.
Within each row, columns time1..time50 are in non-decreasing order, and I want to convert each one into the log of the time remaining before time50. The conversion function parse_hhmmsecms() is at the bottom; it needs serious vectorization and speeding up, and you can see alternative versions commented out. What I have figured out so far: strtime() is faster than (multiple) substr() calls; I then somehow convert to a list of three numerics (hh, mm, sec.ms), then to a vector, assuming the next step should be a vector multiplication with %*% c(3600,60,1) to get numeric seconds.
Here is pseudocode for what I do for each row and each time string; the full code is at the bottom:

 for each row in dataframe { # vectorize this, loop_apply(), or whatever...
 #for each time-column index i ('time1'..'time50') { # vectorize this...
 hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
 # Main computation
 xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
 # Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
 #}
 }

So there are five subproblems involved:

1. How to vectorize handling the list returned by strtime()? Since it returns a list of 3 items, when passed a 2D data frame or a 1D row of time strings we get a 3D or 2D intermediate object. (Do we internally use a list of lists? A matrix of lists? An array of lists?)
2. How to vectorize the entire function parse_hhmmsecms()?
3. Then do the subtraction and log.
4. Vectorize the zero-time fixup code as well (this is now the slowest part by far).
5. How to accelerate steps 1...4?
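For the parsing, subtraction and log steps, one possible shape is sketched below. It is an illustration only, not a measured solution: the helper name parse_hhmmsecms_vec, the three-column sample data and the round() call are assumptions added here, and nothing about runtime on 200,000 * 50 is claimed. The idea is to parse every string in a single call and keep the intermediate values in a plain numeric matrix instead of a list.

```r
# Hypothetical helper (not from the post): parse a character vector of
# "HH:MM:SS.sss" strings into numeric seconds since midnight.
parse_hhmmsecms_vec <- function(t) {
  t <- as.character(t)
  as.numeric(substr(t, 1, 2)) * 3600 +
    as.numeric(substr(t, 4, 5)) * 60 +
    as.numeric(substr(t, 7, 12))
}

# Made-up sample shaped like the post's time columns (last column = reference).
times <- data.frame(
  time48 = c("08:00:48.370", "08:00:50.759"),
  time49 = c("08:00:50.573", "08:00:50.759"),
  time50 = c("08:00:50.573", "08:00:54.452"),
  stringsAsFactors = FALSE
)

# One parse call over every string; unlist() walks the data frame column by
# column, which matches matrix()'s column-major fill.
secs <- matrix(parse_hhmmsecms_vec(unlist(times, use.names = FALSE)),
               nrow = nrow(times), dimnames = list(NULL, names(times)))

# Milliseconds remaining before the last column. The length-nrow vector is
# recycled down each column, so row i is compared with its own reference.
# round() is an added assumption to strip floating-point noise.
ms_before_last <- round(1000 * (secs[, ncol(secs)] - secs))

# Same transform as the post; zero differences become Inf (-10 * -Inf).
logtime <- -10 * log10(ms_before_last)
```

Because every step works on the whole matrix, there is no per-row loop and no list-valued intermediate to reshape.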

The code snippet below uses ten example columns, time41..time50 (use random_hhmmsecms() if you want a bigger sample).

I did my best to follow these recommendations; this is as reproducible as I could get it in six hours' work:

# Each of 200,000 rows has 50 time strings (chr) like this...    
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"), 
    time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
    ), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
    ), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
    ), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
    ), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
    ), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
    ), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    ), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    )), .Names = c("time41", "time42", "time43", "time44", "time45", 
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")

# Handle millisecond timing and time conversion
options('digits.secs'=3)

# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
  as.numeric(substr(t,1,2))*3600 + as.numeric(substr(t,4,5))*60 + as.numeric(substr(t,7,12)) # WORKS, V SLOW

  #c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR

  #as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}

random_hhmmsecms <- function(n=1, min=8*3600, max=16*3600) {
  # Generate n random "HH:MM:SS.sss" strings between min and max (8am..4pm)
  x <- runif(n,min,max)
  ss <- x %%  60
  mm <- (x %/% 60) %% 60
  hh <- x %/% 3600
  # Width 6 = two integer digits + '.' + three decimals, so seconds < 10 stay zero-padded
  sprintf("%02d:%02d:%06.3f", hh,mm,ss)
}

xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42  <- xx$logtime41  <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47  <- xx$logtime46  <- NA

# (we pass index vectors as the dataframe column ordering may change) 
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in seq_len(nrow(xx))) {
  #if (i%%100==0) { print(paste('... row',i)) }

  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))

  # Now fix up all the 'zero-time' events to be evenly spaced between -3..0
  # (Czerotime.p indexes into Clogtime, so map it back to data-frame columns)
  Czerotime.p <- which(xx[i,Clogtime]==Inf | xx[i,Clogtime]>-1e-9)
  xx[i,Clogtime[Czerotime.p]] <- seq(-3,0,length.out=length(Czerotime.p))
}
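The zero-time fixup can also be expressed without a per-row loop. The sketch below is one possible approach under stated assumptions, not a benchmarked answer: fix_zero_times and its arguments are hypothetical names, and the input is assumed to be a numeric matrix of log-times (rows = events, columns in time order). It loops over the columns (50 in the post) rather than the 200,000 rows, counting zero-time entries per row and computing each entry's position in the -3..0 sequence arithmetically.

```r
# Hypothetical helper: replace each row's zero-time entries (Inf or > tol)
# with values evenly spaced from lo to hi, in column order.
fix_zero_times <- function(logtime, lo = -3, hi = 0, tol = -1e-9) {
  logtime <- as.matrix(logtime)
  zero <- !is.na(logtime) & logtime > tol   # +Inf also satisfies this test
  k <- rowSums(zero)                        # zero-time entries per row

  # pos[i, j] = running count of zero-time entries in row i up to column j
  pos <- matrix(0, nrow(zero), ncol(zero))
  running <- numeric(nrow(zero))
  for (j in seq_len(ncol(zero))) {
    running <- running + zero[, j]
    pos[, j] <- running
  }

  # Spacing per row; a single zero-time entry gets lo, like seq(lo, hi, length.out = 1)
  step <- ifelse(k > 1, (hi - lo) / (k - 1), 0)
  filled <- lo + (pos - 1) * step           # step is recycled down columns, one value per row
  logtime[zero] <- filled[zero]
  logtime
}

# Made-up log-time values for illustration
lt <- rbind(c(-33.4, Inf, Inf),
            c(-35.7, -35.7, Inf),
            c(Inf, Inf, Inf))
fixed <- fix_zero_times(lt)
```

Wrapping the call in system.time() on data generated with random_hhmmsecms() would give a runtime for a full-size sample.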


顾铮苏瑾 2024-12-29 23:47:46


You may be overcomplicating things.

Start with the base classes, which handle milliseconds very well (and, on appropriate operating systems, even microseconds), but note that

you need to set options("digits.secs"=6) (R displays at most six fractional digits) to see them displayed
you need an additional parsing character for strptime et al.

All of this is in the docs, and there are countless examples here on SO.

Quick examples:

R> someTime <- ISOdatetime(2011, 12, 27, 2, 3, 4.567)
R> someTime
[1] "2011-12-27 02:03:04.567 CST"
R> now <- Sys.time()
R> now
[1] "2011-12-27 16:48:20.247298 CST"      # microsecond display on Linux
R> 
R> txt <- "2001-02-03 04:05:06.789123"
R> strptime(txt, "%Y-%m-%d %H:%M:%OS")    # note the %OS for sub-seconds
[1] "2001-02-03 04:05:06.789123"
R>

And key functions such as strptime or as.POSIXct are all vectorised and you can throw entire columns at them.
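Applied to the question's "HH:MM:SS.sss" strings, that could look like the sketch below. The sample values and variable names are illustrative assumptions; the point is only that strptime takes a whole character vector at once and that the fields of the resulting POSIXlt object keep the fractional seconds.

```r
options(digits.secs = 3)

# Illustrative column of time-of-day strings
times <- c("08:00:41.465", "08:00:50.573", "08:00:54.452")

# One vectorised call; %OS reads seconds including the fractional part.
# With no date in the format, strptime fills in the current date, so only
# the time-of-day fields are used below.
lt <- strptime(times, format = "%H:%M:%OS", tz = "UTC")

# Seconds since midnight, computed from the POSIXlt fields
secs <- lt$hour * 3600 + lt$min * 60 + lt$sec
```

For a data frame, the same two lines can be applied once per column (for example with lapply), rather than once per row.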
