如何使 R 矩阵填充函数更快?
不久前,我编写了一个函数来填充时间序列矩阵,该矩阵的 NA 值根据所需的规格而增加,并且它偶尔会在一些大约 50000 行、350 列的矩阵上使用。矩阵可以包含数字或字符值。主要问题是修复矩阵很慢,我想我应该向一些专家咨询如何更快地完成此操作。
我想 rcpp 或并行它可能会有所帮助,但我认为这可能是我的设计而不是 R 本身效率低下。我通常对 R 中的所有内容进行 vecotrize,但由于缺失值不遵循任何模式,因此除了按行处理矩阵之外,我没有找到其他方法。
需要调用该函数,以便它可以继承缺失值,也可以调用该函数以快速用最后一个已知值填充最新值。
这是一个示例矩阵:
testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA,
4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA,
NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA,
66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA,
-12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA,
NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA,
4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L,
9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e",
"ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31",
"2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31",
"2011-04-30", "2011-05-31")))
print(testMatrix)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
这是我当前使用的函数:
# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax --- The max number of columns to carry a number
# forward if there are no more values in the
# time series row.
#
# Returns:
# A matrix with gaps filled.
GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {
if("DEBUG_ON" %in% ls(globalenv())){browser()}
cntRow <- nrow(inputMatrix)
cntCol <- ncol(inputMatrix)
#
if (forwardLooking) {
for (i in 1:cntRow) {
# Store the location of the first non NA element in the row
firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
if (!(is.na(firstValueCol))) {
if (!(firstValueCol == cntCol)) {
nextValueCol <- firstValueCol
# If there is a a value number in the row and it's not at the end of the time
# series, start iterating through the row while there are more NA values and
# more data values and not at the end of the row continue.
while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
# Find the next NA element
nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
# Find the next value element
nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
# If there is another value element then fill up all NA elements in between with the last known value
if (!is.na(nextValueCol)) {
inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
} else {
# If there is no other value element then fill up all NA elements up to the max number supplied
# with the last known value unless it's close to the end of the row then just fill up to the end.
inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
nextValueCol <- cntCol
}
}
}
}
}
} else {
for (i in 1:cntRow) {
if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
if (length(tempRow[!is.na(tempRow)])>0) {
lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
}
}
}
}
return(inputMatrix)
}
然后我用类似以下内容调用它:
> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
或
> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39
这个示例运行得很快,但是有什么方法可以使其对于大型矩阵更快?
> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
86.47 0.06 87.24
这个虚拟行有很多仅 NA 行和完全填充的行,但普通行可能需要大约 15-20 分钟。
更新
关于 Charles 对 na.locf 的评论并不完全反映上述逻辑:下面是最终函数如何排除输入检查等的简化版本:
FillGaps <- function( dataMatrix, fillGapMax ) {
require("zoo")
numRow <- nrow(dataMatrix)
numCol <- ncol(dataMatrix)
iteration <- (numCol-fillGapMax)
if(length(iteration)>0) {
for (i in iteration:1) {
tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
}
}
return(dataMatrix)
}
A while back I wrote a function to fill time series matrices that had NA values up according to the specifications needed and it's had occational uses on a few matrices that are about 50000 rows, 350 columns. The matrix can contain either numeric or character values. The main problem is fixing the matrix is slow and I thought I'd gauge some experts on how to do this faster.
I guess going to rcpp or paralleling it might help but I think it's might be my design rather than R itself that's inefficient. I generally vecotrize everything in R but since the missing values follow no pattern I've found no other way than to work with the matrix on a per row basis.
The function needs to be called so it can carry forwards missing values and also be called to quickly just fill the latest values with the last known one.
Here is an example matrix:
testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA,
4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA,
NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA,
66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA,
-12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA,
NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA,
4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L,
9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e",
"ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31",
"2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31",
"2011-04-30", "2011-05-31")))
print(testMatrix)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
This is the function I currently use:
# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax --- The max number of columns to carry a number
# forward if there are no more values in the
# time series row.
#
# Returns:
# A matrix with gaps filled.
GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {
if("DEBUG_ON" %in% ls(globalenv())){browser()}
cntRow <- nrow(inputMatrix)
cntCol <- ncol(inputMatrix)
#
if (forwardLooking) {
for (i in 1:cntRow) {
# Store the location of the first non NA element in the row
firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
if (!(is.na(firstValueCol))) {
if (!(firstValueCol == cntCol)) {
nextValueCol <- firstValueCol
# If there is a a value number in the row and it's not at the end of the time
# series, start iterating through the row while there are more NA values and
# more data values and not at the end of the row continue.
while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
# Find the next NA element
nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
# Find the next value element
nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
# If there is another value element then fill up all NA elements in between with the last known value
if (!is.na(nextValueCol)) {
inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
} else {
# If there is no other value element then fill up all NA elements up to the max number supplied
# with the last known value unless it's close to the end of the row then just fill up to the end.
inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
nextValueCol <- cntCol
}
}
}
}
}
} else {
for (i in 1:cntRow) {
if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
if (length(tempRow[!is.na(tempRow)])>0) {
lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
}
}
}
}
return(inputMatrix)
}
I'm then calling it with something like:
> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
or
> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39
This example runs quickly but is there any way to make it faster for large matrices?
> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
86.47 0.06 87.24
This dummy one has a lot of NA only rows and completely filled ones but the normal ones can take about 15-20 min.
UPDATE
Regarding Charles' comment about na.locf not completely mirroring the logic of the above: Below is a simplified version of how the final function is excluding checks for inputs etc:
FillGaps <- function( dataMatrix, fillGapMax ) {
require("zoo")
numRow <- nrow(dataMatrix)
numCol <- ncol(dataMatrix)
iteration <- (numCol-fillGapMax)
if(length(iteration)>0) {
for (i in iteration:1) {
tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
}
}
return(dataMatrix)
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我可能是错的,但我认为这是在 zoo 包中实现的:使用 <代码>na.locf函数。
对于给定的示例矩阵,首先我们应该对其进行转置,然后在调用
na
函数后,我们“重新转置”结果矩阵。例如:对于较小的
maxgap
:可以看到使用
na.locf
获得的性能:I might be wrong but I think this is implemented in the zoo package: use the
na.locf
function.With your given example matrix, first we should transpose it, and after calling the
na
function we 'retranspose' the result matrix. E.g.:And with small
maxgap
:The performance gained using
na.locf
could be seen: