Optimal database design for query speed when storing R matrices
I have hundreds of matrices that need to be used in R, most of them around 45000x350. What I'd like to do is find an optimal database software choice and schema to store the data and be able to pull subsets of the matrices from the database. The goal is to make extracting the data as fast as possible.
As a baseline, here is code that creates 5 matrices similar to what I'm dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") }
require("zoo", quietly=TRUE)
numSymbols <- 45000
numVariables <- 5
rDatePattern <- "%d/%m/%Y"
startDate <- "31/12/1982"
endDate <- "30/09/2011"
startYearMonth <- as.yearmon(startDate,format=rDatePattern)
alphaNumeric <- c(1:9,toupper(letters))
numMonths <- (as.yearmon(endDate,format=rDatePattern)-startYearMonth)*12
numValues <- numSymbols*(numMonths+1)
# Month-end dates: first day of the following month minus one day
dateVector <- sapply(1:(numMonths+1), function(x) {
  as.character(format(as.Date(startYearMonth + x*1/12, frac=0) - 1, rDatePattern))
})
symbolNames <- sapply(1:numSymbols, function(x) {
  paste(sample(alphaNumeric, 7), collapse="")
})
for(i in 1:numVariables) {
  assign(paste("Variable", i, sep="_"),
         matrix(sample(c(rnorm(numValues/2), rep(NA, numValues/2))),
                nrow=numSymbols,
                ncol=(numMonths+1),
                dimnames=list(symbolNames, dateVector)))
}
Basically all the matrices will have about half the values filled with doubles and the rest NAs.
# > ls()[grepl("Variable_",ls())]
# [1] "Variable_1" "Variable_2" "Variable_3" "Variable_4" "Variable_5"
# > dim(Variable_1)
# [1] 45000 346
# > Variable_1[1:10,1:3]
# 31/12/1982 31/01/1983 28/02/1983
# AF3HM5V NA NA -1.15076100366945755
# DL8TVIY NA NA -1.59412257037490046
# JEFDYPO NA NA NA
# 21ZV689 NA NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC NA NA -1.31113363079930023
# XG3ZA1W NA 0.8531074740045220 0.06797987526470438
# W1JCXIE 0.2782029710832690 -1.2668560986048898 NA
# X3RKT2B 1.5220172324681460 -1.0460218516729356 NA
# 3EUB8VN -0.9405417187846803 1.1151437940206490 1.60520458945005262
I want to be able to store these in a database. RDBMS would be the default option, but I'm willing to look at other options. The biggest concern is an optimal solution for quick querying, whether for the whole matrix or a subset of it, e.g. 2000 symbols and 100 dates.
The current solution I've been using is to save each matrix as an RData file, load the whole matrix, and then truncate it to the subset I need. This is really quick, but I feel a database design would be more beneficial for scaling the matrices in symbols + dates and for backing up the data.
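For reference, that workflow amounts to something like the following (a minimal sketch; the file name and the subset sizes are just placeholders):
save(Variable_1, file="NVAR_1.RData")           # one file per variable
loadedName <- load("NVAR_1.RData")              # pulls the full 45000x346 matrix into RAM
subsetMat <- get(loadedName)[symbolNames[1:2000], dateVector[1:100]]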
What I've tried so far in terms of RDBMS options:
A)
- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices on everything but Value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazily memory-inefficient)
- Average query for a normal sample into R: 4-8 minutes
B)
- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices on everything but Value.
- Views added to cache common subsets (not sure it helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazily memory-inefficient; see the query-and-pivot sketch after option C)
- Average query for a normal sample into R: 3-5 minutes
C) [Should maybe have tried a column-oriented database here]
- Symbols and dates stored separately and mapped to row and column numbers only
- Each variable in a separate table with symbols as rows and dates as columns
- Really bad for how the data maps to disk when scaling rows and columns.
- Data already in the correct format for R
- Average query for a normal sample into R: 1-3 minutes
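To make the melt/pivot step concrete for schema B, the round trip into R looks roughly like this (a minimal sketch; the ODBC DSN "MatrixDB", the table name, and the date filter are assumptions, and reshape2::acast is just one way to do the pivot):
require("RODBC")
require("reshape2")
ch <- odbcConnect("MatrixDB")                         # assumed DSN
long <- sqlQuery(ch,
                 "SELECT Symbol, Date, Value
                    FROM Variable_1
                   WHERE Date BETWEEN '1983-01-31' AND '1992-12-31'",
                 stringsAsFactors = FALSE)
odbcClose(ch)
# Pivot the long result into the symbols x dates matrix R expects;
# combinations missing from the table come back as NA.
mat <- acast(long, Symbol ~ Date, value.var = "Value")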
In comparison with the above database setups, loading a whole variable from the RData files takes 5 seconds locally and 20 seconds over the network. All the database times are over the network.
Is there anything I can do to make the database route come anywhere close to the binary file speeds?
Maybe one of the tabular nosql databases is what I need?
How does that scale in terms of additional symbols + dates?
Any help from someone who's dealt with something similar would be appreciated.
Update: Thought I'd post an update to this. In the end I went with Iterator's suggestion: the data is now hosted in bigmemory memory-mapped files, kept in RData as well for quick ad-hoc checking, and exported to CSV and pulled into SQL Server for backup purposes. Any database solution is too slow to be used by multiple users. Also, using RODBC against SQL Server is crazy slow; moving data between R and SQL via CSV worked okay, but it was pointless.
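For anyone following the same route, creating the file-backed matrices is a one-time step along these lines (a minimal sketch; the backing/descriptor file names are assumptions, chosen to match the NVAR_* pattern used in the benchmark below):
require("bigmemory")
# One-time conversion of an in-memory matrix to a file-backed big.matrix.
# The .desc file written here is what attach.big.matrix() reads later.
bm <- as.big.matrix(Variable_1,
                    backingfile    = "NVAR_1.bin",
                    descriptorfile = "NVAR_1.desc",
                    backingpath    = workingDirectory)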
Also for reference, byte-compiling the load function does have an impact for bigmemory. Here are the results of my load test for RData vs bigmemory.
workingDirectory <- "/Users/Hans/92 Speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")
# RData route: load the full matrix from disk, then subset in memory.
LoadVariablesInFolder <- function(folder, sedols, dates) {
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.RData", filesInFolder)]
  filesToLoad <- paste(folder, filesToLoad, sep="/")
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    loadedVar <- load(fToLoad)
    assign(loadedVar, get(loadedVar)[sedols, dates])
    ans <- gc()
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar))
  }
  return(variablesThatWereLoaded)
}
cLoadVariablesInFolder <- cmpfun(LoadVariablesInFolder)
# bigmemory route: attach the file-backed matrix and read only the requested subset.
BigMLoadVariablesInFolder <- function(folder, sedols, dates) {
  workD <- getwd()
  setwd(folder)
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.desc", filesInFolder)]
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    tempVar <- attach.big.matrix(dget(fToLoad))
    loadedVar <- gsub(".*(NVAR_\\d+).*", "\\1", fToLoad, perl=TRUE)
    assign(loadedVar, tempVar[sedols, dates])
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar, "tempVar"))
    ans <- gc()
  }
  setwd(workD)
  return(variablesThatWereLoaded)
}
cBigMLoadVariablesInFolder <- cmpfun(BigMLoadVariablesInFolder)
testCases <- list(
  list(numSedols=1000,  numDates=120),
  list(numSedols=5000,  numDates=120),
  list(numSedols=50000, numDates=120),
  list(numSedols=1000,  numDates=350),
  list(numSedols=5000,  numDates=350),
  list(numSedols=50000, numDates=350))
load(paste(workingDirectory, "dates.cache",  sep="/"))
load(paste(workingDirectory, "sedols.cache", sep="/"))
for (testCase in testCases) {
  results <- benchmark(
    LoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    BigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cBigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    columns=c("test", "replications", "elapsed", "relative"),
    order="relative", replications=3)
  cat("Results for testcase:\n")
  print(testCase)
  print(results)
}
Basically, the smaller the subset, the more you gain, because you don't spend time loading the whole matrix. But loading the whole matrix from bigmemory is slower than from RData; I guess it's the conversion overhead:
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 6.799999999999955 1.000000000000000
# 3 3 14.389999999999986 2.116176470588247
# 1 3 235.639999999999986 34.652941176470819
# 2 3 250.590000000000032 36.851470588235543
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 7.080000000000155 1.000000000000000
# 3 3 32.730000000000018 4.622881355932105
# 1 3 249.389999999999873 35.224576271185654
# 2 3 254.909999999999854 36.004237288134789
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 120
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 146.3499999999999 1.000000000000000
# 4 3 148.1799999999998 1.012504270584215
# 2 3 238.3200000000002 1.628425008541171
# 1 3 240.4600000000000 1.643047488896482
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 83.88000000000011 1.000000000000000
# 4 3 91.71000000000004 1.093347639484977
# 1 3 235.69000000000005 2.809847401049115
# 2 3 240.79999999999973 2.870767763471619
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 135.6999999999998 1.000000000000000
# 4 3 155.8900000000003 1.148784082535008
# 2 3 233.3699999999999 1.719749447310245
# 1 3 240.5599999999995 1.772733971997051
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 350
# test
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 2 3 236.5000000000000 1.000000000000000
# 1 3 237.2100000000000 1.003002114164905
# 3 3 388.2900000000000 1.641818181818182
# 4 3 393.6300000000001 1.664397463002115
3 Answers
I would strongly recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.
Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
Note 1: If you have just a few types per row, i.e. 1 character and the rest are numeric, you can simply create 2 memory-mapped matrices, one of which is for characters, the other for numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
Update 1: I overlooked the ff package - this is another good option, though it is a lot more complex than bigmemory.
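To illustrate the random-access point, pulling a rectangular subset straight out of an HDF5 file from R could look roughly like this (a sketch using the rhdf5 Bioconductor package; the file name, dataset name, and index ranges are assumptions, and the symbol/date labels would need to be stored as separate datasets):
require("rhdf5")
# Read only a 2000-symbol x 100-date window; the rest of the dataset is
# never touched on disk, which is where the speedup comes from.
sub <- h5read("variables.h5", "Variable_1",
              index = list(1:2000, 1:100))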
Maybe the database design of the TSdbi package is inspiring...
For a NoSQL solution, HDF5 might be an option. I do not know much about it, though.
You could also delve into the uses/internals of sqldf. It seems they got their database work pretty well sorted out. There are also a lot of interesting examples given on their page.
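As a rough illustration of what sqldf offers (a minimal sketch; it assumes the long Symbol/Date/Value layout from schema B is sitting in a data frame called long_df, with Date stored as an ISO date string):
require("sqldf")   # runs SQL (SQLite by default) directly against R data frames
sub <- sqldf("SELECT Symbol, Date, Value
                FROM long_df
               WHERE Date >= '1983-01-31'")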