Optimal database design for query speed when storing R matrices
I have hundreds of matrices that need to be used in R, most of them around 45000x350. What I'd like to do is find an optimal database software choice and schema to store the data and be able to pull subsets of the matrices from the database. The goal is to make extracting the data as fast as possible.
As a baseline, here is code that creates 5 matrices similar to what I'm dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") }
require("zoo", quietly=TRUE)
numSymbols <- 45000
numVariables <- 5
rDatePattern <- "%d/%m/%Y"
startDate <- "31/12/1982"
endDate <- "30/09/2011"
startYearMonth <- as.yearmon(startDate,format=rDatePattern)
alphaNumeric <- c(1:9,toupper(letters))
numMonths <- (as.yearmon(endDate,format=rDatePattern)-startYearMonth)*12
numValues <- numSymbols*(numMonths+1)
# Month-end dates: first day of the following month minus one day
dateVector <- sapply(1:(numMonths+1), function(x) {
  as.character(format(as.Date(startYearMonth + x*1/12, frac=0) - 1, rDatePattern))
})
symbolNames <- sapply(1:numSymbols, function(x) {
  paste(sample(alphaNumeric, 7), collapse="")
})
for(i in 1:numVariables) {
  assign(paste("Variable", i, sep="_"),
         matrix(sample(c(rnorm(numValues/2), rep(NA, numValues/2))),
                nrow=numSymbols,
                ncol=(numMonths+1),
                dimnames=list(symbolNames, dateVector)))
}
Basically all the matrices will have about half the values filled with doubles and the rest NAs.
# > ls()[grepl("Variable_",ls())]
# [1] "Variable_1" "Variable_2" "Variable_3" "Variable_4" "Variable_5"
# > dim(Variable_1)
# [1] 45000 346
# > Variable_1[1:10,1:3]
# 31/12/1982 31/01/1983 28/02/1983
# AF3HM5V NA NA -1.15076100366945755
# DL8TVIY NA NA -1.59412257037490046
# JEFDYPO NA NA NA
# 21ZV689 NA NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC NA NA -1.31113363079930023
# XG3ZA1W NA 0.8531074740045220 0.06797987526470438
# W1JCXIE 0.2782029710832690 -1.2668560986048898 NA
# X3RKT2B 1.5220172324681460 -1.0460218516729356 NA
# 3EUB8VN -0.9405417187846803 1.1151437940206490 1.60520458945005262
I want to be able to store these in a database. RDBMS would be the default option, but I'm willing to look at other options. The biggest concern is an optimal solution for quick querying, whether for the whole matrix or a subset of it, e.g. 2000 symbols and 100 dates.
The current solution I've been using is to save each matrix as an RData file, load the whole matrix, and then truncate it to the subset I need. This is really quick, but I feel a database design would be more beneficial for scaling the matrices in symbols + dates and for backing up the data.
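For reference, that workflow amounts to something like the following (a minimal sketch; the file name and the subset sizes are just placeholders):
save(Variable_1, file="NVAR_1.RData")           # one file per variable
loadedName <- load("NVAR_1.RData")              # pulls the full 45000x346 matrix into RAM
subsetMat <- get(loadedName)[symbolNames[1:2000], dateVector[1:100]]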
What I've tried so far in terms of RDBMS options:
A)
- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices on everything but Value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazily memory-inefficient)
- Average query for a normal sample into R: 4-8 minutes
B)
- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices on everything but Value.
- Views added to cache common subsets (not sure it helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazily memory-inefficient; see the query-and-pivot sketch after option C)
- Average query for a normal sample into R: 3-5 minutes
C) [Should maybe have tried a column-oriented database here]
- Symbols and dates stored separately and mapped to row and column numbers only
- Each variable in a separate table with symbols as rows and dates as columns
- Really bad for how the data maps to disk when scaling rows and columns.
- Data already in the correct format for R
- Average query for a normal sample into R: 1-3 minutes
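To make the melt/pivot step concrete for schema B, the round trip into R looks roughly like this (a minimal sketch; the ODBC DSN "MatrixDB", the table name, and the date filter are assumptions, and reshape2::acast is just one way to do the pivot):
require("RODBC")
require("reshape2")
ch <- odbcConnect("MatrixDB")                         # assumed DSN
long <- sqlQuery(ch,
                 "SELECT Symbol, Date, Value
                    FROM Variable_1
                   WHERE Date BETWEEN '1983-01-31' AND '1992-12-31'",
                 stringsAsFactors = FALSE)
odbcClose(ch)
# Pivot the long result into the symbols x dates matrix R expects;
# combinations missing from the table come back as NA.
mat <- acast(long, Symbol ~ Date, value.var = "Value")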
In comparison with the above database setups, loading a whole variable from the RData files takes 5 seconds locally and 20 seconds over the network. All the database times are over the network.
Is there anything I can do to make the database route come anywhere close to the binary file speeds?
Maybe one of the tabular nosql databases is what I need?
How does that scale in terms of additional symbols + dates?
Any help from someone who's dealt with something similar would be appreciated.
Update: Thought I'd post an update to this. In the end I went with Iterator's suggestion: the data is now hosted in bigmemory memory-mapped files, kept in RData as well for quick ad-hoc checking, and exported to CSV and pulled into SQL Server for backup purposes. Any database solution is too slow to be used by multiple users. Also, using RODBC against SQL Server is crazy slow; moving data between R and SQL via CSV worked okay, but it was pointless.
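For anyone following the same route, creating the file-backed matrices is a one-time step along these lines (a minimal sketch; the backing/descriptor file names are assumptions, chosen to match the NVAR_* pattern used in the benchmark below):
require("bigmemory")
# One-time conversion of an in-memory matrix to a file-backed big.matrix.
# The .desc file written here is what attach.big.matrix() reads later.
bm <- as.big.matrix(Variable_1,
                    backingfile    = "NVAR_1.bin",
                    descriptorfile = "NVAR_1.desc",
                    backingpath    = workingDirectory)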
Also for reference, byte-compiling the load function does have an impact for bigmemory. Here are the results of my load test for RData vs bigmemory.
workingDirectory <- "/Users/Hans/92 Speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")
# RData route: load the full matrix from disk, then subset in memory.
LoadVariablesInFolder <- function(folder, sedols, dates) {
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.RData", filesInFolder)]
  filesToLoad <- paste(folder, filesToLoad, sep="/")
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    loadedVar <- load(fToLoad)
    assign(loadedVar, get(loadedVar)[sedols, dates])
    ans <- gc()
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar))
  }
  return(variablesThatWereLoaded)
}
cLoadVariablesInFolder <- cmpfun(LoadVariablesInFolder)
# bigmemory route: attach the file-backed matrix and read only the requested subset.
BigMLoadVariablesInFolder <- function(folder, sedols, dates) {
  workD <- getwd()
  setwd(folder)
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.desc", filesInFolder)]
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    tempVar <- attach.big.matrix(dget(fToLoad))
    loadedVar <- gsub(".*(NVAR_\\d+).*", "\\1", fToLoad, perl=TRUE)
    assign(loadedVar, tempVar[sedols, dates])
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar, "tempVar"))
    ans <- gc()
  }
  setwd(workD)
  return(variablesThatWereLoaded)
}
cBigMLoadVariablesInFolder <- cmpfun(BigMLoadVariablesInFolder)
testCases <- list(
  list(numSedols=1000,  numDates=120),
  list(numSedols=5000,  numDates=120),
  list(numSedols=50000, numDates=120),
  list(numSedols=1000,  numDates=350),
  list(numSedols=5000,  numDates=350),
  list(numSedols=50000, numDates=350))
load(paste(workingDirectory, "dates.cache",  sep="/"))
load(paste(workingDirectory, "sedols.cache", sep="/"))
for (testCase in testCases) {
  results <- benchmark(
    LoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    BigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cBigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    columns=c("test", "replications", "elapsed", "relative"),
    order="relative", replications=3)
  cat("Results for testcase:\n")
  print(testCase)
  print(results)
}
Basically, the smaller the subset, the more you gain, because you don't spend time loading the whole matrix. But loading the whole matrix from bigmemory is slower than from RData; I guess it's the conversion overhead:
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 6.799999999999955 1.000000000000000
# 3 3 14.389999999999986 2.116176470588247
# 1 3 235.639999999999986 34.652941176470819
# 2 3 250.590000000000032 36.851470588235543
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 7.080000000000155 1.000000000000000
# 3 3 32.730000000000018 4.622881355932105
# 1 3 249.389999999999873 35.224576271185654
# 2 3 254.909999999999854 36.004237288134789
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 120
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 146.3499999999999 1.000000000000000
# 4 3 148.1799999999998 1.012504270584215
# 2 3 238.3200000000002 1.628425008541171
# 1 3 240.4600000000000 1.643047488896482
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 83.88000000000011 1.000000000000000
# 4 3 91.71000000000004 1.093347639484977
# 1 3 235.69000000000005 2.809847401049115
# 2 3 240.79999999999973 2.870767763471619
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 135.6999999999998 1.000000000000000
# 4 3 155.8900000000003 1.148784082535008
# 2 3 233.3699999999999 1.719749447310245
# 1 3 240.5599999999995 1.772733971997051
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 350
# test
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 2 3 236.5000000000000 1.000000000000000
# 1 3 237.2100000000000 1.003002114164905
# 3 3 388.2900000000000 1.641818181818182
# 4 3 393.6300000000001 1.664397463002115
3 Answers
I would strongly recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.
Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
Note 1: If you have just a few types per row, i.e. 1 character and the rest are numeric, you can simply create 2 memory-mapped matrices, one of which is for characters, the other for numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
Update 1: I overlooked the ff package - this is another good option, though it is a lot more complex than bigmemory.
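To illustrate the random-access point, pulling a rectangular subset straight out of an HDF5 file from R could look roughly like this (a sketch using the rhdf5 Bioconductor package; the file name, dataset name, and index ranges are assumptions, and the symbol/date labels would need to be stored as separate datasets):
require("rhdf5")
# Read only a 2000-symbol x 100-date window; the rest of the dataset is
# never touched on disk, which is where the speedup comes from.
sub <- h5read("variables.h5", "Variable_1",
              index = list(1:2000, 1:100))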
Maybe the database design of the TSdbi package is inspiring...
For a NoSQL solution, HDF5 might be an option. I do not know much about it, though.
You could also delve into the uses/internals of sqldf. It seems they got their database work pretty well sorted out. There are also a lot of interesting examples given on their page.
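As a rough illustration of what sqldf offers (a minimal sketch; it assumes the long Symbol/Date/Value layout from schema B is sitting in a data frame called long_df, with Date stored as an ISO date string):
require("sqldf")   # runs SQL (SQLite by default) directly against R data frames
sub <- sqldf("SELECT Symbol, Date, Value
                FROM long_df
               WHERE Date >= '1983-01-31'")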