Efficiently reading large multi-band rasters with rgdal

I am working on an image classification script in R using the rgdal package. The raster in question is a PCIDSK file with 28 channels: a training data channel, a validation data channel, and 26 spectral data channels. The objective is to populate a data frame containing the values of each pixel which is not zero in the training data channel, plus the associated spectral values in the 26 bands.

In Python/NumPy I can easily import all the bands of the entire image into a multi-dimensional array; in R, however, memory limitations seem to leave no option but to import the data block by block, which is very slow:

library(rgdal)

raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500

# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[[1]]
xsize = myinfo[[2]]
numbands = myinfo[[3]]


# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
 if(column + BlockWidth > xsize){
  BlockWidth = xsize - column
 }
 row = 0
 while(row < ysize){
  if(row + BlockHeight > ysize){
   BlockHeight = ysize - row
  }

  # Read this block (training band plus spectral bands) and keep only pixels with non-zero training values
  myblock = readGDAL(raster, region.dim = c(BlockHeight,BlockWidth), offset = c(row, column), band = c(training_band,3:numbands), silent = TRUE)
  blockdata = matrix(NA, dim(myblock)[1], dim(myblock)[2])
  for(i in 1:(dim(myblock)[2])){
   blockdata[,i] = myblock[[i]]  # extract band i directly instead of via eval(parse())
  }
  blockdata = as.data.frame(blockdata)
  blockdata = subset(blockdata, blockdata[,1] > 0)
  if (dim(blockdata)[1] > 0){
   training_data = rbind(training_data, blockdata)
  }

  row = row + BlockHeight
 }
 column = column + BlockWidth
}
remove(blockdata, myblock, BlockHeight, BlockWidth, row, column)

Is there a faster/better way of doing the same thing without running out of memory?

The next step after this training data is collected is to create the classifier (randomForest package) which also requires a lot of memory, depending on the number of trees requested. This brings me to my second problem, which is that creating a forest of 500 trees is not possible given the amount of memory already occupied by the training data:

# Note: in R's formula syntax "V3:V27" denotes the interaction of V3 and V27,
# not the range of columns V3 through V27, so list the predictor terms explicitly:
predictors = paste(names(training_data)[3:ncol(training_data)], collapse = " + ")
myformula = formula(paste("as.factor(V1) ~", predictors))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest = TRUE)
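One memory-saving variant worth noting here (my addition, not part of the original post): the randomForest documentation advises against the formula interface for large data sets because of the overhead of handling the model frame. A minimal sketch using the x/y interface instead, assuming the column layout above (V1 = training class, V3 onward = spectral values):

library(randomForest)

# Pass the predictors and response directly, skipping the formula machinery
r_tree = randomForest(x = training_data[, 3:ncol(training_data)],
                      y = as.factor(training_data[, 1]),
                      ntree = 500, keep.forest = TRUE)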

Is there a way to allocate more memory? Am I missing something? Thanks...

[EDIT]
As suggested by Jan, using the "raster" package is much faster; as far as I can tell, however, it does not solve the memory problem for gathering the training data, because the data eventually has to sit in a data frame in memory:

library(raster)
library(randomForest)

# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1

# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_data = as.data.frame(training_data)

So while this is much faster (and takes less code), it still does not solve the issue of not having enough free memory to create the classifier... Is there some "raster" package function that I have not found that can accomplish this? Thanks...

3 Answers

好菇凉咱不稀罕他 2024-10-09 04:34:55

Check out the raster package. It provides a handy wrapper around rgdal that lets you work with rasters without loading them into memory.

http://raster.r-forge.r-project.org/

Hopefully this helps.

The 'raster' package deals with basic spatial raster (grid) data access and manipulation. It defines raster classes; can deal with very large files (stored on disk); and includes standard raster functions such as overlay, aggregation, and merge.

The purpose of the 'raster' package is to provide easy-to-use functions for raster manipulation and analysis. These include high-level functions such as overlay, merge, aggregate, projection, resample, distance, and polygon-to-raster conversion. All these functions work for very large raster datasets that cannot be loaded into memory. In addition, the package provides lower-level functions, such as row-by-row reading and writing (to many formats via rgdal), for building other functions.

By using the raster package you can avoid filling your memory before using randomForest.

[EDIT] To address the memory problem with randomForest, it may help to train the individual trees on subsamples (of size << n) rather than bootstrap samples (of size n).
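A minimal sketch of that idea (my sketch, not the answerer's code): the sampsize argument of randomForest caps how many cases each tree is grown on, which should reduce both the working memory and the size of the forest object. The column layout from the question is assumed:

library(randomForest)

# Each tree sees a random subsample of at most 10,000 cases instead of a
# full-size bootstrap sample
r_tree = randomForest(x = training_data[, 3:ncol(training_data)],
                      y = as.factor(training_data[, 1]),
                      ntree = 500,
                      sampsize = min(10000, nrow(training_data)),
                      keep.forest = TRUE)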

柏林苍穹下 2024-10-09 04:34:55

I think the key here is this: "a data frame containing the values of each pixel which is not zero in the training data channel". If the resulting data.frame is small enough to hold in memory, you could determine this by reading just that band, then trimming to only the non-zero values, then trying to create a data.frame with that many rows and the total number of columns you want.

Can you run this?

 library(rgdal)

 training_band = 2

 # Read only the training band, then keep the pixels whose value is non-zero
 df = readGDAL("image.pix", band = training_band)
 df = as.data.frame(df[df[[1]] != 0, ])

Then you could populate the data.frame's columns one by one by reading each band separately and trimming as for the training band.

If that data.frame is too big then you're stuck - I don't know if randomForest can use the memory-mapped data objects in "ff", but it might be worth trying.

EDIT: some example code, and note that raster gives you memory-mapped access, but the problem is whether randomForest can use memory-mapped data structures. You can possibly read in only the data you need, one band at a time - you would want to try to build the full data.frame first, rather than appending columns.

Also, if you can generate the full data.frame from the start, then you'll know whether it should work. By rbind()ing your way through as your code does, you need increasingly larger chunks of contiguous memory, and that might be avoidable.
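A minimal sketch of that preallocation approach (my sketch, not the answerer's code; it assumes the 28-band image.pix layout from the question, with band 2 holding the training classes and bands 3-28 the spectral values):

library(rgdal)

# Read only the training band and note which pixels are non-zero
training = readGDAL("image.pix", band = 2, silent = TRUE)
keep = which(training[[1]] != 0)

# Allocate the full data.frame once: one row per kept pixel, one column for
# the class label plus one per spectral band
training_data = data.frame(matrix(NA_real_, nrow = length(keep), ncol = 27))
training_data[, 1] = training[[1]][keep]

# Fill one column at a time, holding only a single band in memory
for (b in 3:28) {
  band = readGDAL("image.pix", band = b, silent = TRUE)
  training_data[, b - 1] = band[[1]][keep]
}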

黎夕旧梦 2024-10-09 04:34:55

This is not an exact solution to what you are doing, but I'm posting it here for others who may come across this: it is how I manage memory for very large rasters.

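# Note: climate_vars, climate_path, folders, file_prefixes, file_postfixes,
# the *_west / *_0_30 / *_31_200 inputs, dt_col_names, and gam.bare are all
# defined elsewhere in the author's script; this excerpt shows only the
# block-wise prediction workflow.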
if (!require(rgdal)) install.packages('rgdal')
library(rgdal)
if (!require(raster)) install.packages('raster')
library(raster)
if (!require(doParallel)) { install.packages('doParallel'); library(doParallel); }
predictions_raster <- NULL

input_files = c()
for (climate_var in climate_vars) {
  input_files = c(input_files, paste0(climate_path,'/',folders[climate_index],'/',file_prefixes[climate_index],climate_var,file_postfixes[climate_index]))
}

out_img_path = paste0(output_climate_path,'/',folders[climate_index])#,'.tif')

#Combine all input files into one string vector
input_files = c(input_files, aspect_x_w, aspect_y_w, elev_west, posidex_west, slope_west,
              awc_0_30, awc_31_200, clay_0_30, clay_31_200, om_0_30, om_31_200, sand_0_30, sand_31_200, silt_0_30, silt_31_200)

print(input_files)

rasters <- list()
vals <- list()
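# Open each input file as a disk-backed RasterLayer; raster() reads only
# metadata, so no pixel values are loaded here. (The row/nrows/col/ncols
# arguments are accepted but unused.)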
load_rasters <- function(input_files, row, nrows, col, ncols) {
  for (i in 1:length(input_files)) {
    filename <- input_files[i]
    cat(paste0("Trying to open: ", filename, "\n"))
    raster_new <- raster::raster(filename)
    rasters[[i]] <<- raster_new
  }
}

for (file in input_files) {
  if (!file.exists(file)) {
    stop(paste0("File does not exist: ", file))
  }
}

load_rasters(input_files, 1, 1, 1, 1)

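# Predict block by block: writeStart() opens the output GTiff, blockSize()
# chooses row chunks that fit in memory, and each chunk is read with
# getValuesBlock(), scored in parallel by mgcv::predict.bam(), and written
# straight to disk with writeValues().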
gam_parallel_predict <- function(gam, ras, output_img) {
  predictions_raster <<- NULL

  applyKernel <- function(newX, gam, cluster, chunksize = 10000, ...) {
    mgcv::predict.bam(gam, newdata = newX, cluster = cluster)
  }

  last_i <- 1
  max_cores <- parallel::detectCores()
  max_row <- ras@nrows
  max_chunk_size <- 100000

  predictions_parallel <- c()

  cores <- max(1, max_cores)

  cl <- parallel::makeCluster(cores)
  registerDoParallel(cl)
  start.time <- Sys.time()
  output_raster <- output_img
  predictions_raster <<- raster::writeStart(x = raster(ras), filename = output_raster, format = "GTiff", datatype = "INT1U", overwrite = T)
  tr <- blockSize(predictions_raster)
  for (i in 1:tr$n) {
    vals <- as.data.frame(raster::getValuesBlock(x = ras, row = tr$row[i], nrows = tr$nrows[i]))
    names(vals) <- dt_col_names

    predictions_parallel <- applyKernel(vals, gam = gam, cluster = cl, chunksize = min(floor((ras@ncols*ras@nrows)/cores), max_chunk_size))
    predictions_raster <<- raster::writeValues(predictions_raster, predictions_parallel, start = tr$row[i])
    cat(paste0("predict on value: ", i, "/", tr$n, " (", (100*i/tr$n), "%)", " dim=", length(predictions_parallel), "\n"))
  }

  predictions_raster <<- raster::writeStop(predictions_raster)
  end.time <- Sys.time()
  stopCluster(cl)
  cat(paste0("predict last dim=", length(predictions_parallel), "\n"))
  (time.taken <- end.time - start.time)
  cat(paste0("time-taken: ",time.taken,"\n"))

  return(predictions_raster)
}

ras <- raster::stack(rasters)
gam_parallel_predict(gam.bare, ras, out_img_path)