Slurm job array cannot run Rscript using shapefiles
I would like to run a job array via Slurm on an HPC cluster, intersecting individual circle shapefiles with a large shapefile of Census blocks, then saving the resulting intersection shapefile. I will then combine these individual shapefiles into one large one on my own machine. This is a way to avoid the parallelization problems I describe in an earlier question: mapply error on list from sf (simple features) object in R
However, when running the job array, I receive the following error:
sbatch: error: Batch job submission failed: Invalid job array specification
Here is a link to the R script, the .sh file, and the filename CSV I am using on my HPC cluster: https://github.com/msghankinson/slurm_job_array.
The R code relies on 3 files:
- "buffer" - these are the circle polygons. I've split a large shapefile of 3,086 circles into 3,086 individual shapefiles of 1 circle each (saved in /lustre/ in the "lihtc_bites" folder). The goal for the R script is to intersect 1 circle with the Census blocks in each run of the script, then save that intersection as a shapefile. I will then combine these 3,086 intersection shapefiles into one dataframe on my own laptop. For the reprex, I only include 2 of the 3,086 shapefiles.
- "lihtc" - this is a shapefile that I use as an index in my R function. There are 3 versions of this shapefile. Each circle shapefile matches one of these "lihtc" shapefiles. For the reprex, I only include the one shapefile which matches my 2 circle shapefiles.
- "blocks" - these are the 710,000 Census blocks. This file remains the same for each run of the R script, regardless of which circle is being used in the intersection. For the reprex, I only include a shapefile of the 7,386 blocks in San Francisco County.
I've run the R code on specific, individual buffer and lihtc shapefiles and the function works. So my main focus is the .sh file launching the job array ("lihtc_array_example.sh"). Here, I am trying to run my R script on each "buffer" shapefile using the task ID and the "master_example.csv" (also in the reprex) to define which files are loaded into R. Each row of master_example.csv contains the buffer filename and the lihtc filename I need. These filenames need to be passed to the R script and used to load the correct files for each intersection. E.g., Task 1 loads the files listed in row 1 of master_example.csv. The code I found tries to pull these names in the .sh file via:
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
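As a quick sanity check outside Slurm, the same extraction can be run on a hand-written line; the row contents below are hypothetical stand-ins for a real master_example.csv row:

line_N="1,buffer_0001.shp,lihtc_1990.shp"   # hypothetical row: index, buffer file, lihtc file
echo "$line_N" | cut -d "," -f 2            # prints: buffer_0001.shp
echo "$line_N" | cut -d "," -f 3            # prints: lihtc_1990.shp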
While I understand that it is difficult to run the reprex, I would like to know whether there are any clear breakdowns in the pipeline between the .sh file, the CSV of names, and the R script. I am happy to provide any additional information that may be helpful.
Full .sh file, for ease of access:
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -o jobArrayScript_%A_%a.out
#SBATCH -e jobArrayScript_%A_%a.err
#SBATCH -a 1-3086%1000
line_N=$( awk "NR==$SLURM_ARRAY_TASK_ID" master_example.csv ) # NR means row-# in Awk
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
module load R/4.1.1
module load libudunits2/2.2.28
module load gdal/3.5.0
module load proj/6.3.0
module load geos/3.10.3
Rscript slurm_job_array.R $shp_filename $lihtc_filename
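One low-cost check, not in the original script, is to echo the parsed values before the Rscript call, so each task's .out file records exactly what was passed:

echo "task $SLURM_ARRAY_TASK_ID: shp=$shp_filename lihtc=$lihtc_filename"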
For reference:
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.9 ggmap_3.0.0 ggplot2_3.3.6 sf_1.0-7
loaded via a namespace (and not attached):
[1] xfun_0.28 tidyselect_1.1.2 purrr_0.3.4 lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2
[8] htmltools_0.5.2 s2_1.0.7 utf8_1.2.2 rlang_1.0.2 e1071_1.7-9 pillar_1.7.0 glue_1.6.2
[15] withr_2.5.0 DBI_1.1.1 sp_1.4-6 wk_0.5.0 jpeg_0.1-9 lifecycle_1.0.1 plyr_1.8.7
[22] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 RgoogleMaps_1.4.5.3 evaluate_0.15 knitr_1.36 fastmap_1.1.0
[29] curl_4.3.2 class_7.3-19 fansi_1.0.3 highr_0.9 Rcpp_1.0.8.3 KernSmooth_2.23-20 scales_1.2.0
[36] classInt_0.4-3 farver_2.1.0 rjson_0.2.20 png_0.1-7 digest_0.6.29 stringi_1.7.6 grid_4.1.0
[43] cli_3.3.0 tools_4.1.0 bitops_1.0-7 magrittr_2.0.3 proxy_0.4-26 tibble_3.1.7 crayon_1.5.1
[50] tidyr_1.2.0 pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2 rstudioapi_0.13
[57] R6_2.5.1 units_0.7-2 compiler_4.1.0
Answer:

Three problems identified and now solved:

First, the maximum array size refers to the entire array; the throttle (the %1000 suffix) only sets how many jobs get scheduled at one time. So the 3,086-task job needed to be broken into 4 separate batches. This can be done in the .sh file as:

#SBATCH -a 1-999

for job 1,

#SBATCH -a 1000-1999

for job 2, and so on.
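The cap in question is Slurm's MaxArraySize setting, which can usually be checked from a login node; the value in the comment below is Slurm's default, not this cluster's verified setting:

scontrol show config | grep -i MaxArraySize   # e.g., MaxArraySize = 1001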
Second, the R script needs to catch the arguments from the command line. The script now begins:

args <- commandArgs(trailingOnly = TRUE)
shp_filename <- args[1]
lihtc_filename <- args[2]
Third, the submission file was sending the arguments wrapped in quotation marks, which prevented paste0 from creating usable file names. Neither noquote() nor print(x, quotes = F) was able to remove the quotes, but gsub('"', '', x) worked.

An inelegant/lazy parallelization on my part, but it works. Case closed.
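Putting fixes 2 and 3 together, the top of the revised R script would look something like the sketch below; the folder prefix in the paste0() call is an assumption for illustration, not the actual script:

# Catch the filenames passed from the .sh file and strip stray quote characters
args <- commandArgs(trailingOnly = TRUE)
shp_filename <- gsub('"', '', args[1])
lihtc_filename <- gsub('"', '', args[2])

# Hypothetical path construction with paste0(), now that the quotes are gone
shp_path <- paste0("lihtc_bites/", shp_filename)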