What causes my random "joblib.externals.loky.process_executor.TerminatedWorkerError" errors?
I'm doing GIS-based data analysis, in which I calculate wide-area, nationwide prediction maps (e.g. weather maps). Because my target area is very large (the whole country), I use a supercomputer (Slurm) and parallelization to calculate the prediction maps. That is, I split the prediction map into multiple pieces, with each piece being calculated in its own process (embarrassingly parallel processes), and within each process multiple CPU cores are used to calculate that piece (the map piece is further split into smaller pieces for the CPU cores).
I use Python's joblib library to take advantage of the multiple cores at my disposal, and most of the time everything works smoothly. But sometimes, randomly, about 1.5% of the time, I get the following error:
Traceback (most recent call last):
File "main.py", line 557, in <module>
sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGBUS(-7)}
What causes this problem, any ideas? And how can I make sure it does not happen? This is irritating because, for example, if I have 200 map pieces being calculated and 197 succeed while 3 hit this error, I need to calculate those 3 pieces again.
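For reference, a minimal sketch of the setup described above (the function and splitting-helper names are hypothetical placeholders, not the actual code from main.py); the Parallel(...) call is written as it appears in the traceback:

from joblib import Parallel, delayed

def compute_sub_raster(piece):
    # Placeholder for the per-piece prediction-map computation.
    return piece

def split_map_into_pieces(n_pieces):
    # Placeholder for splitting the nationwide map into ~200 pieces.
    return range(n_pieces)

if __name__ == "__main__":
    pieces = split_map_into_pieces(200)
    sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
        delayed(compute_sub_raster)(piece) for piece in pieces
    )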
A:
a) A Python interpreter process (even if run on a supercomputer) lives in actual localhost RAM.
b) Given (a), the number of such localhost CPU cores controls the joblib.Parallel() behaviour.
c) Given (b), having set n_jobs = -1 and also pre_dispatch = '2*n_jobs' makes such a Python interpreter start requesting that many loky-backend-specific separate process instantiations, as an explicit multiple of the localhost number of CPU cores (which could be anywhere from 4, 8, 16, ..., 80, ..., 8192, depending on the actual "supercomputer" hardware / SDS composition).
d) Given (c), each such new Python interpreter process (anywhere between 8, 16, 32, ..., 160, ..., 16384 new Python interpreter processes may be demanded to launch) requests a new, separate RAM allocation from the localhost O/S memory manager.
e) Given (d), these accumulating RAM allocations (each Python process may ask for anything between 30 MB and 3000 MB of RAM, depending on the actual joblib backend used and on the memory, i.e. the richness of the internal state, of the __main__, joblib.Parallel()-launching Python interpreter) can easily and quickly grow beyond physical RAM, at which point swap starts to emulate the missing capacity by exchanging blocks of RAM content between physical RAM and disk storage, at latencies about 10,000x to 100,000x higher than if the work were not forced into such a swapping, virtual-memory emulation of the missing physical RAM resources.
f) Given (e), "supercomputing" administration often prohibits such over-allocation via administrative tools and kills all processes that try to oversubscribe RAM resources beyond some fair-use threshold or per-user quota.
g) Given (f) and with respect to the documented trace: the chain of evidence above points to either a SegFAULT (not probable in Python interpreter realms) or a deliberate KILL due to "supercomputer" Fair Usage Policy violation(s), here due to excessive memory usage.
For SIGBUS(-7) you may defensively try to avoid Lustre flushing and revise the details of your mmap usage, which may potentially be trying to read "beyond EoF", if applicable.
Consult the Fair Usage Policies applicable at your "supercomputer" with its Technical Support Dept. so as to get valid ceiling details. Next, refactor your code not to pre_dispatch that many processes, if you still want to use the strategy of single-node process replication instead of another, less RAM-blocking, more efficient HPC computing strategy.
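One possible way to act on that advice (a sketch only, not the asker's actual code): derive n_jobs from the Slurm allocation instead of using n_jobs=-1, keep pre_dispatch tight, and cap the worker count by a rough per-worker memory estimate. SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE are standard Slurm environment variables but are only set for certain job configurations, and the 2000 MB per-worker figure below is an assumption you would need to adapt:

import os
from joblib import Parallel, delayed

def compute_sub_raster(piece):
    # Placeholder for the real per-piece prediction-map computation.
    return piece

if __name__ == "__main__":
    # Cores actually granted by Slurm (fall back to the local count if unset).
    allocated_cores = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))

    # Rough memory guard: assume ~2000 MB per worker (adapt this figure to your
    # code) and never start more workers than the memory allocation can hold.
    mem_per_worker_mb = 2000
    mem_limit_mb = int(os.environ.get("SLURM_MEM_PER_NODE", "64000"))
    n_jobs = max(1, min(allocated_cores, mem_limit_mb // mem_per_worker_mb))

    sub_rasters = Parallel(n_jobs=n_jobs, pre_dispatch="1*n_jobs")(
        delayed(compute_sub_raster)(piece) for piece in range(200)
    )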
A:
One reason for the error (“The exit codes of the workers are {SIGBUS(-7)}”) is that joblib can do only one level of parallelization: situations where one starts several parallel workers and each of them starts another set of parallel workers are forbidden (and lead to SIGBUS crashes).
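To make the point concrete, here is a hypothetical illustration (not code from the question) of the nested pattern this answer warns about, together with a flattened variant that keeps only one process-based level of parallelism:

from joblib import Parallel, delayed

def inner_task(x):
    return x * x

def compute_piece_nested(piece):
    # Pattern the answer warns against: a second process-based Parallel
    # started from inside a worker process.
    return Parallel(n_jobs=-1)(delayed(inner_task)(x) for x in range(piece, piece + 4))

def compute_piece_flat(piece):
    # Safer alternative: keep the inner loop sequential inside each worker.
    return [inner_task(x) for x in range(piece, piece + 4)]

if __name__ == "__main__":
    results = Parallel(n_jobs=2)(delayed(compute_piece_flat)(p) for p in range(8))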