What causes my random "joblib.externals.loky.process_executor.TerminatedWorkerError" errors?

Posted 2025-01-19 09:27:31


I'm doing GIS-based data analysis, in which I calculate wide-area, nationwide prediction maps (e.g. weather maps). Because my target area is very big (the whole country), I am using a supercomputer (Slurm) and parallelization to calculate the prediction maps. That is, I split the prediction map into multiple pieces, with each piece calculated in its own process (embarrassingly parallel processes), and within each process multiple CPU cores are used to calculate that piece (the map piece is further split into smaller pieces for the CPU cores).

I use Python's joblib library to take advantage of the multiple cores at my disposal, and most of the time everything works smoothly. But sometimes, randomly, about 1.5% of the time, I get the following error:

Traceback (most recent call last):
  File "main.py", line 557, in <module>
    sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGBUS(-7)}

What causes this problem, any ideas? And how can I make sure it does not happen? It is irritating because, for example, if 200 map pieces are being calculated and 197 succeed while 3 hit this error, I have to calculate those 3 pieces again.
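Until the root cause is fixed, a retry wrapper can at least re-run only the failed pieces instead of the whole job. A minimal sketch, assuming a hypothetical compute_piece() worker function and a pieces list (the exception class path is the one shown in the traceback above):

from joblib import Parallel, delayed
from joblib.externals.loky.process_executor import TerminatedWorkerError

def run_with_retries(pieces, max_attempts=3):
    # compute_piece() and pieces are hypothetical stand-ins for the real
    # per-piece computation; only pieces whose worker pool died get re-run
    results = {}
    pending = list(enumerate(pieces))
    for _ in range(max_attempts):
        still_failing = []
        for idx, piece in pending:
            try:
                results[idx] = Parallel(n_jobs=-1, pre_dispatch='2*n_jobs')(
                    delayed(compute_piece)(sub) for sub in piece
                )
            except TerminatedWorkerError:
                still_failing.append((idx, piece))
        pending = still_failing
        if not pending:
            break
    if pending:
        raise RuntimeError(f'{len(pending)} pieces still failing after {max_attempts} attempts')
    return [results[i] for i in sorted(results)]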


Comments (2)

枕花眠 2025-01-26 09:27:31


Q :
" What causes this problem, any ideas? - I am using supercomputers "

A :
a)
A Python Interpreter process (even when run on a supercomputer) lives in the actual localhost RAM-memory.
b)
Given (a), the number of such localhost CPU-cores controls the joblib.Parallel() behaviour.
c)
Given (b), setting n_jobs = -1 makes such a Python Interpreter spawn one loky-backend worker process per localhost CPU-core (could be anywhere from 4, 8, 16, ..., 80, ... 8192 - yes, depends on the actual "supercomputer" hardware / SDS composition), while pre_dispatch = '2*n_jobs' additionally keeps twice that many task payloads queued up in memory ahead of the workers.
d)
Given (c), each such new Python Interpreter process (anywhere between 8, 16, 32, ..., 160, ... 16384 such new processes may be demanded to launch) requests a new, separate RAM-allocation from the localhost O/S memory manager.
e)
Given (d), such accumulating RAM-allocations (each Python process may ask for anything between 30 MB and 3000 MB of RAM, depending on the actual joblib-backend used and on the memory, i.e. the richness of the internal state, of the __main__, joblib.Parallel()-launching, Python Interpreter) may easily and soon grow beyond the physical RAM, where swap starts to emulate the missing capacity by exchanging blocks of RAM content between physical RAM and disk storage - at latencies some 10,000x - 100,000x higher than if not forced into such swapping virtual-memory emulation of the missing physical-RAM resources.
f)
Given (e), "supercomputing" administration often prohibits such over-allocation by administrative tools and kills all processes that try to oversubscribe RAM-resources beyond some fair-use threshold or user-profiled quota.
g)
Given (f), and w.r.t. the documented trace:

...
joblib.externals.loky.process_executor.TerminatedWorkerError:
       A worker process managed by the executor
       was unexpectedly terminated. This could be caused
           by a segmentation fault while calling the function
         or
           by an excessive memory usage
       causing the Operating System to kill the worker.

the above inducted chain of evidence confirms the cause to be either a SegFAULT (not probable inside Python Interpreter realms) or a deliberate KILL due to "supercomputer" Fair Usage Policy violation(s) - here, due to excessive memory usage.
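As a hedged illustration of staying below such a ceiling, one can cap the worker count by the available RAM rather than by the core count alone - a sketch in which PER_WORKER_BYTES (~2 GB peak per worker) is an assumption you must measure for your own workload:

import os
import psutil  # third-party package: pip install psutil

PER_WORKER_BYTES = 2 * 1024**3  # ASSUMPTION: ~2 GB peak per worker - measure yours

avail = psutil.virtual_memory().available
cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', os.cpu_count()))
# start no more workers than the CPUs granted by Slurm, nor more than
# the number of PER_WORKER_BYTES slices that fit into the free RAM
n_jobs = max(1, min(cpus, avail // PER_WORKER_BYTES))
# ...then e.g.: Parallel(n_jobs=n_jobs, pre_dispatch='n_jobs')(...)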

For SIGBUS(-7) you may defensively try to avoid Lustre flushing and revise the details of any mmap-usage, which may potentially be trying to read "beyond EoF", if applicable:

By default, Slurm flushes Lustre file system and kernel caches upon completion of each job step. If multiple applications are run simultaneously on compute nodes (either multiple applications from a single Slurm job or multiple jobs) the result can be significant performance degradation and even bus errors. Failures occur more frequently when more applications are executed at the same time on individual compute nodes. Failures are also more common when Lustre file systems are used.

Two approaches exist to address this issue. One is to disable the flushing of caches, which can be accomplished by adding "LaunchParameters=lustre_no_flush" to your Slurm configuration file "slurm.conf".
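For reference, that is a single administrator-level line in the cluster-wide Slurm configuration (the path shown is the common default and may differ on your cluster; a job script cannot set this):

# typically /etc/slurm/slurm.conf - requires cluster-administrator rights
LaunchParameters=lustre_no_flush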

Consult the Fair Usage Policies applicable to your "supercomputer" with its Technical Support Dept., so as to get the valid ceiling details.

Next, refactor your code not to pre_dispatch that many tasks, if you still want to use the strategy of single-node process-replication instead of another, less RAM-blocking, more efficient HPC computing strategy.
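A minimal sketch of such a refactoring, assuming the same hypothetical compute_piece() and pieces as above:

from joblib import Parallel, delayed

# fewer workers than cores, when RAM (not CPU) is the binding constraint,
# and only one batch dispatched per worker instead of '2*n_jobs'
results = Parallel(n_jobs=8, pre_dispatch='n_jobs', batch_size=1)(
    delayed(compute_piece)(piece) for piece in pieces
)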

夜夜流光相皎洁 2025-01-26 09:27:31


One reason for the error ("The exit codes of the workers are {SIGBUS(-7)}") is that joblib can do only one level of process-based parallelization: situations where one starts several parallel workers and each of them starts several more parallel workers are forbidden (they lead to SIGBUS crashes).
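A minimal sketch of keeping process-based parallelism to a single level - the inner level uses joblib's threading backend instead of a second layer of loky processes (process_subpiece, split and pieces are hypothetical placeholders):

from joblib import Parallel, delayed

def process_piece(piece):
    # inner level: threads (or plain sequential code), NOT a second
    # layer of loky worker processes
    return Parallel(n_jobs=4, backend='threading')(
        delayed(process_subpiece)(sub) for sub in split(piece)
    )

# outer level: the only process-based (loky) layer
results = Parallel(n_jobs=-1, backend='loky')(
    delayed(process_piece)(piece) for piece in pieces
)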
