Python: _pickle.PicklingError: Can't pickle <function <lambda>>
I am running Python 3.9.1.
Note: I know there are questions with similar titles, but they are embedded in complicated code, which makes the underlying problem hard to understand. This is a simple reproduction of the problem that I think others will find easier to follow.
Edit: My code uses Pool(processes=64), but most people will have to change this depending on how many cores their machine has. If it takes too long, change listLen to a smaller number.
I'm trying to learn about multiprocessing in order to solve a problem at work. I have a list of arrays and I need to do pairwise comparisons of the arrays. For simplicity, though, I've recreated the gist of the problem using plain integers instead of arrays and an addition function instead of a call to some complicated comparison function. With the code below, I run into the error in the title:
import time
from multiprocessing import Pool
import itertools
import random

def add_nums(a, b):
    return(a + b)

if __name__ == "__main__":
    listLen = 1000
    # Create a list of random numbers to do pairwise additions of
    myList = [random.choice(range(1000)) for i in range(listLen)]
    # Create a list of all pairwise combinations of the indices of the list
    index_combns = [*itertools.combinations(range(len(myList)), 2)]
    # Do the pairwise operation without multiprocessing
    start_time = time.time()
    sums_no_mp = [*map(lambda x: add_nums(myList[x[0]], myList[x[1]]), index_combns)]
    end_time = time.time() - start_time
    print(f"Process took {end_time} seconds with no MP")
    # Do the pairwise operations with multiprocessing
    start_time = time.time()
    pool = Pool(processes=64)
    sums_mp = pool.map(lambda x: add_nums(myList[x[0]], myList[x[1]]), index_combns)
    end_time = time.time() - start_time
    print(f"Process took {end_time} seconds with MP")
    pool.close()
    pool.join()
3 Answers
I'm not exactly sure why (though a thorough read through the multiprocessing docs would probably have an answer), but there's a pickling process involved in python's multiprocessing where child processes are passed certain things. While I would have expected the lambdas to be inherited and not passed via pickle-ing, I guess that's not what's happening.
Following the discussion in the comments, consider something like this approach:
It uses multiprocessing.shared_memory to share a single numpy (N+1)-dimensional array (instead of a list of N-dimensional arrays) between the host process and child processes.
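A minimal sketch of this kind of shared-memory setup, assuming a worker called add_rows, a toy 2-D array, and a small pool (all illustrative assumptions, not the answer's original code), might look like this:

import itertools
import numpy as np
from multiprocessing import Pool, shared_memory

def add_rows(shm_name, shape, dtype, i, j):
    # Attach to the block that the parent created; nothing large gets pickled.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result = float((data[i] + data[j]).sum())   # stand-in for the real comparison
    shm.close()
    return result

if __name__ == "__main__":
    rows, cols = 100, 8
    src = np.random.rand(rows, cols)

    # One (N+1)-dimensional array in shared memory instead of a list of N-dim arrays.
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    shared = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    shared[:] = src[:]

    try:
        with Pool(processes=4) as pool:
            jobs = [pool.apply_async(add_rows, (shm.name, src.shape, src.dtype, i, j))
                    for i, j in itertools.combinations(range(rows), 2)]
            sums = [job.get() for job in jobs]
        print(len(sums))
    finally:
        shm.close()
        shm.unlink()

The same worker could also be driven by pool.map, if the argument tuples were built up front, as the last bullet below describes.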
Other things that are different but don't matter:
- Pool is used as a context manager to prevent having to explicitly close and join it.
- Timer is a simple context manager to time blocks of code.
- pool.map replaced with calls to pool.apply_async; pool.map would be fine too, but you'd want to build the argument list before the .map call and unpack it in the worker function, e.g. as in the sketch below.
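A hedged, runnable sketch of that pool.map variant on the same shared-memory setup (the worker name and the argument layout are assumptions made here for illustration):

import itertools
import numpy as np
from multiprocessing import Pool, shared_memory

def add_rows_packed(args):
    # pool.map hands the worker a single object per task, so unpack the pre-built tuple here
    shm_name, shape, dtype, i, j = args
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result = float((data[i] + data[j]).sum())
    shm.close()
    return result

if __name__ == "__main__":
    rows, cols = 100, 8
    src = np.random.rand(rows, cols)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    shared = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    shared[:] = src[:]
    try:
        # Build the argument list before the .map call: one tuple per task
        arg_list = [(shm.name, src.shape, src.dtype, i, j)
                    for i, j in itertools.combinations(range(rows), 2)]
        with Pool(processes=4) as pool:
            sums = pool.map(add_rows_packed, arg_list)
        print(len(sums))
    finally:
        shm.close()
        shm.unlink()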
Python cannot pickle lambda functions. Instead, you should define a named function and pass the function name. Here is how you may approach this:
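A hedged reconstruction of the idea (the signature of foo and the pool size here are assumptions, not the answer's exact code):

import itertools
import random
from multiprocessing import Pool

def foo(pair):
    # A named, module-level function can be pickled, unlike a lambda
    a, b = pair
    return a + b

if __name__ == "__main__":
    listLen = 1000
    myList = [random.choice(range(1000)) for _ in range(listLen)]
    # Combinations of the values themselves, so the workers never need myList
    index_combns = [*itertools.combinations(myList, 2)]
    with Pool(processes=4) as pool:
        sums_mp = pool.map(foo, index_combns)
    print(sums_mp[:10])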
I modified index_combns to also extract the value from myList in place, because myList will not be accessible from foo and passing in multiple copies of myList will increase the space complexity of your script. Running this prints:
A :
the single most important piece of experience to learn
is how BIG the COSTS-of-( process )-INSTANTIATION(s) are;
all other add-on overhead costs
( still by no means negligible, the more so as the scale of the problem grows )
are details in comparison to this immense & principal one.
Before the answer is read through and completely understood, here is a Live GUI-interactive simulator of how much we will have to pay to start using more than 1 stream of process-flow orchestration ( costs vary - lower for threads, larger for MPI-based distributed operations, highest for multiprocessing-processes, as used in the Python Interpreter, where N-many copies of the main Python Interpreter process get copied first ( RAM-allocated and O/S-scheduler spawned ) - as of 2022-Q2, issues are still reported when less expensive backends try to avoid this cost, yet run into deadlocks on wrongly-shared, ill-copied, or forgotten-to-copy, already-blocking MUTEX-es and similar internalities - so even the full copy is not always safe in 2022 - not having met them in person does not mean they do not still exist, as documented by countless professionals ( a story about a pool of sharks is a good place to start from ) ).
Inventory of problems :
a ) pickling lambdas ( and many other SER/DES blockers )
is easy - it is enough to conda install dill and import dill as pickle, as dill has been able to pickle lambdas for years - credit to @MikeMcKearns - and your code does not need to refactor its use of the plain pickle.dumps()-call interface. pathos.multiprocess defaults to using dill internally, so this long-known multiprocessing SER/DES weakness gets avoided. A minimal sketch of that route follows.
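A hedged illustration ( assuming the dill and multiprocess packages are installed - multiprocess being the dill-backed fork of the standard multiprocessing that pathos builds upon; the pool size and data are illustrative ):

import itertools
import random
import dill                                  # pip install dill multiprocess
from multiprocess import Pool                # drop-in Pool that serialises with dill

# dill, unlike the standard pickle, round-trips a lambda without complaint
square = dill.loads(dill.dumps(lambda x: x * x))
assert square(7) == 49

if __name__ == "__main__":
    myList = [random.choice(range(1000)) for _ in range(1000)]
    pairs = [*itertools.combinations(myList, 2)]
    with Pool(processes=4) as pool:
        # the lambda itself now survives the SER/DES trip to the worker processes
        sums_mp = pool.map(lambda p: p[0] + p[1], pairs)
    print(len(sums_mp))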
b ) performance killers
- multiprocessing.Pool.map() is rather an End-to-End performance anti-pattern here - The Costs..., if we stop neglecting them, show how many CPU-clocks & blocked physical-RAM-I/O transfers are paid for so many process-instantiations ( 60+ ), which finally "occupy" almost all physical CPU-cores, yet leave almost zero space for the indeed high-performance numpy-native multicore-computing of the core-problem ( for which the ultimate performance was expected to be boosted, wasn't it? )
- just move the p-slider in the simulator to anything less than 100% ( having no [SERIAL]-part of the problem execution, which is nice in theory, yet never doable in practice - even the program launch is pure-[SERIAL], by design )
- just move the Overhead-slider in the simulator to anything above a plain zero ( expressing the relative add-on cost of spawning one of the NCPUcores processes, as a number of percent, relative to the number of instructions in the [PARALLEL]-section of the work ) - mathematically "dense" work has many such "useful" instructions and may, supposing no other performance killers jump out of the box, spend some reasonable amount of "add-on" costs to spawn some amount of concurrent or parallel operations ( the actual number depends only on the actual economy of costs, not on how many CPU-cores are present, still less on our "wishes" or scholastic or even worse copy/paste-"advice" ). On the contrary, mathematically "shallow" work almost always shows "speedups" << 1 ( immense slow-downs ), as there is almost no chance to justify the known add-on costs ( paid on process-instantiations and on data SER/xfer/DES moves, in (params) and back (results) )
- next move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case when the actual process-spawning overhead-( time lost )-costs are still not more than <= 1 % of all the computing-related instructions that will be performed for the "useful" part of the work inside such a spawned process-instance. So even such a 1:100 proportion factor ( doing 100x more "useful" work than the CPU-time lost on arranging that many copies and making the O/S-scheduler orchestrate their concurrent execution inside the available system Virtual-Memory ) already shows all the warnings in the graph of Speedup-degradation - just play a bit with the Overhead-slider in the simulator before touching the others ( a small numeric sketch of this overhead-aware estimate follows this list )...
- avoid the sin of "sharing" ( if performance is the goal ) - again, the costs of operating such orchestration among several, now independent, Python Interpreter processes add further add-on costs, never justified by any gained performance, as the fight over shared resources ( CPU-cores, physical-RAM-I/O channels ) only devastates CPU-core-cache re-use hit-rates and multiplies O/S-scheduler-operated process context-switches, and all this further downgrades the resulting End-to-End performance ( which is something we do not want, do we? )
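A small numeric sketch of that overhead-aware reasoning ( the formula and the numbers are illustrative assumptions made here, not taken from the simulator ):

# Overhead-strict Amdahl-style estimate: serial part + split parallel part
# + a per-process setup cost that grows with the number of workers.
def estimated_speedup(p, n_workers, overhead_per_process):
    serial = 1.0 - p                      # fraction that never parallelises
    parallel = p / n_workers              # ideal split of the parallel fraction
    overhead = n_workers * overhead_per_process
    return 1.0 / (serial + parallel + overhead)

if __name__ == "__main__":
    for n in (1, 2, 4, 16, 64):
        # even a ~1 % per-process setup cost makes 64 workers slower than 16 here
        print(n, round(estimated_speedup(p=0.95, n_workers=n, overhead_per_process=0.01), 2))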
c ) boosting performance
- respect facts about the actual costs of any kind of computing operation
- avoid "shallow"-computing steps,
- maximise the work done per unit of whatever gets so expensively moved into a set of distributed processes ( if a need for them remains ),
- avoid all overhead-adding operations ( like adding local temporary variables, where inline operations permit storing partial results in place )
and
- try to go for the ultra-performant, cache-line friendly & optimised, native numpy-vectorised multicore & striding-tricks capabilities, which are not available once the CPU-cores are pre-overloaded by scheduling so many (~60) Python Interpreter process copies, each one trying to call numpy-code, thus leaving no free cores to actually place such high-performance, cache-reuse-friendly vectorised computing onto ( that is where we gain or lose most of the performance - not in slow-running serial iterators, and not in spawning 60+ process-based full copies of the "__main__" Python Interpreter, before doing a single piece of useful work on our great data, expensively RAM-allocated and physically copied 60+ times therein ) - a vectorised sketch of the toy problem closes this answer
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform constraints; ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
no other magic wand available here
and once already working on the bleeding edge of performance, set gc.disable() before you spawn the Python Interpreter into N-many replicated copies, not to wait for spontaneous garbage-collections when going for ultimate performance
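And, to make the vectorisation point from c ) concrete on the question's toy problem, a minimal numpy-only sketch ( illustrative; no processes are spawned at all ):

import numpy as np

listLen = 1000
values = np.random.randint(0, 1000, size=listLen)

# Upper-triangle index pairs enumerate every unordered (i, j) combination exactly once
i, j = np.triu_indices(listLen, k=1)
pairwise_sums = values[i] + values[j]          # all pairwise sums in one vectorised shot

print(pairwise_sums.shape)                     # (listLen * (listLen - 1) // 2,)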