Can I access/save Python multiprocessing results before deserialization, avoiding re-serialization?

Posted 2025-02-09 11:43:07


I am using the Python multiprocessing package to run several iterations of a simulation in parallel.

pool = multiprocessing.Pool()

result = pool.map(runSimulation, args)

Each simulation returns a large dictionary of data. Given the large number of iterations I have to run, the final output contained in result is huge. Ideally, I'd pickle such a large data set before saving it to disk. However, my understanding of the multiprocessing module is that the pool keeps an out-queue (_outqueue) which holds the serialized return value of each task. Later, a result-handler thread (_result_handler) deserializes these values and returns them to pool.map().

My question is this: can I access and save the serialized version of these results before it is deserialized by _result_handler, thus avoiding multiple rounds of serializing/deserializing?
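
For context, here is a minimal sketch of the workflow described above (run the simulations with pool.map, then pickle the collected results); the body of runSimulation and the args list are placeholders:

import multiprocessing
import pickle

def runSimulation(arg):
    # Placeholder: the real simulation returns a large dictionary of data.
    return {f"metric_{i}": i * arg for i in range(100_000)}

if __name__ == "__main__":
    args = range(8)  # placeholder argument list
    with multiprocessing.Pool() as pool:
        result = pool.map(runSimulation, args)

    # Saving the combined output re-pickles everything that the workers
    # already pickled once to send it back to this process.
    with open("results.pkl", "wb") as f:
        pickle.dump(result, f)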


Comments (1)

内心荒芜 2025-02-16 11:43:08


Even if you could intercept those serialized results, you would then depend on the serialization format multiprocessing happens to use internally. Instead, serialize the dictionary yourself inside the worker and return the serialized bytes. The transfer then pickles and unpickles a single bytes object rather than a dictionary with lots of objects, which is much more efficient.

Test results with a dict of a million items, comparing your current way of (re)serializing after the transfer versus serializing before the transfer. The latter is over three times faster (and probably uses a lot less memory, too), and barely takes longer than just serializing the dictionary:

1185 ms  after
 351 ms  before
 337 ms  just_pickle

Code (Try it online!):

from timeit import timeit
import pickle

def transfer(obj):
    # Simulates what multiprocessing does to send a result back:
    # pickle it in the worker, unpickle it in the parent process.
    return pickle.loads(pickle.dumps(obj))

def just_pickle():
    # Baseline: the unavoidable cost of pickling the dict once for saving.
    return pickle.dumps(dct)

def after():
    # Current approach: transfer the dict, then pickle it again to save it.
    return pickle.dumps(transfer(dct))

def before():
    # Proposed approach: pickle the dict first, then transfer only the bytes.
    return transfer(pickle.dumps(dct))

dct = {str(i): i for i in range(1000000)}

# Sanity check: both approaches yield the same dictionary.
print(pickle.loads(after()) == pickle.loads(before()))

for func in [after, before, just_pickle] * 3:
    t = timeit(func, number=1)
    print('%4d ms ' % (t * 1e3), func.__name__)
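
Applied to the original pool.map setup, a possible sketch of this idea follows; runSimulation's body and the file name are placeholders. The key point is that each worker returns pickle.dumps(...), so only a bytes object crosses the process boundary, and the parent writes those bytes to disk without ever re-pickling the dictionaries:

import multiprocessing
import pickle

def runSimulation(arg):
    # Placeholder for the real simulation; builds the large result dict.
    data = {f"metric_{i}": i * arg for i in range(100_000)}
    # Serialize inside the worker, so only a single bytes object is
    # pickled/unpickled again when the result is sent back to the parent.
    return pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

if __name__ == "__main__":
    args = range(8)  # placeholder argument list
    with multiprocessing.Pool() as pool:
        blobs = pool.map(runSimulation, args)  # a list of bytes objects

    # Write the pre-pickled results as-is; pickle streams are
    # self-delimiting, so they can simply be concatenated in one file.
    with open("results.pkl", "wb") as f:
        for blob in blobs:
            f.write(blob)

    # Read them back later, one pickle at a time.
    results = []
    with open("results.pkl", "rb") as f:
        while True:
            try:
                results.append(pickle.load(f))
            except EOFError:
                break

The pool still pickles each returned bytes object once to get it back to the parent, but as the before timing above shows, that costs barely more than pickling the dictionary a single time.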