How can a large Python data structure be unpickled more efficiently?

I have a list of ~1.7 million "token" objects, along with a list of ~130,000 "structure" objects which reference the token objects and group them into, well, structures. It's an ~800MB memory footprint, on a good day.

I'm using __slots__ to keep my memory footprint down, so my __getstate__ returns a tuple of serializable values, which __setstate__ bungs back into place. I'm also not pickling all the instance data, just 5 items for tokens, 7-9 items for structures, all strings or integers.
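Roughly, the pattern looks like this (a minimal sketch only; the Token class and field names below are illustrative placeholders, not the real attributes):

    class Token(object):
        __slots__ = ('text', 'line', 'col', 'kind', 'value')

        def __init__(self, text, line, col, kind, value):
            self.text = text
            self.line = line
            self.col = col
            self.kind = kind
            self.value = value

        def __getstate__(self):
            # Slotted instances have no __dict__, so hand pickle a plain
            # tuple of the serializable fields.
            return (self.text, self.line, self.col, self.kind, self.value)

        def __setstate__(self, state):
            # Stuff the tuple back into the slots when unpickling.
            self.text, self.line, self.col, self.kind, self.value = state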

Of course, I'm using cPickle, and HIGHEST_PROTOCOL, which happens to be 2 (python 2.6). The resulting pickle file is ~120MB.
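The dump/load step is roughly the following (Python 2.6 era; the file name and helper functions are placeholders):

    import cPickle as pickle

    def save(data, path='tokens.pkl'):
        with open(path, 'wb') as f:
            # HIGHEST_PROTOCOL is 2 on Python 2.6: a compact binary format.
            pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

    def load(path='tokens.pkl'):
        with open(path, 'rb') as f:
            return pickle.load(f)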

On my development machine, it takes ~2 minutes to unpickle the pickle. I'd like to make this faster. What methods might be available to me, beyond faster hardware and what I'm already doing?

醉生梦死 2024-10-25 20:21:19

Pickle is not the best method for storing large amounts of similar data. It can be slow for large data sets, and more importantly, it is very fragile: changing around your source can easily break all existing datasets. (I would recommend reading what pickle at its heart actually is: a bunch of bytecode expressions. It will frighten you into considering other means of data storage/retrieval.)
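If you want to see what I mean, the standard library's pickletools module will disassemble a pickle into its opcode stream; a quick sketch (any small object will do):

    import cPickle as pickle
    import pickletools

    # Disassemble a small pickle to see the opcode stream it really is.
    blob = pickle.dumps({'token': 42}, pickle.HIGHEST_PROTOCOL)
    pickletools.dis(blob)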

You should look into using PyTables, which uses HDF5 (cross-platform and everything) to store arbitrarily large amounts of data. You don't even have to load everything out of the file into memory at once; you can access it piecewise. The structure you're describing sounds like it would fit very well into a "table" object, which has a set field structure (comprised of fixed-length strings, integers, small Numpy arrays, etc.) and can hold large amounts of data very efficiently. For storing metadata, I'd recommend using the ._v_attrs attribute of your tables.
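A rough sketch of what a token table could look like (the field names and sizes are illustrative guesses, not a drop-in schema; this uses the current PyTables spelling open_file/create_table, while the Python 2.6-era releases spelled these openFile/createTable):

    import tables

    class TokenRow(tables.IsDescription):
        # Fixed-length fields; names and sizes are guesses for illustration.
        text = tables.StringCol(64)
        kind = tables.Int32Col()
        line = tables.Int32Col()

    h5 = tables.open_file('tokens.h5', mode='w')
    table = h5.create_table('/', 'tokens', TokenRow, 'token table')

    row = table.row
    for text, kind, line in [('foo', 1, 10), ('bar', 2, 11)]:  # stand-in data
        row['text'] = text
        row['kind'] = kind
        row['line'] = line
        row.append()
    table.flush()

    # Metadata goes on the table's attribute set.
    table._v_attrs.source = 'example corpus'
    h5.close()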
