I have a list of ~1.7 million "token" objects, along with a list of ~130,000 "structure" objects which reference the token objects and group them into, well, structures. It's an ~800MB memory footprint, on a good day.
I'm using __slots__ to keep my memory footprint down, so my __getstate__ returns a tuple of serializable values, which __setstate__ bungs back into place. I'm also not pickling all the instance data, just 5 items for tokens, 7-9 items for structures, all strings or integers.
Of course, I'm using cPickle and HIGHEST_PROTOCOL, which happens to be 2 (Python 2.6). The resulting pickle file is ~120MB.
On my development machine, it takes ~2 minutes to unpickle the pickle. I'd like to make this faster. What methods might be available to me, beyond faster hardware and what I'm already doing?
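To make the setup concrete, here is a minimal sketch of the pattern described above (the class and field names are made up for illustration; the real code carries more state):

```python
import cPickle as pickle  # Python 2.6; plain `pickle` on Python 3

class Token(object):
    # __slots__ suppresses the per-instance __dict__ to save memory.
    __slots__ = ('text', 'kind', 'start', 'end', 'line')

    def __getstate__(self):
        # Pickle only a tuple of simple values (strings/ints)...
        return (self.text, self.kind, self.start, self.end, self.line)

    def __setstate__(self, state):
        # ...and bung them back into the slots on unpickling.
        self.text, self.kind, self.start, self.end, self.line = state

tokens = []
for i in range(3):  # stand-in for the ~1.7M real tokens
    t = Token()
    t.text, t.kind, t.start, t.end, t.line = 'tok%d' % i, 0, i, i + 3, 1
    tokens.append(t)

with open('tokens.pkl', 'wb') as f:
    pickle.dump(tokens, f, pickle.HIGHEST_PROTOCOL)  # protocol 2 on 2.6

with open('tokens.pkl', 'rb') as f:
    tokens = pickle.load(f)  # the ~2 minute step at full scale
```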
Pickle is not the best method for storing large amounts of similar data. It can be slow for large data sets, and more importantly, it is very fragile: changing around your source can easily break all existing datasets. (I would recommend reading what pickle at its heart actually is: a bunch of bytecode expressions. It will frighten you into considering other means of data storage/retrieval.)
You should look into using PyTables, which uses HDF5 (cross-platform and everything) to store arbitrarily large amounts of data. You don't even have to load everything out of a file into memory at once; you can access it piecewise. The structure you're describing sounds like it would fit very well into a "table" object, which has a set field structure (comprised of fixed-length strings, integers, small Numpy arrays, etc.) and can hold large amounts of data very efficiently. For storing metadata, I'd recommend using the ._v_attrs attribute of your tables.
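For instance, here is a sketch of how the tokens might map onto a PyTables table (the field names and sizes are made up for illustration; the camelCase calls are the PyTables 2.x API contemporary with Python 2.6, renamed open_file/create_table in PyTables 3):

```python
import tables  # PyTables

# One row description per token; fixed-width columns, no pickling involved.
class TokenRow(tables.IsDescription):
    text  = tables.StringCol(32)  # fixed-length string
    kind  = tables.Int32Col()
    start = tables.Int32Col()
    end   = tables.Int32Col()

h5 = tables.openFile('tokens.h5', mode='w')      # open_file() in PyTables 3
table = h5.createTable('/', 'tokens', TokenRow)  # create_table() in PyTables 3

row = table.row
for text, kind, start, end in [('foo', 0, 0, 3), ('bar', 1, 4, 7)]:  # stand-in data
    row['text'], row['kind'], row['start'], row['end'] = text, kind, start, end
    row.append()
table.flush()

# Metadata lives on the table's attribute set rather than in the rows.
table._v_attrs.source = 'example corpus'

# Piecewise access: only the requested slice is read from disk.
some_rows = table[0:2]

h5.close()
```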