How can I speed up unpickling of large objects if I have plenty of RAM?

Posted on 2024-08-31 10:28:51

It's taking me up to an hour to read a 1 GB NetworkX graph data structure using cPickle (it is 1 GB when stored on disk as a binary pickle file).

Note that the file quickly loads into memory. In other words, if I run:

import cPickle as pickle

f = open("bigNetworkXGraph.pickle","rb")
binary_data = f.read() # This part doesn't take long
graph = pickle.loads(binary_data) # This takes ages

How can I speed this last operation up?

Note that I have tried pickling the data using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ASCII data.

I have 128 GB of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.

Comments (8)

猫瑾少女 2024-09-07 10:28:51

I had great success in reading a ~750 MB igraph data structure (a binary pickle file) using cPickle itself. This was achieved by simply disabling the garbage collector around the pickle load call, as mentioned here.

An example snippet for your case would be something like:

import cPickle as pickle
import gc

# Disable the garbage collector while unpickling
gc.disable()
try:
    with open("bigNetworkXGraph.pickle", "rb") as f:
        graph = pickle.load(f)
finally:
    # Re-enable the garbage collector, even if loading fails
    gc.enable()

This definitely isn't the most elegant way to do it, but it reduces the time required drastically.
(For me, it went from 843.04 s to 41.28 s, around 20x faster.)
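
A reusable form of the same idea, as a minimal sketch: the context manager below (the name `gc_paused` is made up for illustration) remembers whether the collector was enabled and restores that state even if loading raises. It is written for Python 3, where the plain `pickle` module already uses the C implementation that `cPickle` provided in Python 2; the in-memory buffer only stands in for the real pickle file.

```python
import gc
import io
import pickle
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable the cyclic garbage collector (hypothetical helper)."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state; leave GC off if the caller had it off.
        if was_enabled:
            gc.enable()


# Stand-in for open("bigNetworkXGraph.pickle", "rb")
buf = io.BytesIO(pickle.dumps({"a": ["b", "c"]}, protocol=2))

with gc_paused():
    graph = pickle.load(buf)
```

Whether this helps as much as the 20x reported above will depend on how many container objects the pickle creates, so it is worth timing both variants on the real file.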

与他有关 2024-09-07 10:28:51

You're probably bound by Python object creation/allocation overhead, not the unpickling itself.
If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
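
The lazy-population idea could look roughly like the sketch below. The class name `LazyPickledDict` and the tiny adjacency example are assumptions made for illustration, not part of any library: each value is kept as its own pickled byte string and is only unpickled the first time it is accessed.

```python
import pickle


class LazyPickledDict:
    """Mapping whose values stay pickled until first access (hypothetical helper)."""

    def __init__(self, items):
        # Store each value as an independent pickle byte string.
        self._raw = {k: pickle.dumps(v, protocol=2) for k, v in items.items()}
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # Unpickle on demand; pop the raw bytes so both copies are not kept.
            self._cache[key] = pickle.loads(self._raw.pop(key))
        return self._cache[key]

    def __len__(self):
        return len(self._raw) + len(self._cache)


# Example: adjacency lists of a tiny graph, one pickled blob per node.
adjacency = LazyPickledDict({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
neighbours_of_a = adjacency["a"]  # only this entry is unpickled
```

The trade-off is that objects shared between entries are no longer shared after loading, because every entry is a separate pickle.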

雨的味道风的声音 2024-09-07 10:28:51

Why don't you try marshaling your data and storing it in RAM using memcached (for example)? Yes, it has some limitations, but as this points out, marshaling is way faster (20 to 30 times) than pickling.

Of course, you should also spend as much time as you can optimizing your data structure, in order to minimize the amount and complexity of the data you want stored.

无法言说的痛 2024-09-07 10:28:51

This is ridiculous.

I have a huge ~150MB dictionary (collections.Counter actually) that I was reading and writing using cPickle in the binary format.

Writing it took about 3 min.
I stopped reading it in at the 16 min mark, with my RAM completely choked up.

I'm now using marshal, and it takes:
write: ~3s
read: ~5s

I poked around a bit, and came across this article.
Guess I've never looked at the pickle source, but it builds an entire VM to reconstruct the dictionary?
There should be a note about performance on very large objects in the documentation IMHO.
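
For reference, a minimal sketch of the marshal round trip for a Counter, written for Python 3. It assumes the keys and values are plain built-in types: `marshal` supports only a fixed set of built-in types, so the Counter is converted to a plain dict before dumping and rebuilt after loading. The sample data is made up.

```python
import marshal
from collections import Counter

counts = Counter("abracadabra")

# Convert to a plain dict first; marshal is not meant for arbitrary classes.
payload = marshal.dumps(dict(counts))

# Rebuild the Counter after loading.
restored = Counter(marshal.loads(payload))
```

For files, `marshal.dump(obj, f)` and `marshal.load(f)` work the same way on a file opened in binary mode. Note that the Python documentation does not guarantee the marshal format to be stable across Python versions, so it suits caches better than long-term storage.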

蓝天 2024-09-07 10:28:51

I'm also trying to speed up the loading/storing of networkx graphs. I'm using the adjacency_data function to convert the graph to something serialisable, and adjacency_graph to convert it back; see for instance this code:

import pickle

from networkx.generators import fast_gnp_random_graph
from networkx.readwrite import json_graph

G = fast_gnp_random_graph(4000, 0.7)

# Convert the graph to plain dicts/lists, then pickle that
with open('/tmp/graph.pickle', 'wb') as f:
    data = json_graph.adjacency_data(G)
    pickle.dump(data, f)

# Load the plain data and rebuild the graph
with open('/tmp/graph.pickle', 'rb') as f:
    d = pickle.load(f)
    H = json_graph.adjacency_graph(d)

However, this adjacency_graph conversion method is quite slow, so time gained in pickling is probably lost on converting.

So this actually doesn't speed things up, bummer. Running this code gives the following timings:

N=1000

    0.666s ~ generating
    0.790s ~ converting
    0.237s ~ storing
    0.295s ~ loading
    1.152s ~ converting

N=2000

    2.761s ~ generating
    3.282s ~ converting
    1.068s ~ storing
    1.105s ~ loading
    4.941s ~ converting

N=3000

    6.377s ~ generating
    7.644s ~ converting
    2.464s ~ storing
    2.393s ~ loading
    12.219s ~ converting

N=4000

    12.458s ~ generating
    19.025s ~ converting
    8.825s ~ storing
    8.921s ~ loading
    27.601s ~ converting

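This superlinear growth is probably because the graph gets many more edges as N grows: at a fixed edge probability (0.7 here), the number of edges grows roughly quadratically with N rather than exponentially. Here is a test gist, in case you want to try it yourself: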

https://gist.github.com/wires/5918834712a64297d7d1
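
To put numbers on the edge counts behind those timings: for an undirected G(n, p) random graph, which is what fast_gnp_random_graph generates by default, the expected number of edges is p * n * (n - 1) / 2. The helper name `expected_edges` below is made up for this sketch.

```python
from fractions import Fraction


def expected_edges(n, p):
    """Expected edge count of an undirected G(n, p) random graph."""
    return p * n * (n - 1) / 2


p = Fraction(7, 10)  # edge probability 0.7, kept exact
for n in (1000, 2000, 3000, 4000):
    print(n, int(expected_edges(n, p)))
```

Going from N=1000 to N=4000 multiplies the expected edge count by about 16, which is the same order as the growth in the loading and converting times above.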

酒中人 2024-09-07 10:28:51

Maybe the best thing you can do is to split the big data into smaller objects (say, smaller than 50 MB each) that can be loaded into RAM one at a time, and then recombine them.

AFAIK there's no way to split data automatically via the pickle module, so you have to do it yourself.

Anyway, another way (which is quite a bit harder) is to use a NoSQL database such as MongoDB to store your data...
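
One way to do the manual splitting, as a minimal sketch: write the data as a sequence of independent pickles in one file and read them back until the end of the file. The function names and the chunk size are assumptions chosen for illustration, and the in-memory buffer stands in for a file opened in binary mode.

```python
import io
import pickle


def dump_in_chunks(items, fileobj, chunk_size):
    """Write `items` (a list) as a sequence of independent pickles."""
    for start in range(0, len(items), chunk_size):
        pickle.dump(items[start:start + chunk_size], fileobj, protocol=2)


def load_chunks(fileobj):
    """Yield each pickled chunk until the end of the file is reached."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return


# Example: an edge list split into chunks of four edges.
edges = [(i, i + 1) for i in range(10)]
buf = io.BytesIO()
dump_in_chunks(edges, buf, chunk_size=4)
buf.seek(0)
recombined = [edge for chunk in load_chunks(buf) for edge in chunk]
```

Because each chunk is a separate pickle, an object referenced from two chunks comes back as two copies, so this fits flat data such as edge lists better than deeply linked structures.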

姜生凉生 2024-09-07 10:28:51

In general, I've found that if possible, when saving large objects to disk in python, it's much more efficient to use numpy ndarrays or scipy.sparse matrices.

Thus for huge graphs like the one in the example, I could convert the graph to a scipy sparse matrix (networkx has a function that does this, and it's not hard to write one), and then save that sparse matrix in binary format.
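
A minimal sketch of the array-based approach, under stated assumptions: instead of calling the networkx or scipy conversion functions, it stores the graph as a plain numpy edge array (essentially the coordinate data of a sparse adjacency matrix) and saves it in numpy's binary format. The tiny graph and the in-memory buffer are made up for illustration.

```python
import io

import numpy as np

# A tiny undirected graph as an edge list; each row is (source, target).
edges = np.array([[0, 1], [0, 2], [1, 2], [2, 3]], dtype=np.int64)

buf = io.BytesIO()   # stands in for a file opened with open(path, "wb")
np.save(buf, edges)  # raw binary .npy format, no per-object pickling
buf.seek(0)
loaded = np.load(buf)

# Rebuild a plain adjacency mapping from the array.
adjacency = {}
for source, target in loaded.tolist():
    adjacency.setdefault(source, []).append(target)
    adjacency.setdefault(target, []).append(source)
```

Loading is fast because the array is read back as one contiguous block; the cost moves to rebuilding the graph object, so it is worth timing that step too.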

疏忽 2024-09-07 10:28:51

Why don't you use pickle.load?

f = open('fname', 'rb')
graph = pickle.load(f)