Fastest and most efficient way to save and load a large dict
I have a problem. I have a huge dict. I want to save and load this huge dict, but unfortunately I got a MemoryError. The dict should not be too big; what is read out of the database is around 4GB. I would now like to save this dict and read it back in.
However, it should be efficient (not consume much more memory) and not take too long.
What options are there at the moment? I can't get any further with pickle; I get a memory error. I have 200GB of free disk space left.
I looked at Fastest way to save and load a large dictionary in Python and some other questions and blogs.
import os
import pickle
from pathlib import Path

def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
    Path(path).mkdir(parents=True, exist_ok=True)
    pickle.dump(file, open(os.path.join(path, str(filename + '.pickle')), "wb"))

save_file_as_pickle(dict, "dict")
[OUT]
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<timed eval> in <module>
~\AppData\Local\Temp/ipykernel_1532/54965140.py in save_file_as_pickle(file, filename, path)
1 def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
2 Path(path).mkdir(parents=True, exist_ok=True)
----> 3 pickle.dump( file, open( os.path.join(path, str(filename+'.pickle')), "wb" ))
MemoryError:
What worked, but took 1 hour and used 26GB of disk space:
import json

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(dict, f, ensure_ascii=False, indent=4)
I looked up how big my dict is in bytes.
I came across this question, How to know bytes size of python object like arrays and dictionaries? - The simple way, and it shows that the dict is only 8448728 bytes.
import sys
sys.getsizeof(dict)
[OUT] 8448728
What my data looks like (example)
{
    '_key': '1',
    'group': 'test',
    'data': {},
    'type': '',
    'code': '007',
    'conType': '1',
    'flag': None,
    'createdAt': '2021',
    'currency': 'EUR',
    'detail': {
        'selector': {
            'number': '12312',
            'isTrue': True,
            'requirements': [{
                'type': 'customer',
                'requirement': '1'}]
        }
    },
    'identCode': [],
}
5 Answers
The memory error occurs when your RAM (not your hard disk filesystem) cannot hold the serialized form of the dict data. Serialization has to build, in RAM, all kinds of metadata about the keys and values, track and de-duplicate referenced objects, and record the properties and attributes of data types (especially database types that are not built-in Python types), all before a single byte is written to the file. Since json produced 26GB just for the data values, I'd have to assume all the metadata added on top of that increases the memory size of the serialized form.
Compression doesn't help, since the serialized data must exist in uncompressed form before any compression is done. It only saves disk space, not RAM.
JSON may have worked because it streams the data out as it is read, instead of converting everything to JSON in memory. Or it could be that the JSON form, without all the extraneous metadata, fits in your RAM just fine.
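To make that streaming behaviour explicit, one option is to serialize one top-level entry at a time instead of handing the whole dict to a single json.dump call. A minimal sketch, assuming the top level is a dict of JSON-serializable records (the records name and file path are placeholders, not from the original answer):

import json

def dump_incrementally(records, path="data_stream.json"):
    # Serialize one top-level entry at a time, so only a single value
    # ever has to exist in its JSON text form in memory.
    with open(path, "w", encoding="utf-8") as f:
        for key, value in records.items():
            f.write(json.dumps({key: value}, ensure_ascii=False))
            f.write("\n")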
If you want to optimize and solve this without relying on JSON, here are a couple of options:
A hardware solution is of course to increase your RAM and optionally your hard disk.
Another solution is to try this on Linux, which tends to have better memory optimization than Windows.
There are two ways to make the pickling more performant: use gzip to generate a compressed output file, and disable the garbage collector to speed things up. Give this a try:
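The original snippet isn't shown here; below is a minimal sketch of what those two tweaks could look like together (function names and the file path are placeholders):

import gc
import gzip
import pickle

def save_pickle_gz(obj, path="dict.pickle.gz"):
    # Disabling the garbage collector skips costly cycle-detection passes
    # while pickle walks millions of small objects.
    gc.disable()
    try:
        with gzip.open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    finally:
        gc.enable()

def load_pickle_gz(path="dict.pickle.gz"):
    gc.disable()
    try:
        with gzip.open(path, "rb") as f:
            return pickle.load(f)
    finally:
        gc.enable()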
I would consider trying out some new formats although I am not 100% sure that they are better.
stack overflow answers
For HDF5, I would want to try out a dict-to-hdf5 library to see if it works.
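The answer points to a dedicated dict-to-hdf5 library; as a rough illustration of the idea only, here is a sketch using h5py directly (h5py assumed installed, the helper is hypothetical, and it only handles dicts of simple values, not lists of dicts):

import h5py

def dict_to_hdf5(group, data):
    # Nested dicts become HDF5 groups; simple leaf values become datasets.
    for key, value in data.items():
        if isinstance(value, dict):
            dict_to_hdf5(group.create_group(key), value)
        elif value is not None:
            group[key] = value  # strings, numbers, booleans

with h5py.File("data.h5", "w") as f:
    dict_to_hdf5(f, {"_key": "1", "code": "007", "detail": {"number": "12312"}})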
If nothing else works you might consider splitting the dataset and saving it in chunks. You can use threading, or you can rewrite the code below to run serially. I assumed that your dictionary is a list of dictionaries; if it is a dictionary of dictionaries you need to adjust the code accordingly. Also note that this example needs to be adjusted, as, depending on how you choose the step size, the last entries might not be saved or loaded.
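The answer's original code isn't reproduced here; the following is a minimal serial sketch of the chunking idea (names are placeholders). Because it slices with range(), the trailing partial chunk is included as well:

import pickle

def save_in_chunks(records, step=100_000, prefix="chunk"):
    # Pickle the list of dicts in slices so no single dump has to hold everything.
    for i in range(0, len(records), step):
        with open(f"{prefix}_{i // step}.pickle", "wb") as f:
            pickle.dump(records[i:i + step], f, protocol=pickle.HIGHEST_PROTOCOL)

def load_chunks(n_chunks, prefix="chunk"):
    records = []
    for i in range(n_chunks):
        with open(f"{prefix}_{i}.pickle", "rb") as f:
            records.extend(pickle.load(f))
    return records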
TL;DR
The main issue here is the lack of streaming-like data format.
I recommend reading and writing the jsonl format, but keep working with your regular dict. Try the two options below:
Full details below:
JSON Lines Format
The idea is to provide a format as close as possible to JSON, while staying splittable.
This allows line-by-line, or block-of-lines by block, multiprocessing, locally or distributed.
(A common big-data practice might be storing on HDFS and processing with Spark, for example.)
It goes nicely with gzip compression, which is itself split-friendly, allowing for sequential reads and writes. We'll wrap the read and write so that the application is agnostic to it and can still deal with the common dict.
A data simulator
I created 1M dict entries from your sample, with varying keys, currency and year (to challenge the gzip compression a bit).
I used a MacBook Pro M1.
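The exact simulator isn't shown here; a possible shape for it, assuming entries derived from the sample record with varying keys, currency and year (names are placeholders):

import copy
import random

SAMPLE = {
    '_key': '1', 'group': 'test', 'code': '007', 'currency': 'EUR',
    'createdAt': '2021', 'detail': {'selector': {'number': '12312'}},
}

def make_entries(n=1_000_000):
    # Yield n variations of the sample so the values are not all identical
    # (which would make the gzip comparison too easy).
    currencies = ['EUR', 'USD', 'GBP', 'CHF']
    for i in range(n):
        entry = copy.deepcopy(SAMPLE)
        entry['_key'] = str(i)
        entry['currency'] = random.choice(currencies)
        entry['createdAt'] = str(random.randint(2000, 2021))
        yield entry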
Option #1 - gzip file api
For the 1M dataset it took ~10s to write and ~6s to read again.
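The original wrappers aren't reproduced here; a minimal sketch of the gzip-file approach, writing and reading one JSON document per line (function names and the path are placeholders):

import gzip
import json

def write_jsonl_gz(entries, path="data.jsonl.gz"):
    # One JSON document per line, compressed on the fly.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_jsonl_gz(path="data.jsonl.gz"):
    # Stream the lines back without holding the whole file in memory.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

The writer accepts any iterable of dicts, so it can consume a generator like the simulator above directly, without materializing all entries in memory first.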
Option #2 - mmap api
For the 1M dataset it took ~5s to write and ~8s to read again.
The memory-mapped file is a potentially very powerful technique for making our IO more robust.
The basic idea is mapping [huge] files into the virtual-memory system, allowing partial / fast / concurrent reads and writes.
So it is good both for huge files (that can't fit into memory) and as a performance boost.
The code is more cumbersome, and not always the fastest, but you can further tweak it for your needs.
There are many details about it, so I recommend reading more on the wiki and in the Python mmap docs rather than overwhelming the answer here.
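The original mmap code isn't reproduced here. As a minimal sketch of the reading side only (writing through mmap needs the file pre-sized and is more involved), assuming an uncompressed jsonl file at a placeholder path:

import json
import mmap

def read_jsonl_mmap(path="data.jsonl"):
    # Map the file into virtual memory and iterate it line by line;
    # the OS pages the data in on demand instead of one big read().
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                yield json.loads(line)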
There was also a 3rd Option: mmap + gzip, but the write was slow and there were issues with decompressing back the lines.
I recommend pursuing this, though - this will allow for a much smaller file size on disk.