Fastest and most efficient way to save and load a large dictionary

I have a problem. I have a huge dict that I want to save and load. Unfortunately I got a MemoryError. The dict should not be too big: what is read out of the database is around 4GB. I would now like to save this dict to disk and read it back in.
However, it should be efficient (not consume much more memory) and not take too long.

What options are there at the moment? I can't get any further with pickle; I get a memory error. I have 200GB of free disk space left.

I looked at Fastest way to save and load a large dictionary in Python and some other questions and blogs.

import os
import pickle
from pathlib import Path

def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
    Path(path).mkdir(parents=True, exist_ok=True)
    pickle.dump( file, open( os.path.join(path, str(filename+'.pickle')), "wb" ))

save_file_as_pickle(dict, "dict")

[OUT]

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<timed eval> in <module>

~\AppData\Local\Temp/ipykernel_1532/54965140.py in save_file_as_pickle(file, filename, path)
      1 def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
      2     Path(path).mkdir(parents=True, exist_ok=True)
----> 3     pickle.dump( file, open( os.path.join(path, str(filename+'.pickle')), "wb" ))

MemoryError: 

What worked, but took 1 hour and used 26GB of disk space:

import json

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(dict, f, ensure_ascii=False, indent=4)
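
(A side note, not part of the original attempt: indent=4 pretty-prints the JSON and inflates the file considerably; a more compact dump, sketched below, typically produces a noticeably smaller file.)

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(dict, f, ensure_ascii=False, separators=(',', ':'))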

I looked up how big my dict is in bytes. I came across the question How to know bytes size of python object like arrays and dictionaries? - The simple way, and it shows that the dict is only 8448728 bytes.

import sys
sys.getsizeof(dict)
[OUT] 8448728
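
(Note that sys.getsizeof only reports the shallow size of the dict object itself, i.e. its hash table, not the objects it references, which is why roughly 4GB of data shows up as only ~8MB here. A rough deep-size estimate can be obtained recursively; the helper below is only a sketch, not part of the original post.)

import sys

def deep_getsizeof(obj, seen=None):
    # rough recursive size estimate; follows dicts, lists, tuples and sets
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size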

What my data looks like (example)

{
    '_key': '1',
    'group': 'test',
    'data': {},
    'type': '',
    'code': '007',
    'conType': '1',
    'flag': None,
    'createdAt': '2021',
    'currency': 'EUR',
    'detail': {
        'selector': {
            'number': '12312',
            'isTrue': True,
            'requirements': [{
                'type': 'customer',
                'requirement': '1'}]
            }
        },
    'identCode': [],
}


Comments (5)

潜移默化 2025-01-27 22:30:42


The memory error occurs when your RAM (not your hard-disk filesystem) cannot hold the serialized form of the dict data. Serialization requires storing all kinds of metadata about the keys and values, searching for and removing duplicate referenced objects, and handling the properties and attributes of the data types (especially database types that are not built-in Python types), and all of this happens in RAM before a single byte is written to the file. Since JSON produced 26GB just for the data values, I'd have to assume that all the metadata added on top of that increases the in-memory size of the serialized form even further.

Compression doesn't help, since the serialized data must exist in uncompressed form before any compression can happen. It only saves disk space, not RAM.

JSON may have worked because json.dump streams the output to the file as it goes, instead of building the entire serialized form in memory first. Or it could simply be that the JSON form, without all the extraneous metadata, fits in your RAM just fine.

If you want to optimize and solve this without using JSON, here is one approach:

  • Create a custom dict reader from the database that casts common data types to built-in Python types (or to your own lean custom types), rather than whatever the default database reader provides with its own types.
  • Create a custom serialization/deserialization method for your data that only handles what actually needs to be stored, and optionally (de)compress the data on the fly inside that method; see the sketch after this list.
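
A minimal sketch of that second bullet (my own illustration, not code from the answer): stream the dict out one (key, value) record at a time through gzip, so the full serialized form never has to exist in RAM at once. The names stream_dump and stream_load are made up for this example.

import gzip
import pickle

def stream_dump(d, path):
    # write each (key, value) pair as its own pickle record into a gzip stream
    with gzip.open(path, "wb") as f:
        for key, value in d.items():
            pickle.dump((key, value), f, protocol=pickle.HIGHEST_PROTOCOL)

def stream_load(path):
    # read records back one at a time until the stream is exhausted
    d = {}
    with gzip.open(path, "rb") as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break
            d[key] = value
    return d

# usage:
# stream_dump(my_dict, "my_dict.pkl.gz")
# my_dict = stream_load("my_dict.pkl.gz")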

A hardware solution is, of course, to increase your RAM and optionally your hard disk.

Another option is to try this on Linux, which tends to have better memory optimization than Windows.

与之呼应 2025-01-27 22:30:42


There are two ways to make the pickling more performant:

  1. disabling the Garbage Collector while pickling for a speedup
  2. using gzip to generate a compressed output file

Give this a try:

import gc
import gzip
import os
import pickle
from pathlib import Path


def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), "dict")):
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, str(filename + ".pickle"))

    gc.disable()
    try:
        gc.collect()
        with gzip.open(file_path, "wb") as fp:
            pickle.dump(file, fp)
    finally:
        gc.enable()


save_file_as_pickle(my_dict, "dict")
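
For completeness, the answer only shows the save path; a matching loader (a sketch assuming the same directory layout and a hypothetical name load_file_from_pickle) could look like this:

import gc
import gzip
import os
import pickle

def load_file_from_pickle(filename, path=os.path.join(os.getcwd(), "dict")):
    file_path = os.path.join(path, str(filename + ".pickle"))
    gc.disable()
    try:
        # gzip.open transparently decompresses the stream for pickle.load
        with gzip.open(file_path, "rb") as fp:
            return pickle.load(fp)
    finally:
        gc.enable()

my_dict = load_file_from_pickle("dict")
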
酒几许 2025-01-27 22:30:42


I would consider trying out some newer formats, although I am not 100% sure that they are better.

Stack Overflow answers

For HDF5, I would want to try out the hdfdict (dict-to-HDF5) library to see if it works.

import hdfdict
import numpy as np


d = {
    'a': np.random.randn(10),
    'b': [1, 2, 3],
    'c': 'Hallo',
    'd': np.array(['a', 'b']).astype('S'),
    'e': True,
    'f': (True, False),
}
fname = 'test_hdfdict.h5'
hdfdict.dump(d, fname)
res = hdfdict.load(fname)

print(res)
酸甜透明夹心 2025-01-27 22:30:42

If nothing else works, you might consider splitting the dataset and saving it in chunks. You can use threading, or you can rewrite the code below to run serially. I assumed that your dictionary is a list of dictionaries; if it is a dictionary of dictionaries you need to adjust the code accordingly (see the sketch after the code for one way to do that). Also note that this example needs further adjustment as well, because depending on how you choose the step size, the last entries might not be saved or loaded.

import pickle
import threading
    
# create a huge list of dicts
size = 1000000
mydict_list = [{'_key':f'{i}','group': 'test'} for i in range(size)]

# try to save it as full file just to see how large it is
#with open(f'whole_list.pkl', 'wb') as f:
#    pickle.dump(mydict_list, f)


# define function to save the smaller parts
def savedata(istart,iend):
    tmp = mydict_list[istart:iend]
    with open(f'items_{istart}_{iend}.pkl', 'wb') as f:
        pickle.dump(tmp, f)

# define function to load the smaller parts
def loaddata(istart,iend):
    with open(f'items_{istart}_{iend}.pkl', 'rb') as f:
        results[f'{istart}_{iend}'] = pickle.load(f)

# define into how many chunks you want to split the file
steps = int(size/10)

# split the list and save it using threading
results = {}
threads = {}

for i in range(0, len(mydict_list), steps):
    print(f'processing: {i, i+steps}')
    threads[i] = threading.Thread(target=savedata, args=(i, i+steps))
    threads[i].start()

for i in range(0, len(mydict_list), steps):
    threads[i].join()


# load the list using threading
threads = {}

for i in range(0, len(mydict_list), steps):
    print(f'processing: {i, i+steps}')
    threads[i] = threading.Thread(target=loaddata, args=(i, i+steps))
    threads[i].start()

for i in range(0, len(mydict_list), steps):
    threads[i].join()
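
Since the question actually has a dict rather than a list of dicts, a serial (non-threaded) variant of the same idea, splitting the dict's items into fixed-size chunks, might look like the sketch below (function and file names are made up for illustration):

import pickle
from itertools import islice

def save_dict_in_chunks(d, chunk_size=100000, prefix='dict_chunk'):
    it = iter(d.items())
    idx = 0
    while True:
        chunk = dict(islice(it, chunk_size))   # take the next chunk_size items
        if not chunk:
            break
        with open(f'{prefix}_{idx}.pkl', 'wb') as f:
            pickle.dump(chunk, f)
        idx += 1
    return idx   # number of chunk files written

def load_dict_from_chunks(n_chunks, prefix='dict_chunk'):
    d = {}
    for idx in range(n_chunks):
        with open(f'{prefix}_{idx}.pkl', 'rb') as f:
            d.update(pickle.load(f))
    return d
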
感性 2025-01-27 22:30:42


TL;DR

The main issue here is the lack of a streaming-like data format.
I recommend reading and writing the jsonl format, but keep working with your regular dict. Try these 2 options:

  1. gzip + jsonl, using the file api (faster write)
  2. plain (uncompressed) jsonl, using the mmap api (faster read)

Full details below:


JSON Lines Format

The idea is to provide a format as close as possible to JSON, while staying splittable.

This allows for line-by-line, or block-of-lines by block, multiprocessing, locally or distributed
(a common big-data practice might be storing on HDFS and processing with Spark, for example).

It goes nicely with gzip compression, which is itself split-friendly, allowing for sequential reads and writes.

We'll wrap the read and write so that the application is agnostic to the format and can still work with the plain dict.

A data simulator

I created 1M dict entries from your sample, with varying keys, currencies and years (to challenge the gzip compression a bit).
I used a MacBook Pro M1.

import json
import gzip
import mmap
import subprocess

d = {}
years = { 0: 2019, 1: 2020, 2:2121 }
currencies = { 0: 'EUR', 1: 'USD', 2: 'GBP' }
n = int(1e6)

for i in range(n):
    rem = i % 3
    d[i] = {
        '_key': str(i),
        'group': 'test',
        'data': {},
        'type': '',
        'code': '007',
        'conType': '1',
        'flag': None,
        'createdAt': years[rem],
        'currency': currencies[rem],
        'detail': {
            'selector': {
                'number': '12312',
                'isTrue': True,
                'requirements': [{
                    'type': 'customer',
                    'requirement': '1'}]
                }
            },
        'identCode': [],
    }

Option #1 - gzip file api

For the 1M dataset it took ~10s to write and ~6s to read again.

file_name_jsonl_gz = './huge_dict.jsonl.gz'

# write
with gzip.open(file_name_jsonl_gz, 'wt') as f:
    for k, v in d.items():
        f.write(f'{{"{k}":{json.dumps(v)}}}\n') # from k, v pair into a json line

# read again
_d = {}
with gzip.open(file_name_jsonl_gz, 'rt') as f:
    for line in f:
        __d = json.loads(line)
        k, v = tuple(__d.items())[0] # from a single json line into k, v pair
        _d[k] = v

# test integrity
json.dumps(d) == json.dumps(_d)

True

Option #2 - mmap api

For the 1M dataset it took ~5s to write and ~8s to read again.

A memory-mapped file is a potentially very powerful technique for making our I/O more robust.
The basic idea is to map [huge] files into the virtual-memory system, allowing partial / fast / concurrent reads and writes.
So it is good both for huge files (that can't fit into memory) and as a performance boost.

The code is more cumbersome, and not always the fastest, but you can tweak it further for your needs.
There are many details about it, so rather than overwhelm the answer here, I recommend reading more on the wiki and in the Python mmap API docs.

file_name_mmap_jsonl = './huge_dict_mmap.jsonl'
# an initial large empty file (hard to estimate in advance)
# change the size for your actual needs; use run() so the file exists before it is opened
subprocess.run(['truncate', '-s', '10G', file_name_mmap_jsonl], check=True)

pos_counter = 0
with open(file_name_mmap_jsonl, mode='r+', encoding="utf-8") as f:
    # mmap gets its file descriptor from the file object
    with mmap.mmap(fileno=f.fileno(), length=0, access=mmap.ACCESS_WRITE) as mm:
        buffer = []
        for k, v in d.items():
            s = f'{{"{k}":{json.dumps(v)}}}\n' # from k, v pair into a json line
            b = s.encode()
            buffer.append(b)
            pos_counter += len(b)

            # using buffer; not to abuse the write for every line
            # try and tweak it further
            if len(buffer) >= 100:
                mm.write(b''.join(buffer))
                buffer = []
        
        mm.write(b''.join(buffer))
        mm.flush()

# shrink to the exact needed size
subprocess.run(['truncate', '-s', str(pos_counter), file_name_mmap_jsonl], check=True)

# read again
_d = {}
with open(file_name_mmap_jsonl, mode='r+', encoding="utf-8") as f:
    with mmap.mmap(fileno=f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        while True:
            line = mm.readline()
            if len(line) == 0: # EOF
                break
            __d = json.loads(line)
            k, v = tuple(__d.items())[0] # from a json line into k, v pair
            _d[k] = v

# test integrity
json.dumps(d) == json.dumps(_d)

True

There was also a 3rd Option: mmap + gzip, but the write was slow and there were issues with decompressing back the lines.
I recommend pursuing this, though - this will allow for a much smaller file size on disk.
