Pandas/numpy: arrays of strings larger than memory

Posted on 2025-01-25 01:43:06


I have a data set that's larger than memory and I need to process it.
I am not experienced in this subject, so any direction would help.

I have mostly figured out how to load the raw data in chunks, but I need to process it and save the results, which are likely to be larger than memory as well.
I have seen that pandas, numpy and Python all support some form of memmap, but I don't exactly understand how to go about using it.
I expected an abstraction that would let me use my disk the way I use my RAM, and interface with the object saved on disk as a normal python/numpy/etc. object when using memmap... but that isn't working for me at all:

>>> import numpy as np
>>> # Create file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it and hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

This means the resize isn't being saved to disk.
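A possible reading of the traceback, offered as a sketch rather than a definitive diagnosis: the file was created without a `shape`, so the `.npy` header records `shape: None`, and that header is what fails to parse on reopening. Separately, a memmap does not own its buffer, so `np.require(..., requirements=['O'])` should hand back an in-memory copy, and resizing or flushing that copy would not touch the file. The snippet below assumes the final size is known up front; the temporary path and the variable names are illustrative only.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "bla.npy")

# Give the final shape and dtype at creation time so the .npy header is valid.
x = open_memmap(path, mode="w+", dtype=np.float64, shape=(10,))
x[:] = np.arange(10)  # writes go through the mapping to the file
x.flush()
del x  # drop the mapping before reopening

# Reopening reads shape and dtype back from the header.
y = open_memmap(path, mode="r+")
reopened_shape = y.shape
reopened_values = y.tolist()

# Requiring OWNDATA forces a copy, because the memmap's buffer belongs to the mmap.
copy = np.require(y, requirements=["O"])
shares_file = np.shares_memory(copy, y)
```

Under these assumptions, working on the object returned by `open_memmap` directly (without `np.require`) keeps the array tied to the file.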

Any suggestions?
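One direction that might fit when the size of the results is not known in advance, sketched under assumptions: append each processed chunk to a raw binary file, then map the finished file with `np.memmap`. The `process` function, the chunk loop, and the output path below are hypothetical stand-ins for the real loader and processing step, and the sketch assumes a fixed-size dtype (for strings, that would mean a fixed-width dtype chosen by you).

```python
import os
import tempfile

import numpy as np


def process(chunk):
    # Hypothetical per-chunk step: keep even values and halve them.
    return chunk[chunk % 2 == 0] * 0.5


# Hypothetical output location and dtype for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), "results.bin")
dtype = np.float64

# Append each processed chunk to the file; only one chunk is in RAM at a time.
with open(out_path, "wb") as f:
    for start in range(0, 1000, 100):  # stand-in for the real chunk loader
        chunk = np.arange(start, start + 100, dtype=dtype)
        process(chunk).astype(dtype).tofile(f)

# Map the finished file read-only; the length is inferred from the file size.
results = np.memmap(out_path, dtype=dtype, mode="r")
```

A raw file carries no header, so the dtype has to be remembered separately; that is the trade-off against the `.npy` format, which stores shape and dtype but needs the shape at creation time.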
