Pandas/NumPy: larger-than-memory array of strings

Posted on 2025-01-25 01:43:06

I have a data set that's larger than memory, and I need to process it.
I'm not experienced with this, so any pointers would help.

I've mostly figured out how to load the raw data in chunks, but I also need to process it and save the results, which are likely to be larger than memory as well.
I've seen that pandas, NumPy, and Python all support some form of memmap, but I don't understand how to work with it.
I expected memmap to be an abstraction that lets me use my disk the way I use my RAM, so that I could interact with the object saved on disk like a normal Python/NumPy object... but that isn't working for me at all.

>>> import numpy as np
>>> # Create file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it and hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None

To me, this suggests the resize isn't being saved to disk.

Any suggestions?


Comments (1)

热血少△年 2025-02-01 01:43:06

np.require() with requirements=['O'] makes an in-memory copy of the memmap array, since a memmap doesn't "own" its data, so changes to that copy never reach the file. According to the open_memmap() docs, you have to specify the shape when you open a file for writing. Otherwise, it writes "None" as the shape in the file header, which makes the later open_memmap() call (the one assigned to y) fail.
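
A minimal sketch of that fix, assuming a 1-D float64 result array of 10 elements. The shape, dtype, and temporary directory are illustrative assumptions; only the file name comes from the question.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Write into a temporary directory so the sketch does not clobber real files.
path = os.path.join(tempfile.mkdtemp(), "bla.npy")

# Pass shape and dtype up front so a valid header is written to disk.
# No np.require() here: with requirements=['O'] it would return an in-memory copy.
x = open_memmap(path, mode="w+", dtype=np.float64, shape=(10,))
x[:] = np.arange(10)   # assignments go to the mapped file
x.flush()              # push pending changes to disk
del x                  # drop the mapping

# Reopen the same file; the header now carries a valid shape.
y = open_memmap(path, mode="r+")
print(y.shape, y.dtype)
```

For string data, a fixed-width dtype such as "S32" would be needed in place of float64, because arrays of Python objects cannot be memory-mapped.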

It looks like memmap arrays don't support resizing with .resize() (see numpy issue), but there's a workaround in this SO answer if you need that.
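
One possible way to get a larger array without .resize() is to allocate a new, bigger .npy file and copy the old contents across in chunks. This is only a sketch under that assumption, and not necessarily the approach taken in the linked answer. The helper name grow_npy, the file names, and the chunk size are hypothetical.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def grow_npy(src_path, dst_path, new_len, chunk=1_000_000):
    """Hypothetical helper: copy a 1-D .npy file into a larger one, chunk by chunk."""
    src = open_memmap(src_path, mode="r")
    if new_len < src.shape[0]:
        raise ValueError("new_len must be at least the current length")
    dst = open_memmap(dst_path, mode="w+", dtype=src.dtype, shape=(new_len,))
    for start in range(0, src.shape[0], chunk):
        stop = min(start + chunk, src.shape[0])
        dst[start:stop] = src[start:stop]  # only one chunk is copied at a time
    dst.flush()
    return dst


tmp = tempfile.mkdtemp()
small = os.path.join(tmp, "small.npy")
big = os.path.join(tmp, "big.npy")

# Build a small source file first.
a = open_memmap(small, mode="w+", dtype=np.float64, shape=(10,))
a[:] = np.arange(10)
a.flush()
del a

# Grow it to 20 elements; a tiny chunk size is used here to exercise the loop.
b = grow_npy(small, big, new_len=20, chunk=4)
print(b.shape)
```

If the final size is known in advance, it is simpler to pass that shape to open_memmap() once and fill the array chunk by chunk, which avoids the copy entirely.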
