pandas/NumPy: larger-than-memory array of strings
I have a data set that's larger than memory, and I need to process it.
I'm not experienced with this, so any pointers would help.
I've mostly figured out how to load the raw data in chunks, but I also need to process it and save the results, which are likely to be larger than memory too.
I've seen that pandas, NumPy and Python all support some form of memmap,
but I don't understand exactly how to work with it.
I expected memmap to be an abstraction that lets me use my disk the way I use my RAM, so that I can work with the object saved on disk as a normal Python/NumPy/etc. object... but that isn't working for me at all.
>>> import numpy as np
>>> # Create file to store the results in
>>> x = np.require(np.lib.format.open_memmap('bla.npy', mode='w+'), requirements=['O'])
>>> # Mutate it and hopefully these changes will be reflected in the file on disk?
>>> x.resize(10, refcheck=False)
>>> x
memmap([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
>>> x[:] = list(range(10))
>>> x
memmap([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> x.flush()
>>> y = np.require(np.lib.format.open_memmap('bla.npy', mode='r+'), requirements=['O'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 870, in open_memmap
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "/root/RedditAnalysis/env/lib/python3.8/site-packages/numpy/lib/format.py", line 614, in _read_array_header
    raise ValueError(msg.format(d['shape']))
ValueError: shape is not valid: None
This suggests that the resize isn't being saved to disk.
Any suggestions?
Comments (1)
np.require() with requirements=['O'] makes a copy of the memmap array, since a memmap doesn't "own" its data. Changes to that copy live in memory and are not written back to the file.
According to the open_memmap() docs, you have to specify the shape when you open a file for writing. Otherwise, it writes "None" as the shape in the file header, which makes the open_memmap() call for the y array fail.
It looks like memmap arrays don't support resizing with .resize() (see numpy issue), but there's a workaround in this SO answer if you need that.
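As a rough illustration of the points above, here is a minimal sketch that creates the .npy file with an explicit shape and dtype, fills it chunk by chunk, and reopens it. The path, the array length and the chunk size are made-up values for the example, and the arange() call only stands in for whatever real processing produces each chunk. The last part shows that np.require() with the 'O' requirement hands back an in-memory copy rather than the mapped data.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical path and sizes, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "results.npy")
n_rows = 10
chunk_size = 4

# Pass dtype and shape up front so a valid header is written to the file.
out = open_memmap(path, mode="w+", dtype=np.float64, shape=(n_rows,))

# Fill the output one slice at a time; only the current chunk is built in RAM.
# np.arange() is a placeholder for the real per-chunk computation.
for start in range(0, n_rows, chunk_size):
    stop = min(start + chunk_size, n_rows)
    out[start:stop] = np.arange(start, stop, dtype=np.float64)

out.flush()  # ask NumPy to write pending changes to disk
del out      # drop this mapping before reopening the file

# Reopen the same file; shape and dtype now come from the .npy header.
reopened = open_memmap(path, mode="r+")
print(reopened.shape, reopened.dtype)

# A memmap does not own its data, so requiring OWNDATA forces a copy.
copy = np.require(reopened, requirements=["O"])
copy[0] = 99.0  # modifies the in-memory copy only

print(np.shares_memory(copy, reopened))  # the copy is detached from the file
print(reopened[0])                       # the mapped data is unchanged
```

If the final size is not known in advance, one option under the same assumptions is to write each batch of results to its own file with a known shape, instead of trying to grow a single mapped array.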