Gathering a series of non-uniform netcdf data into a numpy array
I am new to python, apologies if this has been asked already.
Using python and numpy, I am trying to gather data across many netcdf files into a single array by iteratively calling append().
Naively, I am trying to do something like this:
from numpy import *
from pupynere import netcdf_file

x = array([])
y = [...some list of files...]
for file in y:
    ncfile = netcdf_file(file, 'r')
    xFragment = ncfile.variables["varname"][:]
    ncfile.close()
    x = append(x, xFragment)
I know that under normal circumstances this is a bad idea, since it reallocates new memory on each append()
call. But two things discourage preallocation of x:
1) The files are not necessarily the same size along axis 0 (but should be the same size along subsequent axes), so I would need to read the array sizes from each file beforehand to precalculate the final size of x.
However...
2) From what I can tell, pupynere (and other netcdf modules) load the entire file into memory upon opening the file, rather than just a reference (such as many netcdf modules in other environments). So to preallocate, I'd have to open the files twice.
There are many (>100) large (>1GB) files, so overallocating and reshaping is not practical, from what I can tell.
My first question is whether I am missing some intelligent way to preallocate.
My second question is more serious. The above snippet works for a one-dimensional array. But if I try to load a matrix, then initialisation becomes a problem. I can append a one-dimensional array to an empty array:
append( array([]), array([1, 2, 3]) )
but I cannot append an empty array to a matrix:
append( array([]), array([ [1, 2], [3, 4] ]), axis=0)
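To make the failure concrete, the two calls above behave like this (a minimal reproduction):

```python
import numpy as np

# Appending a 1-D fragment to an empty array works (no axis given,
# so both inputs are flattened before joining):
a = np.append(np.array([]), np.array([1, 2, 3]))
print(a)  # [1. 2. 3.]

# With axis=0, the empty array is 1-D but the matrix is 2-D, so the
# dimension counts disagree and numpy raises a ValueError:
try:
    np.append(np.array([]), np.array([[1, 2], [3, 4]]), axis=0)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```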
Something like x.extend(xFragment) would work, I believe, but I don't think numpy arrays have this functionality. I could also avoid the initialisation problem by treating the first file as a special case, but I'd prefer to avoid that if there's a better way to do it.
If anyone can offer help or a suggestion, or can identify a problem with my approach, then I'd be grateful. Thanks.
Comments (1)
You can solve the two problems by first loading the arrays from the files into a list of arrays, and then using concatenate to join all the arrays. Something like this:
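A minimal sketch of that list-then-concatenate pattern, using small in-memory arrays in place of the netcdf reads (the `fragments` list stands in for the per-file `xFragment` arrays from the question):

```python
import numpy as np

# Stand-ins for the per-file arrays read via
# ncfile.variables["varname"][:]; lengths differ along axis 0
# but agree on subsequent axes, as in the question.
fragments = [
    np.arange(6).reshape(3, 2),  # "file 1": 3 rows
    np.arange(4).reshape(2, 2),  # "file 2": 2 rows
]

# A single concatenate allocates the result array once, instead of
# reallocating x on every append() call inside the loop:
x = np.concatenate(fragments, axis=0)
print(x.shape)  # (5, 2)
```

In the real loop you would append each `ncfile.variables["varname"][:]` to the list, close the file, and call `np.concatenate` once after the loop; the list only holds references until that final join.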