Gathering a series of non-uniform netcdf data into a numpy array
I am new to python, apologies if this has been asked already.
Using python and numpy, I am trying to gather data across many netcdf files into a single array by iteratively calling append().
Naively, I am trying to do something like this:
from numpy import *
from pupynere import netcdf_file

x = array([])
y = [...some list of files...]
for file in y:
    ncfile = netcdf_file(file, 'r')
    xFragment = ncfile.variables["varname"][:]
    ncfile.close()
    x = append(x, xFragment)
I know that under normal circumstances this is a bad idea, since it reallocates new memory on each append()
call. But two things discourage preallocation of x:
1) The files are not necessarily the same size along axis 0 (but should be the same size along subsequent axes), so I would need to read the array sizes from each file beforehand to precalculate the final size of x.
However...
2) From what I can tell, pupynere (and other netcdf modules) load the entire file into memory upon opening the file, rather than just a reference (such as many netcdf modules in other environments). So to preallocate, I'd have to open the files twice.
There are many (>100) large (>1GB) files, so overallocating and reshaping is not practical, from what I can tell.
My first question is whether I am missing some intelligent way to preallocate.
My second question is more serious. The above snippet works for a one-dimensional array. But if I try to load a matrix, then initialisation becomes a problem. I can append a one-dimensional array to an empty array:
append( array([]), array([1, 2, 3]) )
but I cannot append an empty array to a matrix:
append( array([]), array([ [1, 2], [3, 4] ]), axis=0)
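To make the failure concrete, the two calls above behave like this (a minimal reproduction):

```python
import numpy as np

# Appending a 1-D fragment to an empty array works (no axis given,
# so both inputs are flattened before joining):
a = np.append(np.array([]), np.array([1, 2, 3]))
print(a)  # [1. 2. 3.]

# With axis=0, the empty array is 1-D but the matrix is 2-D, so the
# dimension counts disagree and numpy raises a ValueError:
try:
    np.append(np.array([]), np.array([[1, 2], [3, 4]]), axis=0)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```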
Something like x.extend(xFragment) would work, I believe, but I don't think numpy arrays have this functionality. I could also avoid the initialisation problem by treating the first file as a special case, but I'd prefer to avoid that if there's a better way to do it.
If anyone can offer help or a suggestion, or can identify a problem with my approach, then I'd be grateful. Thanks.
Comments (1)
You can solve the two problems by first loading the arrays from the files into a list of arrays, and then using concatenate to join all the arrays. Something like this:
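A minimal sketch of that list-then-concatenate pattern, using small in-memory arrays in place of the netcdf reads (the `fragments` list stands in for the per-file `xFragment` arrays from the question):

```python
import numpy as np

# Stand-ins for the per-file arrays read via
# ncfile.variables["varname"][:]; lengths differ along axis 0
# but agree on subsequent axes, as in the question.
fragments = [
    np.arange(6).reshape(3, 2),  # "file 1": 3 rows
    np.arange(4).reshape(2, 2),  # "file 2": 2 rows
]

# A single concatenate allocates the result array once, instead of
# reallocating x on every append() call inside the loop:
x = np.concatenate(fragments, axis=0)
print(x.shape)  # (5, 2)
```

In the real loop you would append each `ncfile.variables["varname"][:]` to the list, close the file, and call `np.concatenate` once after the loop; the list only holds references until that final join.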