Fetching a specific cell value from a NetCDF file: the first fetch is slow
I am accessing a NetCDF file using the xarray Python library. The specific file that I am using is publicly available.
So, the file has several variables, and for most of them the dimensions are time: 4314, x: 700, y: 562. I am using the ET_500m variable, but the behaviour is similar for the other variables as well. The chunking is 288, 36, 44.
I am retrieving a single cell and printing the value using the following code:
import xarray as xr
ds = xr.open_dataset('./dataset_greece.nc')
print(ds.ET_500m.values[0][0][0])
According to my understanding, xarray should directly locate the position on disk of the chunk that contains the corresponding value and read it. Since a chunk should be no bigger than a couple of MBs, I would expect this to take a few seconds or even less. Instead, it takes more than 2 minutes.
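The couple-of-MBs expectation can be checked with quick arithmetic from the chunk sizes given above (the 4-byte float32 dtype is an assumption, not stated in the question):

```python
# One chunk of ET_500m holds 288 x 36 x 44 values (chunk sizes from the question).
# Assuming 4-byte float32 values, a single chunk is under 2 MB:
chunk_bytes = 288 * 36 * 44 * 4
print(chunk_bytes)  # 1824768 bytes, i.e. ~1.8 MB
```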
If, in the same script, I retrieve the value of another cell, even one located in a different chunk (e.g. print(ds.ET_500m.values[1000][500][500])), then this second retrieval takes only some milliseconds.
So my question is: what exactly causes this overhead in the first retrieval?
EDIT: I just saw that xarray's open_dataset has an optional cache parameter, which according to the manual:
If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying datastore multiple times. Defaults to True [...]
So, when I set this to False, subsequent fetches are also as slow as the first one. But my question remains: why is it so slow, given that I am only accessing a single cell? I was expecting xarray to locate the chunk on disk directly and read only a couple of MBs.
1 Answer
Rather than selecting from the .values property, subset the array first. The problem is that .values coerces the data to a NumPy array, so you're loading all of the data and then subsetting the resulting array. There's no way around this for xarray - NumPy doesn't have any concept of lazy loading, so as soon as you call .values, xarray has no option but to load (or compute) all of your data.

If the data is a dask-backed array, you could use .data rather than .values to access the dask array and use positional indexing on it, e.g. ds.ET_500m.data[0, 0, 0]. But if the data is just a lazily loaded NetCDF variable, .data will have the same load-everything pitfall described above.
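A minimal sketch of the subset-first fix. The tiny in-memory array below is a stand-in for the real file (the variable name is taken from the question, but the dim order and dtype are assumptions); with a real NetCDF-backed lazy array, the second pattern reads only the data it needs instead of the whole variable:

```python
import numpy as np
import xarray as xr

# Tiny in-memory stand-in for the real dataset (variable name from the
# question; dim order and float32 dtype are assumptions).
da = xr.DataArray(
    np.arange(24, dtype="float32").reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="ET_500m",
)

# Slow pattern from the question: .values converts the WHOLE array to
# NumPy first, then subscripts the result.
v_slow = da.values[0][0][0]

# Fix: index the DataArray first, then pull the value. On a lazily
# loaded NetCDF variable this avoids materialising everything.
v_fast = da[0, 0, 0].values

print(v_slow, v_fast)  # both 0.0
```

The same idea works with label-aware indexing, e.g. da.isel(time=0, y=0, x=0).values (dim names here are the assumed ones).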